NLP: Stanford Parser with F# (.NET)
05/02/2013
All code samples from this post are available on GitHub.
Natural Language Processing is another hot topic, much like Machine Learning. It is certainly extremely important, but the tooling is still poorly developed.
What do we have in .NET?
Let's start with what we already have.
- Abodit NLP
- SharpNLP (looks dead)
- NLP for .NET (discontinued)
- NLTK from IronPython
- Antelope Framework (shareware)
That looks really bad: it is hard to find anything really useful. Actually, we have one more option, which is IKVM.NET. With IKVM.NET we should be able to use most Java-based NLP frameworks. Let's try to import Stanford Parser into .NET.
IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:
- A Java Virtual Machine implemented in .NET
- A .NET implementation of the Java class libraries
- Tools that enable Java and .NET interoperability
Read more about what you can do with IKVM.NET.
About Stanford NLP
The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.
All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
IKVM .jar to .dll compilation
First of all, we need to download and install IKVM.NET. You can get it from SourceForge. The next step is to download Stanford Parser (the latest version at the time of writing is 2.0.4, from 2012-11-12). Now we need to compile stanford-parser.jar to a .NET assembly. You can do it with the following command:
    ikvmc.exe stanford-parser.jar
If you need a strongly named (signed) assembly, then you need two more steps: disassemble the compiled .dll to IL, and reassemble it with your key file.
    ildasm.exe /all /out=stanford-parser.il stanford-parser.dll
    ilasm.exe /dll /key=myKey.snk stanford-parser.il
No signed stanford-parser.dll is available on GitHub.
That's all! Now we are ready to start playing with Stanford Parser. Here I want to show one of the standard examples (ParserDemo.fs); the second one is available on GitHub with the other sources.
    open java.io
    open edu.stanford.nlp.ling
    open edu.stanford.nlp.``process``   // 'process' is a reserved word in F#
    open edu.stanford.nlp.trees
    open edu.stanford.nlp.parser.lexparser

    let demoAPI (lp:LexicalizedParser) =
        // This option shows parsing a list of correctly tokenized words
        let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
        let rawWords = Sentence.toCoreLabelList(sent)
        let parse = lp.apply(rawWords)
        parse.pennPrint()

        // This option shows loading and using an explicit tokenizer
        let sent2 = "This is another sentence."
        let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
        use sent2Reader = new StringReader(sent2)
        let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
        let parse = lp.apply(rawWords2)

        // Extract typed dependencies from the second parse tree
        let tlp = PennTreebankLanguagePack()
        let gsf = tlp.grammaticalStructureFactory()
        let gs = gsf.newGrammaticalStructure(parse)
        let tdl = gs.typedDependenciesCCprocessed()
        printfn "\n%O\n" tdl

        // Print the tree and its collapsed typed dependencies
        let tp = new TreePrint("penn,typedDependenciesCollapsed")
        tp.printTree(parse)

    let main fileName =
        let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
        match fileName with
        | Some(file) -> demoDP lp file   // demoDP is defined in the full sample on GitHub
        | None -> demoAPI lp
What are we doing here? First of all, main loads a LexicalizedParser from the englishPCFG.ser.gz model, which is trained on the Penn Treebank corpus. Then we create two sentences. The first is created from an already tokenized string (a string array, in this sample); we parse it and print the tree with pennPrint. The second is created from a raw string using PTBTokenizer. After parsing it, we use PennTreebankLanguagePack to build a grammatical structure and print its typed dependencies. Finally, TreePrint prints the Penn-style tree together with the collapsed typed dependencies. The resulting output can be found below.
    [|"1"|]
    Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... done [1.5 sec].
    (ROOT
      (S
        (NP (DT This))
        (VP (VBZ is)
          (NP (DT an) (JJ easy) (NN sentence)))
        (. .)))

    [nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), root(ROOT-0, sentence-4)]

    (ROOT
      (S
        (NP (DT This))
        (VP (VBZ is)
          (NP (DT another) (NN sentence)))
        (. .)))

    nsubj(sentence-4, This-1)
    cop(sentence-4, is-2)
    det(sentence-4, another-3)
    root(ROOT-0, sentence-4)
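The typed dependencies in this output follow a simple relation(governor-index, dependent-index) shape, so they are easy to post-process once captured as text. The snippet below is only a rough sketch of that idea and does not use the Stanford API at all: the Dependency record, tryParseDependency, and the regular expression are hypothetical names introduced here for illustration, and the pattern assumes plain integer word indices like the ones printed above.

```fsharp
open System.Text.RegularExpressions

/// One typed dependency, e.g. nsubj(sentence-4, This-1).
/// Hypothetical helper type; it is not part of the Stanford Parser API.
type Dependency =
    { Relation: string
      Governor: string
      GovernorIndex: int
      Dependent: string
      DependentIndex: int }

// Assumed textual shape: relation(governor-index, dependent-index)
let depPattern = Regex(@"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")

/// Returns Some dependency when the line matches the assumed shape, None otherwise.
let tryParseDependency (line: string) : Dependency option =
    let m = depPattern.Match(line.Trim())
    if m.Success then
        Some { Relation = m.Groups.[1].Value
               Governor = m.Groups.[2].Value
               GovernorIndex = int m.Groups.[3].Value
               Dependent = m.Groups.[4].Value
               DependentIndex = int m.Groups.[5].Value }
    else
        None

// The four dependency lines printed for "This is another sentence."
let sample =
    [ "nsubj(sentence-4, This-1)"
      "cop(sentence-4, is-2)"
      "det(sentence-4, another-3)"
      "root(ROOT-0, sentence-4)" ]

let deps = sample |> List.choose tryParseDependency

// The word attached to the artificial ROOT node is the head of the sentence.
let rootWord =
    deps
    |> List.tryFind (fun d -> d.Relation = "root")
    |> Option.map (fun d -> d.Dependent)

printfn "%A" rootWord
```

Inside the real program it would probably be more robust to read the relation, governor, and dependent directly from the objects returned by typedDependenciesCCprocessed() rather than re-parsing printed text; the string-based version is shown only because it runs without the compiled assembly.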
I want to mention once more that the full source code is available in the fsharp-stanford-nlp-samples GitHub repository. Feel free to use and extend it.