NLP: Stanford Parser with F# (.NET)

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

All code samples from this post are available on GitHub.

Natural Language Processing is another hot topic, much like Machine Learning. It is certainly extremely important, but it is poorly covered in the .NET world.

What do we have in .NET?

Let's start with what we already have.

It looks really bad: it is hard to find anything that is really useful. Actually, we have one more option: IKVM.NET. With IKVM.NET we should be able to use most Java-based NLP frameworks. Let's try to import the Stanford Parser into .NET.

IKVM.NET overview

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

  • A Java Virtual Machine implemented in .NET
  • A .NET implementation of the Java class libraries
  • Tools that enable Java and .NET interoperability

Read more about what you can do with IKVM.NET.

About Stanford NLP

The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.

All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

IKVM .jar to .dll compilation

First of all, we need to download and install IKVM.NET; you can get it from SourceForge. The next step is to download the Stanford Parser (the current latest version is 2.0.4 from 2012-11-12). Now we need to compile stanford-parser.jar into a .NET assembly. You can do it with the following command:

ikvmc.exe stanford-parser.jar

If you need a strong-named assembly, you should do a few more steps: generate a key pair, then disassemble and reassemble the library, signing it with your key.

sn.exe -k myKey.snk
ildasm.exe /all /out=stanford-parser.il stanford-parser.dll
ilasm.exe /dll /key=myKey.snk stanford-parser.il

The signed stanford-parser.dll is available on GitHub.

Let’s play!

That’s all! Now we are ready to start playing with the Stanford Parser. I want to show here one of the standard examples (ParserDemo.fs); the second one is available on GitHub with the other sources.

let demoAPI (lp:LexicalizedParser) =
  // This option shows parsing a list of correctly tokenized words
  let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
  let rawWords = Sentence.toCoreLabelList(sent)
  let parse = lp.apply(rawWords)
  parse.pennPrint()

  // This option shows loading and using an explicit tokenizer
  let sent2 = "This is another sentence."
  let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
  use sent2Reader = new StringReader(sent2)
  let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
  let parse = lp.apply(rawWords2)

  let tlp = PennTreebankLanguagePack()
  let gsf = tlp.grammaticalStructureFactory()
  let gs = gsf.newGrammaticalStructure(parse)
  let tdl = gs.typedDependenciesCCprocessed()
  printfn "\n%O\n" tdl

  let tp = new TreePrint("penn,typedDependenciesCollapsed")
  tp.printTree(parse)

let main fileName =
  let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
  match fileName with
  | Some(file) -> demoDP lp file
  | None -> demoAPI lp
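
The `main` function above also calls a `demoDP` function that this post doesn't show (it lives with the other sources on GitHub). As a rough sketch of what such a function can look like, modeled on the `demoDP` from Stanford's Java ParserDemo; the iterator loop and the downcast are my assumptions about how the IKVM-compiled classes surface in F#, not the repository's exact code:

```fsharp
// Sketch: parse every sentence of a text file (assumes the same opens as above)
let demoDP (lp:LexicalizedParser) (fileName:string) =
    let tlp = PennTreebankLanguagePack()
    let gsf = tlp.grammaticalStructureFactory()
    // DocumentPreprocessor splits the file into sentences (a java.lang.Iterable)
    let it = DocumentPreprocessor(fileName).iterator()
    while it.hasNext() do
        let sentence = it.next() :?> java.util.List
        let parse = lp.apply(sentence)
        parse.pennPrint()
        let gs = gsf.newGrammaticalStructure(parse)
        printfn "\n%O\n" (gs.typedDependenciesCCprocessed(true))
```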

What are we doing here? First of all, we instantiate a LexicalizedParser and initialize it with the englishPCFG.ser.gz model. Then we create two sentences. The first one is created from an already tokenized string (from a string array, in this sample). The second one is created from a plain string using PTBTokenizer. After that, we create a PennTreebankLanguagePack and use its grammatical structure factory to extract the typed dependencies of the second parse. Finally, we print the parse tree together with its collapsed typed dependencies using TreePrint. The resulting output can be found below.

[|"1"|]
Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\
stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... 
done [1.5 sec].
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT an) (JJ easy) (NN sentence)))
 (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), 
root(ROOT-0, sentence-4)]
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT another) (NN sentence)))
 (. .)))
nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)
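
The dependency list above was printed in one go with `%O`. If you need the relations one at a time (to filter them, for example), you can walk the underlying Java collection through its iterator. A small sketch; the helper name is mine, and the `java.util.List` annotation is an assumption about the IKVM-compiled return type:

```fsharp
// Hypothetical helper: print typed dependencies one per line via the Java iterator
let printDependencies (tdl: java.util.List) =
    let it = tdl.iterator()
    while it.hasNext() do
        printfn "%O" (it.next())
```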

I want to mention one more time that the full source code is available in the fsharp-stanford-nlp-samples GitHub repository. Feel free to use and extend it.

63 thoughts on “NLP: Stanford Parser with F# (.NET)”

  2. Your work looks to be very promising. Unfortunately, I have not made time just yet to become familiar with F#. Do you have any pointers on working with the Stanford objects in c#?
    Maybe a quick snippet showing construction of the parser and getting some simple POS?

    Best,
    B.M.

    1. You can port it really straightforwardly:


      using java.io;
      using edu.stanford.nlp.process;
      using edu.stanford.nlp.ling;
      using edu.stanford.nlp.trees;
      using edu.stanford.nlp.parser.lexparser;

      namespace Stanford_Parser
      {
          class Program
          {
              static void demoAPI(LexicalizedParser lp)
              {
                  // This option shows parsing a list of correctly tokenized words
                  var sent = new[] { "This", "is", "an", "easy", "sentence", "." };
                  var rawWords = Sentence.toCoreLabelList(sent);
                  var parse = lp.apply(rawWords);
                  parse.pennPrint();

                  // This option shows loading and using an explicit tokenizer
                  var sent2 = "This is another sentence.";
                  var tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
                  var sent2Reader = new StringReader(sent2);
                  var rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();
                  parse = lp.apply(rawWords2);

                  var tlp = new PennTreebankLanguagePack();
                  var gsf = tlp.grammaticalStructureFactory();
                  var gs = gsf.newGrammaticalStructure(parse);
                  var tdl = gs.typedDependenciesCCprocessed();
                  System.Console.WriteLine();
                  for (var it = tdl.iterator(); it.hasNext();)
                      System.Console.WriteLine("{0}", it.next());
                  System.Console.WriteLine();

                  var tp = new TreePrint("penn,typedDependenciesCollapsed");
                  tp.printTree(parse);
              }

              static void Main(string[] args)
              {
                  var lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz");
                  demoAPI(lp);
              }
          }
      }


      1. Sergey Tihon,
        Will you please send me the complete code with packages.

  3. Excellent, thanks!
    I was just about to decompile your demo .dll’s to see the c# methods generated.

    It's mainly this construct that threw me for a loop:
    let demoAPI (lp:LexicalizedParser)

    I really need to start working with F#, as your original code seems much more elegant than the c# version.

    Thank you very much for your time, now I can start experimenting with the parser in my c# project!

    -B.M.

  4. I’m also trying to reuse your work in a C# project, but I am having trouble building your project: IKVM.Fsharp.dll can’t be built because of some errors in “Collections.fs”. In fact, Visual Studio can’t interpret “open java.util” in this file, and I assume this is normal, as IKVM is supposed to be the library that actually defines it, so I don’t get why this import is needed here.
    Maybe I missed some step of the process, do you have any idea of where it could come from?

      1. Well the NuGet package is working fine, in fact I was mainly interested by your NER implementation, the parser itself works fine if I import it in another project.

  5. I tried the same with the Stanford Segmenter (using C#). The main drawback is that the emitted files have no generics. For instance, I can’t write CRFClassifier<CoreLabel>; I can only write CRFClassifier. Whenever I run my code I get a RuntimeException, and I guess it is related to this problem.

      1. Many thanks, it worked fine. I just had some mistakes with classifier flags. Thank you for your help.

  6. Hello, I appreciate your sharing this – but I don’t see how to get your example code to work. I installed the NuGet packages (all of them – there are several), but how do you actually get a workable F# file? What exactly do you do to import the Stanford NLP code? I tried “open FSharp.NLP.Stanford.Parse”
    What is that “lp.” in this line: let demoAPI (lp:LexicalizedParser) =
    And then finally, my code is not recognizing PTBTokenizer (nor a lot of other things in this example).

    Any pointers would be appreciated. How can I get your illustrative sample-code to run?

      1. Thank you for responding, Sergey. Sorry – I meant an .fs file that I want to compile into my Visual Studio solution (of which most is C#). Yes – I see the samples now. Which leads to the next question, if you don’t mind: how do I get it to build? I downloaded it from that GitHub project as a zip file, unzipped it, and loaded the solution file into VS 2013. I get 10 errors, possibly related to 204 warnings such as: “Could not locate the assembly “IKVM.OpenJDK….
        I’m thinking there is probably an important setup step that I’m missing. I do see under “How to use it” these instructions: “Download models from..” (but I’m not seeing how to tell the Visual Studio project where those models get placed), and “Extract models from ‘stanford-parser-3.2.0-models.jar’ (just unzip it).” and again, no indication of how to locate those.

      2. 1) “Could not locate the assembly IKVM.OpenJDK….”: this means that you should restore the NuGet dependencies: right-click on the solution, click ‘Manage NuGet Packages’, then click ‘Restore’.
        2) Find the lines of code that mention the path to ‘englishPCFG.ser.gz’. This file is actually packed into ‘stanford-parser-3.3.0-models.jar’. Update the path to the correct one (wherever you extracted it).

  7. Hi Sergey – thank you for trying to help. I don’t see a “Restore” option. Right-clicking on the solution in VS 2013, I see the option “Manage NuGet Packages for Solution…”. Within the resulting dialog, I checked “Installed packages”, “All”, “Online”, and “Updates”. Under “Installed packages” I do see one package, named “IKVM.NET”, and it has a “Manage” button. Clicking on that brings up a “Select Projects” dialog, with all of the several projects already selected. I do see that when I bring up the source file Collections.fs, on the line with “open java.util”, “java” is underlined in red, as are the words “Iterator” and “ArrayList”.

    And in the IKVM.FSharp project, looking in References – there’s a whole mess of references not found – all starting with “IKVM.” Looks like something is not in the right place?

    Sorry to be a whiner. I’m rather excited at the prospect of exploring this! Thanks for your advice,

    jh

      1. Done. Now I get 75 errors. Does “The namespace or module ‘edu’ is not defined” sound familiar? Which version of Visual Studio are you using?

  8. Evidently, the NuGet packaging is what is not working. I downloaded the IKVM.NET software separately and uncompressed it, and it does have the DLL files that this solution complained about missing. So I added a reference to ikvm-7.2.4630.5/bin-x64/JVM.DLL, and now those references do show up within (for example) the StanfordNamedEntityRecognizerSamples project. It still gives an error when trying to run it, though, raising a FileLoadException, because IKVM.OpenJDK.Core, 7.3.4830.0, does not match the manifest. Is that perhaps because the manifest calls for a different version? Has anyone ever gotten this to run? I have a fresh virtual machine with VS 2010 to use, to try again from scratch. But a more explicit set of steps would probably help, so that I don’t waste another day trying every possible combination.

  9. Hello, what is the actual sequence of steps required to get your project working? All I need help with, I think, is just to get to the point of having one working sample. I believe I can take it from there.

    I used git clone https://github.com/sergey-tihon/fsharp-stanford-nlp-samples.git
    to get your repository onto a fresh virtual machine (vm), with Windows 7 x64, and Visual Studio 2010 Ultimate.

    I see that this created a folder fsharp-stanford-nlp-samples, and there is a Visual Studio (VS) solution within that (which, by the way, I’m not able to open with VS 2010 – I had to shift over to another VM on which I’d installed VS 2012). So I opened that solution file and tried to build: 8 errors. You have to check “Allow NuGet to download missing packages during build” from Tools/Options/Package Manager. Tried to build again: 7 errors.

    Could not resolve this reference. Could not locate the assembly ‘IKVM.OpenJDK.SwingAWT, ..
    Ok – trying now to use NuGet to bring in dependencies. Opening “Manage NuGet Packages”, I do a search for “Stanford”, and see six different packages.

    Stanford.NLP.NER
    Stanford.NLP.Parser
    Stanford.NLP.POSTagger
    Stanford.NLP.CoreNLP
    FSharp.NLP.Stanford.Parser
    Stanford.NLP.Segmenter

    I wonder – which of these needs to be installed? What is the minimum needed, to start? I tried getting just Stanford.NLP.Parser. Building the solution now yields 107 Warnings, 10 Errors.

    So then I tried installing all six of those packages, checking the checkboxes to ensure they were installed for every project.

    Now a build of the solution yields: 107 Warnings, 9 Errors. The first warning is the same as shown above.

    I am thinking that, perhaps, it could be useful to have some steps explicitly laid out for people to use this. Unless (not unlikely) I am totally missing something obvious?

    Thank you for your help Sergey,
    James Hurst

  10. Hi, I think I have partially reproduced this case. You should not reference all the available packages: they may conflict with each other (the same types in the same namespaces). First of all, decide which one you need and then reference it from NuGet (read more about the packages on the Stanford NLP website http://nlp.stanford.edu/software/index.shtml).
    CoreNLP should be an umbrella project; almost all available features should be inside.

  11. Hello Sergey-

    Is there any link or a tutorial like this to getting started to incorporate this parser in Java?

  12. Can you please tell me how exactly I should make a dll from the .jar file? I am unable to do that.
    I have used your line of code:
    ikvmc.exe stanford-parser.jar
    After downloading, I have two different folders: one is ikvmbin-7.2.4630.5 and the second is stanford-parser-2012-11-12.

    Regards,
    Rohit

  13. Hey Sergey,

    Thanks for the help. Your inputs are always helpful.

    But I wanted to ask one thing: can I make a resume parser using the Stanford library?
    If yes, from which point should I start? Because I am not getting the exact way.

    Have seen all of the examples/demos for this library.

    Can you please help me on that.

    Thanks,
    Rohit

    1. Hi, it depends on multiple things:
      – What is your goal? What do you want to extract from your resume?
      – What is the format of your resume? Is it structured?

  14. Hey Sergey,

    1) What is your goal? What do you want to extract from your resume?
    — My aim is to extract candidate information from the resume and store it in a database. I want to extract almost every single piece of information, like
    (first name, last name, email id, mobile #, projects, experience, personal info, academic records, awards, achievements, skills, qualifications, etc.).
    Currently I am concentrating only on doc, docx, and text files.
    So this information will be useful while searching for a suitable candidate for a job.

    2) What is the format of your resume? Is it structured?
    It is unstructured; every candidate will have a different type of resume.

    1. I suggest you try RegEx; you can use Expresso initially to test your regular expressions. I’m saying so because resumes usually have strings like “Name” before the candidate writes his name, and likewise for other details.

  15. I’m using the Stanford Dependency Parser to resolve dependencies in one of my projects. I have the following problem; I hope you will help me:
    when I’m analyzing dependencies in a review text, it works great when the sentence is short, but for long sentences it does not give all the required dependencies. For example, when I try to find the dependencies in the following sentence,
    “The Navigation is better.”, there is a dependency nsubj that groups “Navigation” and “better”, telling me the review regarding navigation is positive.

    But when the review sentence is longer, like
    “Navigation system is better then the Jeeps and as good as my husbands Audi A-8 system.”

    I don’t get any dependency relations grouping Navigation with better or Navigation with good. I tried using both basic and collapsed dependencies. I went through the Stanford Dependencies Manual but couldn’t figure out much that will help here. I just want whatever aspect the user is talking about to be grouped with its adjectives and adverbs.

    1. Well, there is an update: I tried using all the dependency models available in stanford.nlp.net, viz. typedDependenciesCCprocessed(true), typedDependenciesCollapsed(true), typedDependencies(true), typedDependenciesCollapsedTree(), and allTypedDependencies().

  16. I can’t find the file stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz.
    Please help me.

  17. I have followed your instructions up to “IKVM .jar to .dll compilation”. I think I was successful up to there. I then created a F# project in Visual Studio 2013. I put your code into it. VS does not have a definition of LexicalizedParser. I understand C++ and C# but I do not understand F#. I assume we must add a reference but I do not know what to reference. Is there more to the F# program that does the equivalent of a “using” in C#? Am I correct that the F# program needs a little bit more such as that?

    I also used the tangiblesoftwaresolutions.com converter to convert the stanford-parser ParserDemo.java sample to C# but obviously that also needs a reference.

    I apologize for not being able to figure this out, but if you can help me to understand what to reference then I will appreciate it.

    I have seen your samples in your “Stanford Parser is available on NuGet for F# and C#” but the C# sample source also does not show what to reference and such. If that question is easily answered when I install what you have from that article then I should do that. Are the answers there?

    1. Okay I installed Stanford.NLP.Parser using the Package Manager Console. In the C# program (converted from the stanford-parser ParserDemo.java sample to C#) I managed to get:

      using edu.stanford.nlp.process;
      using edu.stanford.nlp.ling;
      using edu.stanford.nlp.trees;
      using edu.stanford.nlp.parser.lexparser;

      And that seems to work except there is one error that is outside the scope of here. So I tried using the following for your F# sample here:

      open edu.stanford.nlp.process
      open edu.stanford.nlp.ling
      open edu.stanford.nlp.trees
      open edu.stanford.nlp.parser.lexparser

      However VS says that “process” is reserved.

  18. I copied the C# code you have here, but the line “var gs = gsf.newGrammaticalStructure(tree);” is causing an error:

    “A first chance exception of type ‘edu.stanford.nlp.trees.tregex.TregexParser.LookaheadSuccess’ occurred in stanford-corenlp-3.5.2.dll”

    Any ideas?

      1. Unfortunately, I’m still getting the same error. It’s actually a stack overflow error, but the output window is printing “A first chance exception of type ‘edu.stanford.nlp.trees.tregex.TregexParser.LookaheadSuccess'” until the overflow occurs

      2. My code is almost identical to that example but does not work. I believe I have the most recent package from NuGet (3.5.2). I downloaded by typing “Install-Package Stanford.NLP.Parser” in the PM console as instructed.
