NLP: Stanford Parser with F# (.NET)


Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

All code samples from this post are available on GitHub.

Natural Language Processing is another hot topic, much like Machine Learning. It is certainly extremely important, but still poorly developed.

What do we have in .NET?

Let’s start with what we already have.

It looks really bad: it is hard to find something that is really useful. Actually, we have one more option, which is IKVM.NET. With IKVM.NET we should be able to use most Java-based NLP frameworks. Let’s try to import Stanford Parser into .NET.

IKVM.NET overview.

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

  • A Java Virtual Machine implemented in .NET
  • A .NET implementation of the Java class libraries
  • Tools that enable Java and .NET interoperability

Read more about what you can do with IKVM.NET.

About Stanford NLP

The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.

All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

IKVM .jar to .dll compilation

First of all, we need to download and install IKVM.NET. You can get it from SourceForge. The next step is to download Stanford Parser (the latest version at the time of writing is 2.0.4 from 2012-11-12). Now we need to compile stanford-parser.jar to a .NET assembly. You can do it with the following command:

ikvmc.exe stanford-parser.jar

If you need a strong-named (signed) assembly, two more steps are required: disassemble the DLL to IL and reassemble it with your key file.

ildasm.exe /all /out=stanford-parser.il stanford-parser.dll
ilasm.exe /dll /key=myKey.snk stanford-parser.il

An unsigned stanford-parser.dll is available on GitHub.
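The three commands above can be strung together in a small F# script. What follows is only a sketch: the `ToolCommand` record and the helper names are hypothetical, and the script merely builds and prints the command lines (with exactly the arguments shown above) instead of running the tools. It assumes that ikvmc.exe writes stanford-parser.dll next to the jar, which is what the ildasm step above expects.

```fsharp
// Sketch only: ToolCommand and the helpers below are hypothetical names,
// not part of IKVM.NET or the .NET SDK.
type ToolCommand = { Exe : string; Args : string }

// ikvmc.exe <jar> : compiles the jar into a .NET assembly.
let compileJar (jar : string) =
    { Exe = "ikvmc.exe"; Args = jar }

// ildasm.exe /all /out=<il> <dll> : disassembles the assembly to IL.
let disassemble (dll : string) (il : string) =
    { Exe = "ildasm.exe"; Args = sprintf "/all /out=%s %s" il dll }

// ilasm.exe /dll /key=<snk> <il> : reassembles the IL, signing it with the key file.
let assembleSigned (keyFile : string) (il : string) =
    { Exe = "ilasm.exe"; Args = sprintf "/dll /key=%s %s" keyFile il }

// The full sequence for one jar, in the order described in the post.
let signedBuildSteps (baseName : string) (keyFile : string) =
    [ compileJar (baseName + ".jar")
      disassemble (baseName + ".dll") (baseName + ".il")
      assembleSigned keyFile (baseName + ".il") ]

let render (c : ToolCommand) = sprintf "%s %s" c.Exe c.Args

// Print the commands so they can be reviewed or pasted into a console.
signedBuildSteps "stanford-parser" "myKey.snk"
|> List.map render
|> List.iter (printfn "%s")
```

Keeping the steps as plain data makes it easy to check the arguments before anything is executed; actually running them (for example with System.Diagnostics.Process) is left out on purpose.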

Let’s play!

That’s all! Now we are ready to start playing with Stanford Parser. Here I want to show one of the standard examples (ParserDemo.fs); the second one is available on GitHub with the other sources.

open java.io
open edu.stanford.nlp.ling
open edu.stanford.nlp.``process``
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees

let demoAPI (lp:LexicalizedParser) =
  // This option shows parsing a list of correctly tokenized words
  let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
  let rawWords = Sentence.toCoreLabelList(sent)
  let parse = lp.apply(rawWords)
  parse.pennPrint()

  // This option shows loading and using an explicit tokenizer
  let sent2 = "This is another sentence."
  let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
  use sent2Reader = new StringReader(sent2)
  let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
  let parse = lp.apply(rawWords2)

  // Build the grammatical structure of the second parse and print its typed dependencies
  let tlp = PennTreebankLanguagePack()
  let gsf = tlp.grammaticalStructureFactory()
  let gs = gsf.newGrammaticalStructure(parse)
  let tdl = gs.typedDependenciesCCprocessed()
  printfn "\n%O\n" tdl

  // Print the same tree again: Penn bracketing followed by collapsed typed dependencies
  let tp = new TreePrint("penn,typedDependenciesCollapsed")
  tp.printTree(parse)

// Note: demoDP (the second example) is not shown in this post; it is part of the sources on GitHub.
let main fileName =
  let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
  match fileName with
  | Some(file) -> demoDP lp file
  | None -> demoAPI lp

What are we doing here? First of all, we instantiate LexicalizedParser and initialize it with the englishPCFG.ser.gz model. Then we create two sentences. The first is created from an already tokenized input (a string array, in this sample). The second is created from a plain string using PTBTokenizer. Each sentence is parsed with lp.apply. For the second one we also create a PennTreebankLanguagePack, build the grammatical structure of the parse tree and print its typed dependencies. The resulting output can be found below.

[|"1"|]
Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\
stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... 
done [1.5 sec].
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (JJ easy) (NN sentence)))
    (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), 
root(ROOT-0, sentence-4)]
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT another) (NN sentence)))
    (. .)))
nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)
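
Each dependency line in this output has the shape relation(governor-index, dependent-index), where the index is the position of the word in the sentence and ROOT-0 is an artificial root node. As a rough illustration of how to read it, the sketch below turns such lines back into F# records. The `Dependency` type and `tryParseDependency` are hypothetical names, not part of the Stanford API, and the regular expression is an assumption that only covers the simple form shown here.

```fsharp
open System.Text.RegularExpressions

// Hypothetical record for one typed dependency, e.g. nsubj(sentence-4, This-1).
type Dependency =
    { Relation : string        // e.g. "nsubj"
      Governor : string        // head word, e.g. "sentence"
      GovernorIndex : int      // position in the sentence; 0 is the artificial ROOT
      Dependent : string       // dependent word, e.g. "This"
      DependentIndex : int }

// relation(word-index, word-index); anything else is treated as "not a dependency line".
let depPattern = Regex(@"^([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)$")

let tryParseDependency (line : string) : Dependency option =
    let m = depPattern.Match(line.Trim())
    if m.Success then
        let dep =
            { Relation = m.Groups.[1].Value
              Governor = m.Groups.[2].Value
              GovernorIndex = int (m.Groups.[3].Value)
              Dependent = m.Groups.[4].Value
              DependentIndex = int (m.Groups.[5].Value) }
        Some dep
    else
        None

// The four dependency lines printed for "This is another sentence."
let output =
    [ "nsubj(sentence-4, This-1)"
      "cop(sentence-4, is-2)"
      "det(sentence-4, another-3)"
      "root(ROOT-0, sentence-4)" ]

let deps = output |> List.choose tryParseDependency

for d in deps do
    printfn "%s: %s -> %s" d.Relation d.Governor d.Dependent
```

In real code it is usually better to work with the objects returned by typedDependenciesCCprocessed() directly; parsing the printed text is shown here only to make the output format explicit.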

Once again, the full source code is available in the fsharp-stanford-nlp-samples GitHub repository. Feel free to use and extend it.
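
One practical note when extending the sample: main loads the model through a relative path, which is presumably resolved against the current working directory of the process. The sketch below is a hypothetical guard (`requireModel` is not part of the sample or of the Stanford API) that fails early and reports the absolute path that was actually probed.

```fsharp
open System.IO

// Hypothetical helper: resolves a model path against a base directory and
// fails early, naming the absolute path that was probed.
let requireModel (baseDir : string) (relativePath : string) : string =
    let fullPath = Path.GetFullPath(Path.Combine(baseDir, relativePath))
    if not (File.Exists fullPath) then
        raise (FileNotFoundException("Parser model not found: " + fullPath, fullPath))
    fullPath

// Usage with the parser from this post (needs the Stanford Parser assembly,
// so it is left as a comment; relativeModelPath is a placeholder for your own path):
// let modelPath = requireModel (Directory.GetCurrentDirectory()) relativeModelPath
// let lp = LexicalizedParser.loadModel(modelPath)
```

The error message then points at the exact location where englishPCFG.ser.gz was expected, which is easier to act on than an exception raised from inside the parser.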


53 Responses to NLP: Stanford Parser with F# (.NET)

  1. Pingback: Don Syme's WebLog on F# and Related Topics

  2. Pingback: NLP: Stanford POS Tagger with F# (.NET) « Sergey Tihon's Blog

  3. Pingback: F# Weekly #6, 2013 « Sergey Tihon's Blog

  4. Pingback: NLP: Stanford Named Entity Recognizer with F# (.NET) « Sergey Tihon's Blog

  5. Pingback: F# IKVM Type Provider | FourEightThree

  6. Orville says:

    It’s going to be the end of my day, but before I finish I am reading this wonderful article to increase my knowledge.

  7. bradodarb says:

    Your work looks to be very promising. Unfortunately, I have not made time just yet to become familiar with F#. Do you have any pointers on working with the Stanford objects in c#?
    Maybe a quick snippet showing construction of the parser and getting some simple POS?

    Best,
    B.M.

    • Sergey Tihon says:

      You can port it in a really straightforward way.

      • Likhil.T says:

        Sergey Tihon,
        Will you please send me the complete code with packages?

  8. bradodarb says:

    Excellent, thanks!
    I was just about to decompile your demo .dll’s to see the c# methods generated.

    It’s mainly this construct that threw me for a loop:
    let demoAPI (lp:LexicalizedParser)

    I really need to start working with F#, as your original code seems much more elegant than the c# version.

    Thank you very much for your time, now I can start experimenting with the parser in my c# project!

    -B.M.

  9. Pingback: FSharp.NLP.Stanford.Parser available on NuGet | Sergey Tihon's Blog

  10. Pingback: Stanford Parser is available on NuGet | Sergey Tihon's Blog

  11. Flo says:

    I’m also trying to reuse your work in a C# project, but I am having trouble building your project: IKVM.Fsharp.dll can’t be built because of some errors in “Collections.fs”… In fact, Visual Studio can’t interpret “open java.util” in this file, and I assume this is normal, as IKVM is supposed to be the library that actually defines it, so I don’t get why this import is needed here.
    Maybe I missed some step of the process, do you have any idea of where it could come from?

    • Sergey Tihon says:

      Please try NuGet Package https://nuget.org/packages/Stanford.NLP.Parser/ and try this sample http://sergeytihon.wordpress.com/2013/07/11/stanford-parser-is-available-on-nuget/

      • Flo says:

        Well, the NuGet package is working fine. In fact, I was mainly interested in your NER implementation; the parser itself works fine if I import it in another project.

    • Sergey Tihon says:

      Please wait a bit, I will publish NER very soon. It may happen even today.

      • Flo says:

        Oh, okay, thanks again for your great work :)

  12. Pingback: Stanford Named Entity Recognizer (NER) is available on NuGet | Sergey Tihon's Blog

  13. Pingback: let runFAKE = Download >> Unzip >> IKVMCompile >> Sign >> NuGet | Sergey Tihon's Blog

  14. Arnaoty says:

    I tried the same with Stanford Segmenter (using C#). The main drawback is that the emitted files have no generics. For instance, I can’t write CRFClassifier with a type parameter; instead I can only write plain CRFClassifier. Whenever I run my code I get a RuntimeException, and I guess it is related to this problem.

    • Sergey Tihon says:

      It is strange, because it works for me. I made a NuGet package at your request: https://www.nuget.org/packages/Stanford.NLP.Segmenter/3.2.0.0 . Details on how it works can be found in the post http://sergeytihon.wordpress.com/2013/09/09/stanford-word-segmenter-is-available-on-nuget/

      • Arnaoty says:

        Many thanks, it worked fine. I just had some mistakes with classifier flags. Thank you for your help.

  15. Hello, I appreciate your sharing this – but I don’t see how to get your example-code to work. I installed the Nuget packages (all of them – there’re several) but how do you actually get a workable F# file? What exactly do you do to import the Stanford NLP code? I tried “open FSharp.NLP.Stanford.Parse”
    What is that “lp.” in this line: let demoAPI (lp:LexicalizedParser) =
    And then finally, my code is not recognizing PTBTokenizer (nor a lot of other things in this example).

    Any pointers would be appreciated. How can I get your illustrative sample-code to run?

    • Sergey Tihon says:

      What do you mean by F# file? Are you trying to use it from *.fsx (an F# script file) or to compile *.fs?
      If you need to compile your code, you can look at the full code sample on GitHub – https://github.com/sergey-tihon/fsharp-stanford-nlp-samples/tree/master/fsharp-stanford-nlp-samples/StanfordParser.Samples . If you need to do it from fsx, you need to load the required assemblies in FSI (#r “…”).

      • Thank you for responding Sergey. Sorry – I meant an .fs file that I want to compile into my Visual Studio solution (of which most is C#). Yes – I see the samples now. Which leads to the next question, if you don’t mind: How to get it to build? I downloaded it from that Github project as a zip file, unzipped it, and loaded the solution-file into VS 2013. I get 10 errors, possibly related to 204 warnings such as: “Could not located the assembly “IKVM.OpenJDK….
        I’m thinking there is probably an important setup step that I’m missing. I do see under “How to use it”, these instructions: “Download models from..” (but I’m not seeing how to inform the Visual Studio project how to know where those models get placed), and “Extract models from ‘stanford-parse-3.2.0-models.jar (just unzip it).” and again, no indication of how to inform how to locate those.

      • Sergey Tihon says:

        1) “Could not located the assembly IKVM.OpenJDK….” This means that you should restore the NuGet dependencies: right-click on the solution, click ‘Manage NuGet packages’, then click `Restore`.
        2) Find the lines of code that mention the path to ‘englishPCFG.ser.gz’. This file is actually packed inside `stanford-parser-3.3.0-models.jar`. Update the path to the correct one (where you extracted it).

  16. Hi Sergey – thank you for trying to help. I don’t see a “Restore” option. Right-clicking on the solution in VS2013, I see the option “Manage Nuget Packages for Solution…”. Within the resulting dialog, I checked out “Installed packages”, and “All”, and “Online”, and “Updates”. For “Installed packages”, I do see one pkg, named “IKVM.NET”, and it has a “Manage” button. Clicking on that brings up a “Select Projects” dialog, with all of the several projects already selected. I do see that when I bring up the source file Collections.fs, on the line with “open java.util”, “java” is underlined in red. As are the words “Iterator” and “ArrayList”.

    And in the IKVM.FSharp project, looking in References – there’s a whole mess of references not found – all starting with “IKVM.” Looks like something is not in the right place?

    Sorry to be a whiner. I’m rather excited at the prospect of exploring this! Thanks for your advice,

    jh

    • Sergey Tihon says:

      It is very strange… Could you try to re-install IKVM.NET from NuGet (remove and install again)? It looks like the simplest way …

      • Done. Now I get 75 errors. Does “The namespace or module ‘edu’ is not defined” sound familiar? Which version of Visual Studio are you using?

      • Sergey Tihon says:

        VS2013 or VS2012. It is not important.
        The ‘edu’ error looks like you do not reference the Stanford.NLP.Parser NuGet package.

  17. Evidently, the NuGet packaging is what is not working. I downloaded the IKVM.NET bit of software separately and uncompressed it, and that does have the DLL files that this solution complained about missing. So I added a reference to ikvm-7.2.4630.5/bin-x64/JVM.DLL, and now those references do show up within (for example) the StanfordNamedEntityRecognizerSamples project. It still gives an error when trying to run it, though, raising a FileLoadException, because IKVM.Open.JDK.Core, 7.3.4830.0, does not match the manifest. Is that perhaps because the manifest calls for a different version? Has anyone ever gotten this to run? I have a fresh virtual machine with VS 2010 to use, to try again from scratch. But a more explicit set of steps would probably help, so that I don’t waste another day trying every possible combination.

  18. Hello, what is the actual sequence of steps required to get your project working? All I need help with, I think, is just to get to the point of having one working sample. I believe I can take it from there.

    I used git clone https://github.com/sergey-tihon/fsharp-stanford-nlp-samples.git
    to get your repository onto a fresh virtual machine (vm), with Windows 7 x64, and Visual Studio 2010 Ultimate.

    I see that created a folder fsharp-stanford-nlp-samples, and there is a Visual Studio (VS) solution within that (which, by-the-way, I’m not able to open with VS 2010 – I had to shift over to another VM that I’d installed VS 2012 on). So I opened that solution file, tried to build.. 8 errors. You have to check “Allow NuGet to download missing packages during build.” from Tools/Options/Package Manager. Tried to build again: 7 Errors.

    Could not resolve this reference. Could not locate the assembly ‘IKVM.Open.IDK.SwingAWT, ..
    Ok – trying now to use NuGet to bring in dependencies. Opening “Manage NuGet Packages”, I do a search for “Stanford”, and see six different packages.

    Stanford.NLP.NER
    Stanford.NLP.Parser
    Stanford.NLP.POSTagger
    Stanford.NLP.CoreNLP
    FSharp.NLP.Stanford.Parser
    Stanford.NLP.Segmenter

    I wonder – which of these needs to be installed? What is the minimum needed, to start? I tried getting just Stanford.NLP.Parser. Building the solution now yields 107 Warnings, 10 Errors.

    So then I tried installing all six of those packages, checking the checkboxes to ensure they were installed for every project.

    Now a build of the solution yields: 107 Warnings, 9 Errors. The first warning is the same as shown above.

    I am thinking that, perhaps, it could be useful to have some steps explicitly laid out for people to use this. Unless (not unlikely) I am totally missing something obvious?

    Thank you for your help Sergey,
    James Hurst

  19. Sergey Tihon says:

    Hi, I think that I have partially reproduced this case. You should not reference all available packages. They may conflict with each other (the same types in the same namespaces). First of all, decide which one you need and then reference it from NuGet (read more about the packages on the Stanford NLP website http://nlp.stanford.edu/software/index.shtml).
    CoreNLP should be an umbrella project. Almost all available features should be inside.

  20. Waq says:

    Hello Sergey-

    Is there any link or a tutorial like this to getting started to incorporate this parser in Java?

    • Sergey Tihon says:

      Originally, it is a Java parser. Instructions are available on the original site – http://www-nlp.stanford.edu/software/lex-parser.shtml

      • Waq says:

        Thank you Sergey !

  21. Rohit says:

    Can you please tell me how exactly I should make a dll from the .jar file? I am unable to do that.
    I have used your line of code
    ikvmc.exe stanford-parser.jar
    After downloading, I have two different folders one is ikvmbin-7.2.4630.5 and second is stanford-parser-2012-11-12

    Regards,
    Rohit

    • Sergey Tihon says:

      Hi, you do not need to do it yourself. You can download an already compiled version from NuGet https://www.nuget.org/packages/Stanford.NLP.Parser/ .
      Up-to-date samples are available here http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordParser.html

  22. Rohit says:

    Where can i find english.all.3class.distsim.crf.ser.gz file ??
    Its throwing the exception as ‘TypeInitializationException’…
    I am referencing code from here..

    http://www.stewh.com/2013/11/extracting-named-entities-in-c-using-the-stanford-nlp-parser/

    Thanks

  23. Rohit says:

    Hi Sergey,

    I found the link below where they have used some txt files for state,names etc..

    http://grepcode.com/file/repo1.maven.org/maven2/edu.stanford.nlp/stanford-corenlp/1.2.0/edu/stanford/nlp/models/dcoref/state-abbreviations.txt?av=f

    So, my question how to include these files and use it in C#.Net code

    Regards,
    Rohit

  24. Sergey Tihon says:

    All files are packed in a zip archive (referenced from the page http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html ). You need to download the zip, unpack it and use the files from inside. There are two options: temporarily change the current directory, or manually specify the paths to all required files https://github.com/sergey-tihon/Stanford.NLP.NET/blob/master/tests/Stanford.NLP.CoreNLP.FSharp.Tests/CoreNLP.fs

  25. Rohit says:

    Hey Sergey,

    Thanks for the help. Your inputs are always helpful.

    But, Wanted to ask one thing, can I make resume parser using Stanford Library?
    If yes, from which point should i start. Because exact way I am not getting it.

    Have seen all of the examples/demos for this library.

    Can you please help me on that.

    Thanks,
    Rohit

    • Sergey Tihon says:

      Hi, It depends on multiple things:
      – What is your goal? What do you want to extract from your resume?
      – What is the format of your resume? Is it structured?

  26. Rohit says:

    Hey Sergey,

    1) What is your goal? What do you want to extract from your resume?
    — My aim is to extract candidate information from resume and store in the database. Want to extract almost every single information Like
    ( First name,last name,email id,mobile#,projects,experience,personal info, academic records, awards, achievements,skill,qualifications etc etc.).
    Currently I am concentrating Only on doc,docx, text files.
    So this information will be useful while searching a suitable candidate for a job.

    2) What is the format of your resume? Is it structured?
    It is Unstructured, every candidate will have different type of resume.

    • I guess you could try RegEx; you can use Expresso initially to test your regular expressions. I say so because resumes usually have strings like “Name” before the candidate writes his name, and likewise for other details.

  27. I’m using Stanford Dependency Parser to resolve dependencies in one of my projects. I have the following problem, and I hope you will help me.
    When I analyze dependencies in review text, it works great for short sentences, but for long sentences it does not give all the required dependencies. For example, when I try to find the dependencies in the following sentence,
    “The Navigation is better.” there is an nsubj dependency that groups “Navigation” and “better”, telling me that the review regarding navigation is positive.

    But when review sentence is bigger like
    “Navigation system is better then the Jeeps and as good as my husbands Audi A-8 system.”

    I don’t get any dependency relations grouping Navigation with better and Navigation with good. I tried using both basic and collapsed dependencies. I went through Stanford Dependencies Manual , but couldn’t figure out much that will help here. I just want whatever the aspect user is talking about should be grouped with its adjective and adverb.

    • I’m trying with CCprocessed dependency ….

    • Well, there is an update: I tried all the dependency methods available in stanford.nlp.net, viz. .typedDependenciesCCprocessed(true); .typedDependenciesCollapsed(true); typedDependencies(true); typedDependenciesCollapsedTree(); allTypedDependencies();

      • Sergey Tihon says:

        Hello, please ask this question on SO http://stackoverflow.com/questions/tagged/stanford-nlp

  28. Likhil.T says:

    I am not able to find the file stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz.
    Please Help me

    • Likhil.T says:

      It’s very urgent.

    • Sergey Tihon says:

      Models are inside the `*models.jar` in this zip: http://nlp.stanford.edu/software/stanford-parser-full-2014-06-16.zip
