Stanford Parser is available on NuGet for F# and C#

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

I have already written a small series of posts about porting Stanford NLP products to .NET using IKVM.NET. The first was about the Stanford Parser: “NLP: Stanford Parser with F# (.NET)”. It shows how to recompile the parser and use it from F#. Recently I wrote one more post, “FSharp.NLP.Stanford.Parser available on NuGet”, which announced an already recompiled version of the Stanford Parser, shipped as a NuGet package together with some helper functionality for F# developers.

As I see it, this is still not as simple as it should be. I have sometimes seen questions from C# developers about various NLP tasks, with answers pointing to my “The Stanford Natural Language Processing Samples, in F#” repository (like this). It is probably not so easy to find the latest version of the IKVM.NET compiler (it is not included in the IKVM.NET NuGet package) and to quickly rebuild the Stanford Parser from scratch for the first time.

I have decided to create a NuGet package that is a clean port of the Stanford Parser to .NET, with strong-named (signed) assemblies and without dependencies on F#. My primary goal has been to find a clear, simple and intuitive way for all NLP lovers to try NLP magic from .NET. Now it is simpler than ever:

F# Sample

The F# sample is not much different from the one in the “NLP: Stanford Parser with F# (.NET)” post. For more details, see the source code on GitHub.

open java.io
open java.util
open edu.stanford.nlp.ling
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.``process``
open edu.stanford.nlp.trees

let demoDP (lp:LexicalizedParser) (fileName:string) =
    // This option shows loading, sentence-segmenting and tokenizing
    // a file using DocumentPreprocessor
    let tlp = PennTreebankLanguagePack()
    let gsf = tlp.grammaticalStructureFactory()
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    DocumentPreprocessor(fileName)
    |> Iterable.toSeq
    |> Seq.cast<List>
    |> Seq.iter (fun sentence ->
        let parse = lp.apply(sentence)
        parse.pennPrint()

        let gs = gsf.newGrammaticalStructure(parse)
        let tdl = gs.typedDependenciesCCprocessed(true)
        printfn "\n%O\n" tdl
    )

let demoAPI (lp:LexicalizedParser) =
    // This option shows parsing a list of correctly tokenized words
    let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
    let rawWords = Sentence.toCoreLabelList(sent)
    let parse = lp.apply(rawWords)
    parse.pennPrint()

    // This option shows loading and using an explicit tokenizer
    let sent2 = "This is another sentence."
    let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
    use sent2Reader = new StringReader(sent2)
    let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
    let parse = lp.apply(rawWords2)

    let tlp = PennTreebankLanguagePack()
    let gsf = tlp.grammaticalStructureFactory()
    let gs = gsf.newGrammaticalStructure(parse)
    let tdl = gs.typedDependenciesCCprocessed()
    printfn "\n%O\n" tdl

    let tp = new TreePrint("penn,typedDependenciesCollapsed")
    tp.printTree(parse)

let main fileName =
    let lp = LexicalizedParser.loadModel(@"...\englishPCFG.ser.gz")
    match fileName with
    | Some(file) -> demoDP lp file
    | None -> demoAPI lp

C# Sample

The C# version is quite similar. For more details, see the source code on GitHub.

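using java.io;    // StringReader below is java.io.StringReader, not System.IO.StringReader
using java.util;  // List below is java.util.List
using edu.stanford.nlp.ling;
using edu.stanford.nlp.parser.lexparser;
using edu.stanford.nlp.process;
using edu.stanford.nlp.trees;

public static class ParserDemo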
{
    public static void DemoDP(LexicalizedParser lp, string fileName)
    {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor
        var tlp = new PennTreebankLanguagePack();
        var gsf = tlp.grammaticalStructureFactory();
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        foreach (List sentence in new DocumentPreprocessor(fileName))
        {
            var parse = lp.apply(sentence);
            parse.pennPrint();

            var gs = gsf.newGrammaticalStructure(parse);
            var tdl = gs.typedDependenciesCCprocessed(true);
            System.Console.WriteLine("\n{0}\n", tdl);
        }
    }

    public static void DemoAPI(LexicalizedParser lp)
    {
        // This option shows parsing a list of correctly tokenized words
        var sent = new[] { "This", "is", "an", "easy", "sentence", "." };
        var rawWords = Sentence.toCoreLabelList(sent);
        var parse = lp.apply(rawWords);
        parse.pennPrint();

        // This option shows loading and using an explicit tokenizer
        const string Sent2 = "This is another sentence.";
        var tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        var sent2Reader = new StringReader(Sent2);
        var rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();
        parse = lp.apply(rawWords2);

        var tlp = new PennTreebankLanguagePack();
        var gsf = tlp.grammaticalStructureFactory();
        var gs = gsf.newGrammaticalStructure(parse);
        var tdl = gs.typedDependenciesCCprocessed();
        System.Console.WriteLine("\n{0}\n", tdl);

        var tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }

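    public static void Start(string fileName)
    {
        // Program.ParserModel is a string with the path to the model file
        // (englishPCFG.ser.gz) on your machine
        var lp = LexicalizedParser.loadModel(Program.ParserModel);
        if (!string.IsNullOrEmpty(fileName))
            DemoDP(lp, fileName);
        else
            DemoAPI(lp);
    }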
}

As a result of both samples you will see the following output:

Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\
stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... 
done [1.5 sec].
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT an) (JJ easy) (NN sentence)))
    (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), 
root(ROOT-0, sentence-4)]
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT another) (NN sentence)))
    (. .)))
nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)
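
Both samples write straight to the console. If you would rather collect the same output in a string (to show it in a text box or save it to a file), the sketch below is one way to do it. It assumes that the PrintWriter overload of TreePrint.printTree from the Java API is available in the recompiled assemblies; the ParseOutput class and its Describe method are hypothetical names used only for illustration.

```csharp
using java.io;
using edu.stanford.nlp.trees;

public static class ParseOutput
{
    // Hypothetical helper: renders a parse tree into a string instead of the console.
    // `format` takes the same values as the TreePrint constructor in the samples above.
    public static string Describe(Tree parse, string format = "penn,typedDependenciesCollapsed")
    {
        var buffer = new StringWriter();        // java.io.StringWriter
        var writer = new PrintWriter(buffer);   // java.io.PrintWriter wrapping the buffer

        new TreePrint(format).printTree(parse, writer);
        writer.flush();                         // make sure everything reaches the buffer

        return buffer.toString();
    }
}
```

When only the bracketed tree is needed, parse.pennString() returns it directly as a string.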

55 thoughts on “Stanford Parser is available on NuGet for F# and C#”

  1. First, this is very cool.

    I have followed the steps above and am having a problem.

    My code is in C#. I took your code above and found that the following two lines have compile time problems:

    var sent2Reader = new StringReader(Sent2);
    var rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();

    The getTokenizer function can’t take a .NET System.IO.StringReader. It wants a java.io.Reader.

    I decided to comment this out and use the default parser which works great.

    You might want to update your sample…

    Best,

    Peter

  2. First of all: thank you so much,

    Second, I am trying to run the C# parser demo. However, when I run it, it needs an argument, which I think is a file name. I could not figure out which file is needed, since your example is done with the sentence “This”, “is”, “an”, “easy”, “sentence”, “.”. Can you tell me what args are needed in Program.cs for the parser demo?

    1. Ok, so I was able to run the demo using the following command :

      StanfordParser.Csharp.Samples 1 englishPCFG.ser.gz

      and I needed to copy englishPCFG.ser.gz to the exe location. However, the result looked like an infinite parse tree. Here is a part of it:

      [number(1r-2, Q-1), num(~-27, 1r-2), amod(~-27, sq-3), amod(~-27, ~-4), amod(~-2
      7, -LSB–5), amod(~-27, su-6), nn(~-11, \-8), nn(~-11, u-9), nn(~-11, blsq-10),
      prep_s(su-6, ~-11), num(sq-21, 2-12), number(2-14, 2-13), num(sq-21, 2-14), amod
      (sq-21, 1r-15), nn(sq-21, sq-16), nn(sq-21, ~-17), num(sq-21, 2-18), number(2-20
      , 2-19), num(sq-21, 2-20), dep(~-11, sq-21), number(%-23, ~-22), dep(sq-21, %-23
      ), cc(%-23, &-24), nn(~-27, %-25), nn(~-27, E?sq-26), nsubj(sq-32, ~-27), partmo
      d(~-27, sq-28), amod(l-30, ~-29), dobj(sq-28, l-30), nsubj(sq-32, l-31), root(RO
      OT-0, sq-32), nn(Asq-43, ~-33), num(Asq-43, 2-34), num(Asq-43, 2-35), num(Asq-43
      , 2-36), num(Asq-43, sq-37), num(Asq-43, ~-38), num(Asq-43, 0v-39), num(Asq-43,
      0-40), num(Asq-43, 0F-41), nn(Asq-43, ?-42), dobj(sq-32, Asq-43), partmod(Asq-43
      , ~-44), dobj(~-44, ?-45)]

      (ROOT
      (S
      (NP (JJ 1r) (NN sq))
      (VP (SYM ~)
      (NP ($ $) (CD -LRB-)))
      (. !)))

      [amod(sq-2, 1r-1), nsubj($-4, sq-2), dep($-4, ~-3), root(ROOT-0, $-4)]

      (ROOT
      (S
      (NP
      (NP (NNP sq ~ ♫’?? ‘? ‘? ♣???▬?sq ~ ♫2??? ? 2? 2?? sq ~ ♫[s@ s
      \ @??~sq ~ ♫/??) (-LRB- -LRB-) (NNP /))
      (NP (NNP /) (-LRB- -LRB-) (NNP sq)))
      (VP (VBZ ~)
      (NP
      (NP ($ $) (CD Hq))
      (: :)
      (NP
      (NP ($ $) (CD p))
      (NP ($ $) (CD l)))
      (: :)))
      (. !)))

      [nn(/-3, sq ~ ♫’?? ‘? ‘? ♣???▬?sq ~ ♫2??? ? 2? 2?? sq ~ ♫[s@ s \ @?
      ?~sq ~ ♫/??-1), nsubj(~-7, /-3), nn(sq-6, /-4), dep(/-3, sq-6), root(ROOT-0, ~-7
      ), dobj(~-7, $-8), num($-8, Hq-9), dep($-8, $-11), num($-11, p-12), dep($-11, $-
      13), num($-13, l-14)]

      (ROOT
      (FRAG
      (NP
      (NP (NNP \))
      (NP (NNP sq) (NNP ~)
      (PRN (: /)
      (NP (NNP O))
      (: /))
      (NNP O)))
      (: /)
      (SINV
      (ADVP (RB sq))
      (VP (VBD ~)
      (NP
      (NP (CD ,0)
      (ADJP
      (QP (CD 4) (CD 7)))
      (JJ sq) (JJ ~) (JJ 1r) (NN sq) (NNS ~))
      (X (SYM *)))
      (: :)
      (S
      (NP (DT A)
      (S
      (S
      (X
      (X (SYM *))
      (NP (CD 8)))
      (X (SYM *))
      (NP (DT A) (NN sq) (NN ~))
      (VP (VBP sq)
      (NP
      (NP (NNP ~) (POS ‘))
      (NP (NNP C) (NNP d) (POS ‘))
      (” ‘) (NNS sq))
      (S
      (VP (VBG ~)
      (NP
      (NP
      (NP
      (NP
      (NP (NN h) (NN h))
      (NP
      (NP (NNP J) (NNP Jsq) (NNP ~) (POS ‘))
      (NNP C) (NNP d) (” ‘)))
      (POS ‘))
      (NNP ?) (NNP Tsq) (NNP ~))
      (X (SYM *)))))))
      (: :)
      (S
      (NP (PRP I))
      (VP (VBG *)
      (NP (CD 8))
      (X (SYM *))))))
      (NP (PRP I))))
      (NP (JJ sq ~ ♫0↔?d /? 02 d?ds:sq ~ ♫▲z?? ▲? ▲d ? sq ~ ♫!`?a ?
      !` !a?@??sq ~ ♫☼9▼p ☼▲ ☼6 ☺p???Qsq ~ ♫☼r?? ☼? ☼} ??1r↑sq ~ ♫♦▼?← ? ♦
      ▼ ♥←?-P?sq ~ ♫/??? /? /? ??☻↕sq ~ ♫ ☺☻ ☺? ☺? ☺??←?sq ~ ♫↓?u? ☻l ↓? ↓
      ???Tsq ~ ♫1L~? 0~ 1) (NNP |) (NNP sq) (NNP ~) (NNP _) (NNP sq) (NNP ~) (NNP -R
      SB-) (NNP -RSB-) (NNP sq) (NNP ~) (NNP ?) (NNP GC) (NNP sq) (NNP ~) (NNP X) (NNP
      X)))
      (. ?)))

      [root(ROOT-0, \-1), nn(O-7, sq-2), nn(O-7, ~-3), punct(O-5, /-4), dep(O-7, O-5),
      punct(O-5, /-6), dep(\-1, O-7), punct(\-1, /-8), advmod(~-10, sq-9), dep(\-1, ~
      -10), num(~-18, ,0-11), number(7-13, 4-12), num(~-18, 7-13), amod(~-18, sq-14),
      amod(~-18, ~-15), amod(~-18, 1r-16), nn(~-18, sq-17), dobj(~-10, ~-18), dep(~-18
      , *-19), nsubj(I-56, A-21), dep(8-23, *-22), dep(sq-28, 8-23), dep(sq-28, *-24),
      det(~-27, A-25), nn(~-27, sq-26), nsubj(sq-28, ~-27), dep(A-21, sq-28), poss(sq
      -35, ~-29), nn(d-32, C-31), poss(sq-35, d-32), dobj(sq-28, sq-35), iobj(sq-28, s
      q-35), xcomp(sq-28, ~-36), nn(h-38, h-37), poss(~-49, h-38), nn(~-41, J-39), nn(
      ~-41, Jsq-40), poss(d-44, ~-41), nn(d-44, C-43), dep(h-38, d-44), nn(~-49, ?-47)
      , nn(~-49, Tsq-48), dobj(~-36, ~-49), dep(~-49, *-50), nsubj(*-53, I-52), parata
      xis(sq-28, *-53), dobj(*-53, 8-54), dep(*-53, *-55), parataxis(~-10, I-56), xcom
      p(~-10, I-56), amod(X-73, sq ~ ♫0↔?d /? 02 d?ds:sq ~ ♫▲z?? ▲? ▲d ? sq
      ~ ♫!`?a ? !` !a?@??sq ~ ♫☼9▼p ☼▲ ☼6 ☺p???Qsq ~ ♫☼r?? ☼? ☼} ??1r↑sq ~
      ♫♦▼?← ? ♦▼ ♥←?-P?sq ~ ♫/??? /? /? ??☻↕sq ~ ♫ ☺☻ ☺? ☺? ☺??←?sq ~ ♫↓?u
      ? ☻l ↓? ↓???Tsq ~ ♫1L~? 0~ 1-57), nn(X-73, |-58), nn(X-73, sq-59), nn(X-73,
      ~-60), nn(X-73, _-61), nn(X-73, sq-62), nn(X-73, ~-63), nn(X-73, -RSB–64), nn(
      X-73, -RSB–65), nn(X-73, sq-66), nn(X-73, ~-67), nn(X-73, ?-68), nn(X-73, GC-69
      ), nn(X-73, sq-70), nn(X-73, ~-71), nn(X-73, X-72), nsubj(~-10, X-73)]

      1. thanks,

        I did fix it a long time ago. I don’t remember what the problem was, but I remember that it was a very small thing. This project helped me a lot in my ongoing research.

        Now the only problem that I have is that it takes a very long time to parse compared to the Java version. I am running it in a Windows application, not in a console application. But that should not cause any additional overhead, should it?

      2. Yes, it is slower than the Java version. I think that is a question for IKVM.NET. I have seen slowdowns of up to 2x compared with the Java version. Sometimes you can optimize your program to make it faster (by splitting the text into sentences, for example), but it will still be slower than the same code executed on the JVM.

  3. I am trying the parser demo code. The problem I have is that there is an error on line 3, on ParserModel. How can I handle this?

    1 public static void Start(string fileName)
    2 {
    3 var lp =LexicalizedParser.loadModel(Program.ParserModel);
    4 if (!String.IsNullOrEmpty(fileName))
    5 DemoDP(lp, fileName);
    6 else
    DemoAPI(lp);
    }

      1. Please, what change do I need to make to parse an Arabic sentence?

        Also, I have an error in this line: var rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();

  4. I want to send the output trees and dependencies to a text box or text file instead of the console window. After studying the code, I found that to print the trees I need to edit parse.pennPrint(); and tp.printTree(parse);, which took me to the edu.stanford.nlp.trees namespace. Where can I find the rest of the code?

  5. I’m using the Stanford Dependency Parser to resolve dependencies in one of my projects.
    When I analyze dependencies in a review text, it works great when the sentence is short, but for long sentences it does not give all the required dependencies. For example, when I try to find the dependencies in the following sentence,
    “The Navigation is better.” there is an nsubj dependency that groups “Navigation” and “better”, telling me that the review regarding navigation is positive.

    But when the review sentence is longer, like
    “Navigation system is better then the Jeeps and as good as my husbands Audi A-8 system.”

    I don’t get any dependency relations grouping Navigation with better and Navigation with good. I tried all the dependency methods available in Stanford.NLP.NET. I went through the Stanford Dependencies Manual, but couldn’t find much that helps here. I just want whatever aspect the user is talking about to be grouped with its adjective and adverb.

    1. I used .typedDependenciesCCprocessed(true), .typedDependenciesCollapsed(true), .typedDependencies(true), .typedDependenciesCollapsedTree() and .allTypedDependencies().

  6. I am facing the same problem, for which you suggested using the model path on the local machine:
    “Change Program.ParserModel to the correct path to model file on your machine.”
    Can you please share the path of the model file (englishPCFG.ser.gz)?

    1. I have downloaded the model file, but now I am getting the following exception:
      Source: stanford-corenlp-3.3.1
      Message = “englishPCFG.ser.gz: expecting BEGIN block;

      at edu.stanford.nlp.parser.lexparser.LexicalizedParser.confirmBeginBlock(String A_0, String A_1)
      at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTextFile(String textFileOrUrl, Options op)
      at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromFile(String parserFileOrUrl, Options op)
      at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(String parserFileOrUrl, Options op, String[] extraFlags)
      at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(String parserFileOrUrl, String[] extraFlags)
      at ConsoleApplication1.Program.Main(String[] args) in G:\ThesisRND\ConsoleApplication1\ConsoleApplication1\Program.cs:line 57
      at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
      at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
      at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
      at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
      at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
      at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
      at System.Threading.ThreadHelper.ThreadStart()

      kindly help me to fix this issue…
      Thanking you in advance ….

  7. I would like to ask whether it supports the Arabic language or not. If not, can you recommend one, please?

  8. Another question, if you don’t mind:

    have you made any comparisons between the Stanford Parser and any other parser, to decide
    which one suits our needs?

  9. I’m using Stanford.NLP.NET, installed as an IKVM NuGet package, in my current C# project, where I’m extracting PoS tags from the dependency tree. For some reasons I want to aggregate the various types of noun, adjective, verb and adverb tag labels.

    For example,

    “n” label for all noun types

    NN Noun, singular or mass

    NNS Noun, plural

    NNP Proper noun, singular

    NNPS Proper noun, plural

    “a” label for all adjective types

    JJ Adjective

    JJR Adjective, comparative

    JJS Adjective, superlative

    “r” label for all adverb types

    RB Adverb

    RBR Adverb, comparative

    RBS Adverb, superlative

    “v” label for all verb types

    VBD Verb, past tense

    VBG Verb, gerund or present participle

    VBN Verb, past participle

    VBP Verb, non-3rd person singular present

    VBZ Verb, 3rd person singular present

    Where and what change should I make?
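
    A minimal sketch of one way to do this on the caller’s side, without changing the parser: each group in the list above shares a two-letter prefix, so the Penn Treebank tag can be mapped by prefix. PosTags and ToCoarseLabel are hypothetical names, and the sketch assumes you already have the tag as a string.

```csharp
using System;

public static class PosTags
{
    // Hypothetical helper: maps a Penn Treebank tag to a coarse label.
    // Returns null for tags outside the four groups (for example DT or IN).
    // Note that the "VB" prefix also covers the base-form tag VB itself.
    public static string ToCoarseLabel(string pennTag)
    {
        if (string.IsNullOrEmpty(pennTag)) return null;
        if (pennTag.StartsWith("NN", StringComparison.Ordinal)) return "n"; // NN, NNS, NNP, NNPS
        if (pennTag.StartsWith("JJ", StringComparison.Ordinal)) return "a"; // JJ, JJR, JJS
        if (pennTag.StartsWith("RB", StringComparison.Ordinal)) return "r"; // RB, RBR, RBS
        if (pennTag.StartsWith("VB", StringComparison.Ordinal)) return "v"; // VB, VBD, VBG, VBN, VBP, VBZ
        return null;
    }
}
```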

      1. OK sir. I have another problem. There is

        "foreach (List sentence in new DocumentPreprocessor(clfile))"
        in the DemoDP function of the C# sample.

        I want to remove certain elements of List sentence; for that I’m using

        sentence.remove("-LSB-, ASPECT, -RSB-,");
        but it’s not working. What kind of list is this “List sentence”?

      2. Yes, it is, but I don’t know why sentence.remove("something") is not working. What is the data type of the elements of the list that is returned by DocumentPreprocessor?

    1. Sorry, I do not understand your question. Sure, all stanford nlp java classes were recompiled to .net, at least you received them from parser.

  10. Sir, I’m trying to get a sub-tree starting with a certain specific word. I have written the following code:

    TregexPattern tgrepPattern = TregexPattern.compile("steering");
    TregexMatcher m1 = tgrepPattern.matcher(parse);
    while (m1.find())
    {
        Tree subtree = m1.getMatch();
    }

    where I’m trying to get only the sub-tree of the word “steering”, whose original tree is as follows:
    (ROOT [179.075]
    (S [178.923]
    (S [28.434]
    (NP [12.947] (NN handling))
    (VP [14.932] (VBZ is)
    (ADJP [10.053] (JJ incredible))))
    (CC and)
    (S [144.858]
    (NP [22.872] (NN **steering**) (NN response))
    (VP [121.432] (VBZ is)
    (ADJP [116.113] (JJ nice)
    (SBAR [105.697]
    (S [105.297]
    (S [70.940]
    (NP [15.377] (NNP ))
    (VP [55.008] (MD Can)
    (VP [50.440] (VB connect)
    (NP [14.432] (NN iPod))
    (PP [23.852] (IN into)
    (NP [19.388] (JJ stereo) (NN system))))))
    (CC and)
    (S [29.339]
    (NP [13.820] (NN stereo))
    (VP [14.964] (VBZ is)
    (ADJP [10.085] (JJ awesome)))))))))
    (. .)))

    But when I debug, subtree only shows the one word “steering”, and the same single word is generated as the tree. What am I missing?
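
    A likely explanation, offered as a sketch rather than a definitive answer: in Tregex a bare word such as steering matches the node whose label is that word, which is the leaf itself, so getMatch() returns a one-word tree. To get the enclosing phrase, the pattern has to name the phrase node and relate it to the word, for example with the dominance operator <<. SubtreeFinder and FindPhrase are hypothetical names, and the sketch assumes the word is a plain alphanumeric token that needs no escaping in the pattern.

```csharp
using edu.stanford.nlp.trees;
using edu.stanford.nlp.trees.tregex;

public static class SubtreeFinder
{
    // Hypothetical helper: returns the first NP that dominates `word`, or null if there is none.
    public static Tree FindPhrase(Tree parse, string word)
    {
        // "NP << word" reads: an NP node that dominates, at any depth, a node labelled `word`.
        TregexPattern pattern = TregexPattern.compile("NP << " + word);
        TregexMatcher matcher = pattern.matcher(parse);
        return matcher.find() ? matcher.getMatch() : null;
    }
}
```

    For the tree above, SubtreeFinder.FindPhrase(parse, "steering") should return the NP that contains “steering” and “response” rather than the single leaf.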

  11. Hello, I’m trying to get the code below to work, and it generates an “edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model” error when instantiating (new StanfordCoreNLP(props)).

    public static string TestMe()
    {
        string text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.setProperty("sutime.binders", "0");

        StanfordCoreNLP standfordCoreNLP = new StanfordCoreNLP(props); //Need to add pointer to model files.

        //annotate
        Annotation annotation = new Annotation(text);
        standfordCoreNLP.annotate(annotation);

        //output result
        return standfordCoreNLP.toString();
    }

    I unzipped the stanford-parser-3.2.0-models.jar file to the project folder. What might I have missed? Thanks.

  12. I want to convert the following foreach to a Parallel.ForEach; it’s from your code. Will that be possible?

    foreach (List sentence in new DocumentPreprocessor(fileName))
    {
    //some processing
    }

      1. OK, I did it; I had to convert the Java lists to an array of C# lists for the parallel foreach. It’s now taking about 40 minutes for 10 MB of data, against 70 minutes earlier. I think loading and splitting the documents into sentences with DocumentPreprocessor is taking a lot of the time. It would be great if that could be reduced somehow.

      2. As you suggested, I did a performance analysis, and it was not DocumentPreprocessor. Please find it in the image below. Can you suggest some way to improve performance?

        [IMG]http://i59.tinypic.com/wkoh0y.jpg[/IMG]

      1. I mean algorithms like association rules or naive Bayes: are they implemented or not?
