NLP: Stanford POS Tagger with F# (.NET)
08/02/2013
Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.
All code samples from this post are available on GitHub.
Continuing the theme of porting Stanford NLP libraries to .NET, I am glad to introduce one more library: the Stanford Log-linear Part-Of-Speech Tagger.
Compiling stanford-postagger.jar to a .NET assembly requires nothing special: just follow the steps from my previous post “NLP: Stanford Parser with F# (.NET)”. You can also download an already compiled version from GitHub.
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens). Computational applications generally use more fine-grained POS tags like ‘noun-plural’.
Read more about Part-of-speech tagging on Wikipedia.
I was really surprised by the performance of the .NET version of the Stanford POS Tagger. It is fast enough! If you do not need advanced syntactic dependencies between words and part-of-speech information is enough, do not use the Stanford Parser; the Stanford POS Tagger is just what you need.
    module TaggerDemo

    open java.io
    open java.util
    open edu.stanford.nlp.ling
    open edu.stanford.nlp.tagger.maxent
    open IKVM.FSharp

    // Path to the tagger model, relative to the build output directory.
    let model = @"..\..\..\..\StanfordNLPLibraries\stanford-postagger\models\wsj-0-18-left3words.tagger"

    // Tokenizes the reader's text into sentences, tags each one and prints it.
    let tagReader (reader:Reader) =
        let tagger = MaxentTagger(model)
        MaxentTagger.tokenizeText(reader).iterator()
        |> Collections.toSeq
        |> Seq.iter (fun sentence ->
            let tSentence = tagger.tagSentence(sentence :?> List)
            printfn "%O" (Sentence.listToString(tSentence, false))
        )

    // Tags the contents of a text file.
    let tagFile (fileName:string) =
        tagReader (new BufferedReader(new FileReader(fileName)))

    // Tags an in-memory string.
    let tagText (text:string) =
        tagReader (new StringReader(text))
As you can see, it is really simple to use. We instantiate MaxentTagger and initialize it with the wsj-0-18-left3words.tagger model. After that, we load the text, tokenize it into sentences, and tag the sentences one by one.
Let’s test the tagger on the F# Software Foundation Mission Statement =).
The mission of the F# Software Foundation is to promote, protect, and advance the F# programming language, and to support and facilitate the growth of a diverse and international community of F# programmers.
Mission/NNP Statement/NNP The/NNP mission/NN of/IN the/DT F/NN #/# Software/NNP Foundation/NNP is/VBZ to/TO promote/VB ,/, protect/VB ,/, and/CC advance/NN the/DT F/NN #/# programming/VBG language/NN ,/, and/CC to/TO support/VB and/CC facilitate/VB the/DT growth/NN of/IN a/DT diverse/JJ and/CC international/JJ community/NN of/IN F/NN #/# programmers/NNS ./.
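If you need the tags as data rather than as printed text, the word/TAG format above is easy to split back apart. Here is a minimal sketch in plain F# (no Stanford types involved). Note that parseTagged is a hypothetical helper name, not part of the tagger API, and the sketch assumes whitespace-separated tokens with "/" as the separator between word and tag.

```fsharp
// Hypothetical helper (not part of the Stanford API): splits tagger output
// such as "mission/NN of/IN" into (word, tag) pairs.
// Assumes tokens are separated by spaces and the tag follows the LAST '/'.
let parseTagged (line: string) =
    line.Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun token ->
        match token.LastIndexOf('/') with
        | i when i > 0 -> Some (token.Substring(0, i), token.Substring(i + 1))
        | _ -> None)  // skip tokens that have no word before a separator

// Example: collect all words tagged as nouns (tags starting with "NN").
let nouns =
    parseTagged "the/DT growth/NN of/IN a/DT diverse/JJ community/NN"
    |> Array.filter (fun (_, tag) -> tag.StartsWith("NN"))
    |> Array.map fst
```

Splitting on the last "/" rather than the first keeps the pairing intact if a word itself contains a slash.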
You can find descriptions of the POS tags here.