NLP: Stanford Named Entity Recognizer with F# (.NET)

Update (January 3, 2014): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

All code samples from this post are available on GitHub.

Samples for one more Stanford NLP library have been ported to .NET: the Stanford Named Entity Recognizer (NER).

To compile stanford-ner.jar to a .NET assembly, follow the steps from my post “NLP: Stanford Parser with F# (.NET)”. You can also download an already compiled version from GitHub.

What is Stanford Named Entity Recognizer (NER)?

Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. The software provides a general (arbitrary order) implementation of linear chain Conditional Random Field (CRF) sequence models, coupled with well-engineered feature extractors for Named Entity Recognition. (CRF models were pioneered by Lafferty, McCallum, and Pereira (2001); see Sutton and McCallum (2006) for a better introduction.) Included with the download are good 3 class (PERSON, ORGANIZATION, LOCATION) named entity recognizers for English (in versions with and without additional distributional similarity features) and another pair of models trained on the CoNLL 2003 English training data. The distributional similarity features improve performance but the models require considerably more memory.

Read more about Named-entity recognition on Wikipedia.

Let’s play!

Again, the code is straightforward and easy to read. It looks procedural, with some extra noise from type casting, which comes from the nature of the Java runtime.

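open edu.stanford.nlp.ie.crf
open edu.stanford.nlp.ling

open java.util
open System.IO
open IKVM.FSharp

let main file =
    // Load a serialized CRF model. The path is a placeholder (assumption):
    // point it at a classifier file from your Stanford NER download.
    let classifier =
        CRFClassifier.getClassifierNoExceptions(
            @"classifiers\english.all.3class.distsim.crf.ser.gz")
    match file with
    | Some(fileName) ->
        // Classify a whole file and print each token as word/LABEL,
        // one sentence per line.
        let fileContents = File.ReadAllText(fileName)
        classifier.classify(fileContents)
        |> Collections.toSeq
        |> Seq.cast<java.util.List>
        |> Seq.iter (fun sentence ->
            sentence
            |> Collections.toSeq
            |> Seq.cast<CoreLabel>
            |> Seq.iter (fun word ->
                // Assumption: the predicted label is stored under AnswerAnnotation.
                printf "%s/%O "
                    (word.word())
                    (word.get(CoreAnnotations.AnswerAnnotation().getClass())))
            printfn "")
    | None ->
        // No file given: run the classifier on two built-in sample sentences.
        let s1 = "Good afternoon Rajat Raina, how are you today?"
        let s2 = "I go to school at Stanford University, which is located in California."
        printfn "%s\n" (classifier.classifyToString(s1))
        printfn "%s\n" (classifier.classifyWithInlineXML(s2))
        printfn "%s\n" (classifier.classifyToString(s2, "xml", true))
        classifier.classify(s2)
        |> Collections.toSeq
        |> Seq.iteri (fun i coreLabel ->
            printfn "%d\n:%O\n" i coreLabel)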

Let’s test NER on the text from Don Syme’s Wikipedia page =).

Don Syme is an Australian computer scientist and a Principal Researcher at Microsoft Research, Cambridge, U.K. He is the designer and architect of the F# programming language, described by a reporter as being regarded as “the most original new face in computer languages since Bjarne Stroustrup developed C++ in the early 1980s.

Earlier, Syme created generics in the .NET Common Language Runtime, including the initial design of generics for the C# programming language, along with others including Andrew Kennedy and later Anders Hejlsberg. Kennedy, Syme and Yu also formalized this widely used system.

He holds a Ph.D. from the University of Cambridge, and is a member of the WG2.8 working group on functional programming. He is a co-author of the book Expert F# 2.0.

In the past he also worked on formal specification, interactive proof, automated verification and proof description languages.

Named-entity recognition result:

Don/PERSON Syme/PERSON is/O an/O Australian/O computer/O scientist/O and/O a/O Principal/O Researcher/O at/O Microsoft/ORGANIZATION Research/ORGANIZATION ,/O Cambridge/LOCATION ,/O U.K./LOCATION ./O He/O is/O the/O designer/O and/O architect/O of/O the/O F/O #/O programming/O language/O ,/O described/O by/O a/O reporter/O as/O being/O regarded/O as/O “/O the/O most/O original/O new/O face/O in/O computer/O languages/O since/O Bjarne/PERSON Stroustrup/PERSON developed/O C/O +/O +/O in/O the/O early/O 1980s/O ./O

Earlier/O ,/O Syme/PERSON created/O generics/O in/O the/O ./O NET/O Common/O Language/O Runtime/O ,/O including/O the/O initial/O design/O of/O generics/O for/O the/O C/O #/O programming/O language/O ,/O along/O with/O others/O including/O Andrew/PERSON Kennedy/PERSON and/O later/O Anders/PERSON Hejlsberg/PERSON ./O Kennedy/PERSON ,/O Syme/PERSON and/O Yu/PERSON also/O formalized/O this/O widely/O used/O system/O ./O

He/O holds/O a/O Ph.D./O from/O the/O University/ORGANIZATION of/ORGANIZATION Cambridge/ORGANIZATION ,/O and/O is/O a/O member/O of/O the/O WG2/O .8/O working/O group/O on/O functional/O programming/O ./O He/O is/O a/O co-author/O of/O the/O book/O Expert/O F/O #/O 2.0/O ./O

In/O the/O past/O he/O also/O worked/O on/O formal/O specification/O ,/O interactive/O proof/O ,/O automated/O verification/O and/O proof/O description/O languages/O ./O
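
The word/LABEL output above is easy to post-process. As a rough sketch (not part of the original samples), the hypothetical helpers parseToken and groupEntities below merge consecutive tokens that share a label into a single entity. The code is plain F# and does not depend on the Stanford assemblies. It assumes tokens are separated by spaces and that "O" marks a token outside any entity, as in the result shown above.

```fsharp
open System

// Hypothetical helper: split "word/LABEL" at the last '/',
// so a word that itself contains '/' still parses.
let parseToken (token: string) =
    let i = token.LastIndexOf('/')
    if i < 0 then token, "O"
    else token.Substring(0, i), token.Substring(i + 1)

// Hypothetical helper: merge runs of tokens with the same non-"O" label.
// Returns (entity text, label) pairs in order of appearance.
let groupEntities (tagged: string) =
    // Push the entity being built (if any) onto the accumulator.
    let flush current acc =
        match current with
        | Some (words, label) -> (String.concat " " (List.rev words), label) :: acc
        | None -> acc
    let step (acc, current) (word, label) =
        match current with
        | Some (words, l) when l = label -> acc, Some (word :: words, l)
        | _ ->
            let acc = flush current acc
            if label = "O" then acc, None else acc, Some ([ word ], label)
    let acc, current =
        tagged.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.map parseToken
        |> Array.fold step ([], None)
    flush current acc |> List.rev

// Usage on the start of the result above.
let sample =
    "Don/PERSON Syme/PERSON is/O a/O Principal/O Researcher/O at/O "
    + "Microsoft/ORGANIZATION Research/ORGANIZATION ,/O Cambridge/LOCATION ,/O U.K./LOCATION ./O"

groupEntities sample
|> List.iter (fun (text, label) -> printfn "%s: %s" label text)
```

One limitation of this sketch: the labels carry no begin/inside markers, so two different entities of the same type that sit directly next to each other would be merged into one.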
