FSharp.NLP.Stanford.Parser justification or StackOverflow questions understanding.

Some weeks ago, I announced FSharp.NLP.Stanford.Parser and now I want to clarify the goals of this project and show an example of usage.

First of all, this is not an attempt to re-implement any functionality of the Stanford Parser. It is just a thin layer that aims to simplify interaction with Java collections (especially the Iterable interface) and bring the power of F# constructs (like pattern matching and discriminated unions) to the code that deals with tagging results.

Task

Let’s start with a sample NLP task: we want to show related questions before a user asks a new one (as StackOverflow does). There are many possible solutions to this task. Let’s look at one that first tries to identify the key phrases that characterize the question and then runs a search using them.

Approach

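First of all, let’s choose some real questions from StackOverflow to analyze:

“How to make an F# project work with the object browser”
“How can I build WebSharper on Mono 3.0 on Mac?”
“Adding extra methods as type extensions in F#”
“How to get MonoDevelop to compile F# projects?”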

Now we can use the Stanford Parser GUI to visualize the structure of these questions.

The first question is about “F# project” and “object browser”.

The second is about “WebSharper”, “Mono 3.0” and “Mac”.

The third is about “extra methods”, “type extensions” and “F#”.

The last one is about “MonoDevelop” and “F# projects”.

Notice that all the phrases we have selected are parts of noun phrases (NP). As a first solution, we can analyze the tags in the tree and select every NP that has a child with a word-level noun tag (NN, NNS, NNP or NNPS).
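The selection rule above can be tried out without the parser at all. The following is a minimal sketch that assumes a hypothetical toy tree type (`Node`, `tagged` and `sample` are illustrative names, not part of Stanford Parser or FSharp.NLP.Stanford.Parser); it only shows the idea of keeping NP nodes that have a noun-tagged child.

```fsharp
// Toy constituency tree: a hypothetical stand-in for edu.stanford.nlp.trees.Tree.
type Node =
    | Leaf of string
    | Branch of string * Node list

let nounTags = set [ "NN"; "NNS"; "NNP"; "NNPS" ]

// True when the node itself carries a word-level noun tag.
let isNounTagged node =
    match node with
    | Branch (tag, _) -> Set.contains tag nounTags
    | Leaf _ -> false

// True when the node is an NP with at least one direct noun-tagged child.
let isNPwithNoun node =
    match node with
    | Branch ("NP", children) -> List.exists isNounTagged children
    | _ -> false

// Words under a node, left to right.
let rec words node =
    match node with
    | Leaf w -> [ w ]
    | Branch (_, children) -> List.collect words children

// Every qualifying NP as a list of words; nested phrases come before their parent.
let rec keyPhrases node =
    let here = if isNPwithNoun node then [ words node ] else []
    match node with
    | Leaf _ -> here
    | Branch (_, children) -> List.collect keyPhrases children @ here

// (PP (IN with) (NP (DT the) (NN object) (NN browser)))
let tagged tag word = Branch (tag, [ Leaf word ])
let np = Branch ("NP", [ tagged "DT" "the"; tagged "NN" "object"; tagged "NN" "browser" ])
let sample = Branch ("PP", [ tagged "IN" "with"; np ])

keyPhrases sample |> printfn "%A" // prints [["the"; "object"; "browser"]]
```

The real script below follows the same shape, but walks the parser's own Tree objects instead of this toy type.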

Solution

#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"

open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System

let model = @"d:\englishPCFG.ser.gz";

let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)

let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();

open java.util

// Converts a Java Iterator into an F# sequence.
// hasNext() is checked before every next(), so an empty iterator yields an empty sequence.
let toSeq (iter:Iterator) =
    seq {
        while iter.hasNext() do
            yield iter.next()
    }

let getTree question = 
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)

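// Collects every NP node that has at least one direct child tagged NN, NNS, NNP or NNPS.
// Matches are consed onto the accumulator, so the caller reverses the list to restore traversal order.
let getKeyPhrases (tree:Tree) = 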
    let isNPwithNNx (node:Tree)= 
        if (node.label().value() <> "NP") then false
        else node.getChildrenAsList().iterator()
             |> toSeq 
             |> Seq.cast<Tree>
             |> Seq.exists (fun x-> 
                let y = x.label().value()
                y= "NN" || y = "NNS" || y = "NNP" || y = "NNPS")
    let rec foldTree acc (node:Tree) = 
        let acc = 
            if (node.isLeaf()) then acc
            else node.getChildrenAsList().iterator()
                 |> toSeq 
                 |> Seq.cast<Tree>
                 |> Seq.fold 
                    (fun state x -> foldTree state x)
                    acc
        if isNPwithNNx node 
          then node :: acc
          else acc
    foldTree [] tree

let questions = 
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]

questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question 
    |> getTree 
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves().iterator() 
        |> toSeq 
        |> Seq.cast<Tree> 
        |> Seq.map(fun x-> x.label().value()) 
        |> Seq.toArray
        |> printfn "\t%A")
)

If you run this script, you will see the following:

Question : How to make an F# project work with the object browser
    [|"an"; "F"; "#"; "project"; "work"|]
    [|"the"; "object"; "browser"|]
Question : How can I build WebSharper on Mono 3.0 on Mac?
    [|"WebSharper"|]
    [|"Mono"; "3.0"|]
    [|"Mac"|]
Question : Adding extra methods as type extensions in F#
    [|"extra"; "methods"|]
    [|"type"; "extensions"|]
    [|"F"; "#"|]
Question : How to get MonoDevelop to compile F# projects?
    [|"MonoDevelop"|]
    [|"F"; "#"; "projects"|]

It is almost what we expected. The results are good enough, but we can simplify the code and make it more readable using FSharp.NLP.Stanford.Parser.

#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"
#r @"..\packages\FSharp.NLP.Stanford.Parser.0.0.3\lib\FSharp.NLP.Stanford.Parser.dll"

open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System
open FSharp.IKVM.Util
open FSharp.NLP.Stanford.Parser

let model = @"d:\englishPCFG.ser.gz";

let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)

let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();

let getTree question = 
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)

let getKeyPhrases (tree:Tree) = 
    let isNNx = function
        | Label NN | Label NNS | Label NNP | Label NNPS -> true
        | _ -> false
    let isNPwithNNx = function
        | Label NP as node 
            when node.getChildrenAsList() |> Iterable.castToSeq<Tree> |> Seq.exists isNNx
            -> true
        | _ -> false
    let rec foldTree acc (node:Tree) = 
        let acc = 
            if (node.isLeaf()) then acc
            else node.getChildrenAsList()
                 |> Iterable.castToSeq<Tree>
                 |> Seq.fold 
                    (fun state x -> foldTree state x)
                    acc
        if isNPwithNNx node 
          then node :: acc
          else acc
    foldTree [] tree

let questions = 
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]

questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question 
    |> getTree 
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves()
        |> Iterable.castToArray<Tree>
        |> Array.map(fun x-> x.label().value()) 
        |> printfn "\t%A")
)

Look more carefully at the getKeyPhrases function. All tags are strongly typed now, so the compiler can catch a misspelled tag name, and the code is more readable and self-explanatory.
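To make the `Label NN` style of matching less mysterious, here is a minimal sketch of how such a pattern could be built with a union type and a total active pattern. It is an assumption for illustration only: `Tag`, `ToyTree` and `parseTag` are hypothetical names, and the actual definitions inside FSharp.NLP.Stanford.Parser may look different.

```fsharp
// Hypothetical stand-ins; not the types shipped in FSharp.NLP.Stanford.Parser.
type Tag =
    | NN
    | NNS
    | NNP
    | NNPS
    | NP
    | Other of string

type ToyTree = { Value: string; Children: ToyTree list }

// Map the raw label text onto a union case; unknown labels are kept as Other.
let parseTag (value: string) =
    match value with
    | "NN" -> NN
    | "NNS" -> NNS
    | "NNP" -> NNP
    | "NNPS" -> NNPS
    | "NP" -> NP
    | other -> Other other

// Total active pattern: exposes a node's label as a Tag so it can be matched by case.
let (|Label|) (node: ToyTree) = parseTag node.Value

let isNNx (node: ToyTree) =
    match node with
    | Label NN | Label NNS | Label NNP | Label NNPS -> true
    | _ -> false

let isNPwithNNx (node: ToyTree) =
    match node with
    | Label NP when node.Children |> List.exists isNNx -> true
    | _ -> false

let leaf tag = { Value = tag; Children = [] }
let phrase = { Value = "NP"; Children = [ leaf "DT"; leaf "NN" ] }

printfn "%b" (isNPwithNNx phrase)      // prints true
printfn "%b" (isNPwithNNx (leaf "NP")) // prints false
```

The string comparison happens in exactly one place (`parseTag`), and every other function matches on union cases instead of string literals.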

8 thoughts on “FSharp.NLP.Stanford.Parser justification or StackOverflow questions understanding.”

Pingback: F# Weekly #29 2013 | Sergey Tihon's Blog
Stu says:

25/07/2013 at 07:58

Hi Sergey, Amazing work getting NLP as a Nuget service it’s so easy to use now. Can you help me get “function tags” working? eg I get (NP (NN yesterday)) for “yesterday” but I have seen some people get (NP-TMP (NN yesterday)) showing it’s temporal function. I am using this c# code

static LexicalizedParser lp = LexicalizedParser.loadModel("c:\\englishPCFG.ser.gz");
public static string Parse(string sent)
{
    CoreLabelTokenFactory cltf = new CoreLabelTokenFactory();
    TokenizerFactory tokenizerFactory = PTBTokenizer.factory(cltf, "");
    StringReader sent2Reader = new StringReader(sent);
    List rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();
    Tree parse = lp.apply(rawWords2);
    string output = parse.pennString();
    return output;
}

Thanks!

Sergey Tihon says:

25/07/2013 at 09:04

Hi, this is the temporal NP feature of the Stanford Parser. You need to call LexicalizedParser.loadModel with the “-retainTmpSubcategories” option (as my samples do).
More about this is here: http://nlp.stanford.edu/software/parser-faq.shtml#s

Ellen says:

03/09/2014 at 01:22

Hi Mr. Tihon,

I’m interested in the phrase chunking extension to Stanford parser in this article. Unfortunately, I’ve never programmed in F#, and I still have problem understanding lambda expression after going through some basic tutorial. Do you have the solution in C# by any chance?

Thanks,
Ellen

1. Sergey Tihon says:
  
  03/09/2014 at 12:21
  
  Hello, all C# samples available here – http://sergey-tihon.github.io/Stanford.NLP.NET/
  
  1. Ellen says:
    
    04/09/2014 at 00:57
    
    I’m looking specifically into the translation of the following lambda expression in C#:
    – toSeq (iter:iterator)
    – getKeyPhrases (tree: Tree)
    
    I did try to go through C# samples posted in GitHub to find the matching methods, but I failed to find them. Could you kindly link me to the exact location if the source is publicly available?
  2. Sergey Tihon says:
    
    04/09/2014 at 14:47
    
    Sorry, but I do not have C# equivalents for these methods.
    toSeq – converts Java iterator to .NET IEnumerable. It should be easy to rewrite it in C#.
    But it will be a bit harder to rewrite getKeyPhrases – it is so short and simple due to power of F#.

Published by Sergey Tihon 🦔🦀
