A few weeks ago I announced FSharp.NLP.Stanford.Parser. Now I want to clarify the goals of the project and show a usage example.
First of all, this is not an attempt to re-implement any of the Stanford Parser's functionality. It is just a thin layer that aims to simplify interaction with Java collections (especially the Iterable interface) and to bring the power of F# constructs (like pattern matching and discriminated unions) to code that deals with tagging results.
Task
Let’s start with a sample NLP task: we want to show related questions before a user asks a new one (as StackOverflow does). There are many possible solutions to this task. Let’s look at one that first tries to identify the key phrases that characterize the question and then runs a search using them.
Approach
First of all, let’s choose some real questions from StackOverflow to analyze:
- How to make an F# project work with the object browser
- How can I build WebSharper on Mono 3.0 on Mac?
- Adding extra methods as type extensions in F#
- How to get MonoDevelop to compile F# projects?
Now we can use Stanford Parser GUI to visualize the structure of these questions:




Notice that all the phrases we have selected are parts of noun phrases (NP). As a first solution, we can analyze the tags in the tree and select every NP that contains a word-level noun tag (NN, NNS, NNP or NNPS).
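Before touching the real parser, the selection rule can be sketched on a toy tree. This is only an illustration under stated assumptions: `Node`, `isNPwithNoun`, `leaves` and `keyPhrases` are hypothetical names, the tree is written by hand, and nothing here calls the Stanford Parser.

```fsharp
// Hypothetical stand-in for a parse tree (not the Stanford Parser Tree class).
type Node =
    | Leaf of string                 // a word
    | Branch of string * Node list   // a tag and its children

// Word-level noun tags named in the text above.
let nounTags = set [ "NN"; "NNS"; "NNP"; "NNPS" ]

// True when the node itself is tagged with one of the noun tags.
let isNounTag node =
    match node with
    | Branch (label, _) -> nounTags.Contains label
    | Leaf _ -> false

// True for an NP node that has at least one direct child with a noun tag.
let isNPwithNoun node =
    match node with
    | Branch ("NP", children) -> List.exists isNounTag children
    | _ -> false

// Words under a node, left to right.
let rec leaves node =
    match node with
    | Leaf word -> [ word ]
    | Branch (_, children) -> List.collect leaves children

// All matching NP nodes, outer nodes before the ones nested inside them.
let rec keyPhrases node =
    let nested =
        match node with
        | Leaf _ -> []
        | Branch (_, children) -> List.collect keyPhrases children
    if isNPwithNoun node then node :: nested else nested

// Hand-written tree for the fragment "build WebSharper on Mono".
let webSharper = Branch ("NP", [ Branch ("NNP", [ Leaf "WebSharper" ]) ])
let mono = Branch ("NP", [ Branch ("NNP", [ Leaf "Mono" ]) ])
let onMono = Branch ("PP", [ Branch ("IN", [ Leaf "on" ]); mono ])
let sample = Branch ("VP", [ Branch ("VB", [ Leaf "build" ]); webSharper; onMono ])

let phrases = sample |> keyPhrases |> List.map leaves
printfn "%A" phrases
```

The real solution follows the same shape, only the tree comes from the parser and its children are exposed through Java collections.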
Solution
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System
let model = @"d:\englishPCFG.ser.gz";
let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)
let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();
open java.util
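// Wraps a Java iterator into an F# sequence.
// hasNext() is checked before next(), so an empty iterator yields an empty sequence.
let toSeq (iter:Iterator) =
    let rec loop (x:Iterator) =
        seq {
            if x.hasNext() then
                yield x.next()
                yield! (loop x)
        }
    loop iter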
let getTree question =
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)
let getKeyPhrases (tree:Tree) =
    let isNPwithNNx (node:Tree)=
        if (node.label().value() <> "NP") then false
        else node.getChildrenAsList().iterator()
             |> toSeq
             |> Seq.cast<Tree>
             |> Seq.exists (fun x->
                 let y = x.label().value()
                 y= "NN" || y = "NNS" || y = "NNP" || y = "NNPS")
    let rec foldTree acc (node:Tree) =
        let acc =
            if (node.isLeaf()) then acc
            else node.getChildrenAsList().iterator()
                 |> toSeq
                 |> Seq.cast<Tree>
                 |> Seq.fold
                     (fun state x -> foldTree state x)
                     acc
        if isNPwithNNx node
            then node :: acc
            else acc
    foldTree [] tree
let questions =
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]
questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question
    |> getTree
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves().iterator()
        |> toSeq
        |> Seq.cast<Tree>
        |> Seq.map(fun x-> x.label().value())
        |> Seq.toArray
        |> printfn "\t%A")
)
If you run this script, you will see the following:
Question : How to make an F# project work with the object browser
    [|"an"; "F"; "#"; "project"; "work"|]
    [|"the"; "object"; "browser"|]
Question : How can I build WebSharper on Mono 3.0 on Mac?
    [|"WebSharper"|]
    [|"Mono"; "3.0"|]
    [|"Mac"|]
Question : Adding extra methods as type extensions in F#
    [|"extra"; "methods"|]
    [|"type"; "extensions"|]
    [|"F"; "#"|]
Question : How to get MonoDevelop to compile F# projects?
    [|"MonoDevelop"|]
    [|"F"; "#"; "projects"|]
This is almost what we expected. The results are good enough, but we can simplify the code and make it more readable using FSharp.NLP.Stanford.Parser.
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.Runtime.dll"
#r @"..\packages\IKVM.7.3.4830.0\lib\IKVM.OpenJDK.Core.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\ejml-0.19-nogui.dll"
#r @"..\packages\Stanford.NLP.Parser.3.2.0.0\lib\stanford-parser.dll"
#r @"..\packages\FSharp.NLP.Stanford.Parser.0.0.3\lib\FSharp.NLP.Stanford.Parser.dll"
open edu.stanford.nlp.parser.lexparser
open edu.stanford.nlp.trees
open System
open FSharp.IKVM.Util
open FSharp.NLP.Stanford.Parser
let model = @"d:\englishPCFG.ser.gz";
let options = [|"-maxLength"; "500";"-retainTmpSubcategories"; "-MAX_ITEMS"; "500000";"-outputFormat"; "penn,typedDependenciesCollapsed"|]
let lp = LexicalizedParser.loadModel(model, options)
let tlp = PennTreebankLanguagePack();
let gsf = tlp.grammaticalStructureFactory();
let getTree question =
    let toke = tlp.getTokenizerFactory().getTokenizer(new java.io.StringReader(question));
    let sentence = toke.tokenize();
    lp.apply(sentence)
let getKeyPhrases (tree:Tree) =
    let isNNx = function
        | Label NN | Label NNS | Label NNP | Label NNPS -> true
        | _ -> false
    let isNPwithNNx = function
        | Label NP as node
            when node.getChildrenAsList()
                 |> Iterable.castToSeq<Tree>
                 |> Seq.exists isNNx
            -> true
        | _ -> false
    let rec foldTree acc (node:Tree) =
        let acc =
            if (node.isLeaf()) then acc
            else node.getChildrenAsList()
                 |> Iterable.castToSeq<Tree>
                 |> Seq.fold
                     (fun state x -> foldTree state x)
                     acc
        if isNPwithNNx node
            then node :: acc
            else acc
    foldTree [] tree
let questions =
    [|"How to make an F# project work with the object browser";
      "How can I build WebSharper on Mono 3.0 on Mac?";
      "Adding extra methods as type extensions in F#";
      "How to get MonoDevelop to compile F# projects?"|]
questions
|> Seq.iter (fun question ->
    printfn "Question : %s" question
    question
    |> getTree
    |> getKeyPhrases
    |> List.rev
    |> List.iter (fun p ->
        p.getLeaves()
        |> Iterable.castToArray<Tree>
        |> Array.map(fun x-> x.label().value())
        |> printfn "\t%A")
)
Look more closely at the getKeyPhrases function. All tags are strongly typed now, so the compiler helps you catch typos in tag names, and the code is more readable and self-explanatory.
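To make the idea behind `Label NN` concrete, here is a minimal sketch of how a typed tag plus an active pattern could be put together. It is an assumption-laden illustration, not the library's actual implementation: the `Tag` cases, `parseTag`, the `TaggedNode` record and this `(|Label|)` definition are all hypothetical stand-ins.

```fsharp
// Hypothetical tag type; the real library may define different or additional cases.
type Tag =
    | NN | NNS | NNP | NNPS | NP
    | Other of string

// Maps the string label of a node to a typed tag (assumed helper).
let parseTag label =
    match label with
    | "NN" -> NN
    | "NNS" -> NNS
    | "NNP" -> NNP
    | "NNPS" -> NNPS
    | "NP" -> NP
    | other -> Other other

// Hypothetical stand-in for a parse tree node.
type TaggedNode = { Value: string; Children: TaggedNode list }

// Single-case active pattern: exposes a node's label as a typed Tag.
let (|Label|) (node: TaggedNode) = parseTag node.Value

// Pattern matching now works on union cases instead of raw strings.
let isNNx node =
    match node with
    | Label NN | Label NNS | Label NNP | Label NNPS -> true
    | _ -> false

let isNPwithNNx node =
    match node with
    | Label NP when List.exists isNNx node.Children -> true
    | _ -> false

// Hand-written node for the noun phrase "type extensions".
let np =
    { Value = "NP"
      Children =
        [ { Value = "NN"; Children = [] }
          { Value = "NNS"; Children = [] } ] }

printfn "%b" (isNPwithNNx np)
```

The design point is that string comparisons are confined to one place (`parseTag` in this sketch), while the rest of the code matches on union cases.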

There is one more tool that has become ready on NuGet today. It is a 
