NLP: Stanford POS Tagger with F# (.NET)

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

All code samples from this post are available on GitHub.

Continuing the theme of porting Stanford NLP libraries to .NET, I am glad to introduce one more library: the Stanford Log-linear Part-Of-Speech Tagger.

Compiling stanford-postagger.jar to a .NET assembly needs nothing special: just follow the steps from my previous post “NLP: Stanford Parser with F# (.NET)”. You can also download an already compiled version from GitHub.

What is Stanford POS Tagger?

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens), although computational applications generally use more fine-grained POS tags like ‘noun-plural’.

Read more about Part-of-speech tagging on Wikipedia.

Let’s play!

I was really surprised by the performance of the .NET version of Stanford POS Tagger. It is fast enough! If you do not need syntactic dependencies between words and part-of-speech information is enough, then do not use Stanford Parser: Stanford POS Tagger is just what you need.

module TaggerDemo

open java.io
open java.util

open edu.stanford.nlp.ling
open edu.stanford.nlp.tagger.maxent

open IKVM.FSharp

// Path to the trained English model that ships with the tagger
let model = @"..\..\..\..\StanfordNLPLibraries\stanford-postagger\models\wsj-0-18-left3words.tagger"

// Splits the text into sentences, tags each one and prints the result
let tagReader (reader:Reader) =
    let tagger = MaxentTagger(model)
    MaxentTagger.tokenizeText(reader).iterator()
    |> Collections.toSeq
    |> Seq.iter (fun sentence ->
        let tSentence = tagger.tagSentence(sentence :?> List)
        printfn "%O" (Sentence.listToString(tSentence, false))
        )

// Tags the contents of a text file
let tagFile (fileName:string) =
    tagReader (new BufferedReader(new FileReader(fileName)))

// Tags an in-memory string
let tagText (text:string) =
    tagReader (new StringReader(text))

As you can see, it is really simple to use. We instantiate MaxentTagger and initialize it with the wsj-0-18-left3words.tagger model. After that we load the text, split it into sentences and tag the sentences one by one.

Let’s test the tagger on the F# Software Foundation Mission Statement =).

Mission Statement

The mission of the F# Software Foundation is to promote, protect, and advance the F# programming language, and to support and facilitate the growth of a diverse and international community of F# programmers.

Tagging result:

Mission/NNP Statement/NNP 
The/NNP mission/NN of/IN the/DT F/NN #/# Software/NNP Foundation/NNP is/VBZ 
to/TO promote/VB ,/, protect/VB ,/, and/CC advance/NN the/DT F/NN #/# 
programming/VBG language/NN ,/, and/CC to/TO support/VB and/CC facilitate/VB 
the/DT growth/NN of/IN a/DT diverse/JJ and/CC international/JJ community/NN 
of/IN F/NN #/# programmers/NNS ./.

You can find descriptions of the POS tags here.
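
If you want to post-process the tagging result rather than just print it, the word/TAG tokens are easy to split apart again. Here is a minimal sketch in plain F# (it needs no Stanford or IKVM references). The names parseTagged and nouns are hypothetical helpers, not part of the tagger API, and the sketch assumes the tag is everything after the last slash in a token.

```fsharp
open System

// Hypothetical helper (not part of the Stanford API): turns one line of
// tagger output such as "the/DT growth/NN" into (word, tag) pairs.
// Assumption: the tag is whatever follows the LAST '/' in a token,
// so tokens like ",/," and "#/#" are handled as well.
let parseTagged (line: string) : (string * string) list =
    line.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun token ->
        match token.LastIndexOf('/') with
        | -1 -> None // no tag separator, skip the token
        | i -> Some (token.Substring(0, i), token.Substring(i + 1)))
    |> Array.toList

// Example: keep only the words whose tag starts with "NN" (nouns).
let nouns (line: string) : string list =
    parseTagged line
    |> List.filter (fun (_, tag) -> tag.StartsWith("NN"))
    |> List.map fst
```

For example, nouns "the/DT growth/NN of/IN" keeps only the word tagged NN. Splitting on the last slash is a deliberate choice: it keeps a word that itself contains a slash in one piece.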

NLP: Stanford Parser with F# (.NET)

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

All code samples from this post are available on GitHub.

Natural Language Processing is another hot topic, just like Machine Learning. It is certainly extremely important, but poorly developed.

What do we have in .NET?

Let’s start with what we already have.

It looks really bad. It is hard to find anything really useful. Actually, we have one more option, which is IKVM.NET. With IKVM.NET we should be able to use most Java-based NLP frameworks. Let’s try to import Stanford Parser to .NET.

IKVM.NET overview.

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

  • A Java Virtual Machine implemented in .NET
  • A .NET implementation of the Java class libraries
  • Tools that enable Java and .NET interoperability

Read more about what you can do with IKVM.NET.

About Stanford NLP

The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.

All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

IKVM .jar to .dll compilation

First of all, we need to download and install IKVM.NET. You can get it from SourceForge. The next step is to download Stanford Parser (the latest version at the time of writing is 2.0.4 from 2012-11-12). Now we need to compile stanford-parser.jar to a .NET assembly. You can do it with the following command:

ikvmc.exe stanford-parser.jar

If you need a strong-named (signed) assembly, then you need two more steps: disassemble the DLL and reassemble it with your key.

ildasm.exe /all /out=stanford-parser.il stanford-parser.dll
ilasm.exe /dll /key=myKey.snk stanford-parser.il

An unsigned stanford-parser.dll is available on GitHub.

Let’s play!

That’s all! Now we are ready to start playing with Stanford Parser. I want to show one of the standard examples here (ParserDemo.fs); the second one is available on GitHub with the other sources.

open java.io
open edu.stanford.nlp.``process``
open edu.stanford.nlp.ling
open edu.stanford.nlp.trees
open edu.stanford.nlp.parser.lexparser

let demoAPI (lp:LexicalizedParser) =
  // This option shows parsing a list of correctly tokenized words
  let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
  let rawWords = Sentence.toCoreLabelList(sent)
  let parse = lp.apply(rawWords)
  parse.pennPrint()

  // This option shows loading and using an explicit tokenizer
  let sent2 = "This is another sentence."
  let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
  use sent2Reader = new StringReader(sent2)
  let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
  let parse = lp.apply(rawWords2)

  // Extract typed dependencies from the second parse tree
  let tlp = PennTreebankLanguagePack()
  let gsf = tlp.grammaticalStructureFactory()
  let gs = gsf.newGrammaticalStructure(parse)
  let tdl = gs.typedDependenciesCCprocessed()
  printfn "\n%O\n" tdl

  let tp = new TreePrint("penn,typedDependenciesCollapsed")
  tp.printTree(parse)

// demoDP is the second example; it is not shown here (see the GitHub sources)
let main fileName =
  let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
  match fileName with
  | Some(file) -> demoDP lp file
  | None -> demoAPI lp

What are we doing here? First of all, we instantiate LexicalizedParser and initialize it with the englishPCFG.ser.gz model. Then we create two sentences. The first is created from an already tokenized string (a string array in this sample). The second is created from a plain string using PTBTokenizer. Both are parsed with the same parser. For the second parse tree we also use PennTreebankLanguagePack to build a grammatical structure and extract the typed dependencies, and then print the tree and its collapsed dependencies with TreePrint. The resulting output can be found below.

[|"1"|]
Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\
stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... 
done [1.5 sec].
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT an) (JJ easy) (NN sentence)))
 (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), 
root(ROOT-0, sentence-4)]
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT another) (NN sentence)))
 (. .)))
nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)
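
The printed dependencies are regular enough to read back into F# values if you need them as data. The following is only a sketch in plain F#, assuming each line has exactly the shape relation(word-index, word-index) shown above. The Dependency record and tryParseDependency function are hypothetical names, not part of the Stanford Parser API. In real code you would more likely work with the typed dependency objects the parser returns, rather than with their printed form.

```fsharp
open System.Text.RegularExpressions

// Hypothetical record for one printed dependency,
// e.g. "nsubj(sentence-4, This-1)".
type Dependency =
    { Relation: string
      Governor: string
      GovernorIndex: int
      Dependent: string
      DependentIndex: int }

// Assumption: every line looks like relation(word-index, word-index).
let depPattern = Regex(@"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")

// Returns None for lines that do not look like a dependency
// (for example the lines of the Penn tree).
let tryParseDependency (line: string) : Dependency option =
    let m = depPattern.Match(line.Trim())
    if m.Success then
        Some { Relation = m.Groups.[1].Value
               Governor = m.Groups.[2].Value
               GovernorIndex = int m.Groups.[3].Value
               Dependent = m.Groups.[4].Value
               DependentIndex = int m.Groups.[5].Value }
    else
        None
```

Calling tryParseDependency on each output line and keeping the Some results gives you a list of records that you can filter by relation, for instance to find the subject of a sentence.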

I want to mention one more time that the full source code is available in the fsharp-stanford-nlp-samples GitHub repository. Feel free to use and extend it.

FSharp.ML – industry needs. (Machine Learning for .NET)

Machine Learning is a hot topic nowadays. ML is a core part of Data Analysis and an auxiliary tool in a lot of domains (NLP, search engines, e-commerce solutions, etc.). Many ML-related courses are available on Coursera in the “Statistics, Data Analysis, and Scientific Computing” and “Computer Science: Artificial Intelligence, Robotics, Vision” sections. Kaggle holds ML competitions more and more often.

Java has some popular and recognized ML libraries such as Mahout and Weka, but it is much harder to find a high-performance ML library for .NET (one that does not run on IKVM.NET).

What is already available in .NET World?

As Don Syme said, it would be cool to have an independent comparison of already available ML libraries. We need to understand what is suitable for what needs.

I also want to mention some of the most promising ones:

What can we do?

We keep saying that F# is great for data scientists and statisticians, and so it is! We still do not have a mature F# ML library, but we have a lot of posts about ML and a lot of interest in this domain:

It is time to put it all together into FSharp.ML. This can be done in two parts: a complete functional ML framework plus a collection of useful customizable samples.

F# Weekly #5, 2013

Welcome to F# Weekly,

A roundup of F# content from this past week:

News

Blogs

That’s all for now.  Have a great week.

Previous F# Weekly edition – #4

Why F#?

Didactic Code

If you’ve been following along with my posts over the past six months or so you can probably imagine that I’ve been asked some variation of this post’s title more than a few times. One question that I keep getting is why I chose F# over some other functional languages like Haskell, Erlang, or Scala. The problem with that question though is that it’s predicated on the assumption that I actually set out to learn a functional language. The truth is that moving to F# was more of a long but natural progression from C# rather than a conscious decision.

The story begins about two and a half years ago. I had pretty much burned out and was deep into what I can only begin to describe as a stagnation coma. I stopped attending user group meetings; I cut way back on reading; I pretty much stopped…

View original post 2,555 more words

F# Weekly #4, 2013

Welcome to F# Weekly,

It was a really nice week, full of news. A roundup of F# content from this past week:

News

Blogs

A bit more about new Try F#:

That’s all for now.  Have a great week.

Previous F# Weekly edition – #3

Explain rank in Sharepoint 2013 Search

Insights into search black magic

The default ranking model in Sharepoint 2013 is completely different from what we’ve seen in FS4SP and is definitely a step forward compared to SP2010. In a nutshell, it uses multiple features (query terms and their proximity, document click-through, authority, anchor text, URL depth, etc.), which are combined by a neural network as a final step. Details can be found in the patent claim http://www.google.com/patents/US8296292.

Hopefully there’s a way to shed more light on this black magic. I’ve modified the default Display Template and added an “Explain Rank” link to each item. This link redirects the user to the ExplainRank page, which is hosted under the {search_center_url}/_layouts/15/ folder.

I used

  • d=ctx.CurrentItem.Path
  • q=ctx.DataProvider.$2_3.k  (have found this hidden property using trial & error method)
  • another option is to extract the value for q= from QueryText in ctx.ListData.Properties.SerializedQuery, whose value was <Query Culture="en-US" EnableStemming="True" EnablePhonetic="False" EnableNicknames="False" IgnoreAllNoiseQuery="True" SummaryLength="180" MaxSnippetLength="180" DesiredSnippetLength="90" KeywordInclusion="0" QueryText="star wars" QueryTemplate="{searchboxquery}" TrimDuplicates="True"…

View original post 150 more words

F# Weekly #3, 2013

Welcome to F# Weekly,

A roundup of F# content from this past week:

News

Videos

Blogs

That’s all for now.  Have a great week.

Previous F# Weekly edition – #2

F# Weekly #2, 2013

Welcome to F# Weekly,

A roundup of F# content from this past week:

News

Blogs & Tutorials

Books

That’s all for now.  Have a great week.

Previous F# Weekly edition – #1