Update (January 3, 2014): Links and samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.
All code samples from this post are available on GitHub.
Natural Language Processing is another hot topic, much like Machine Learning. It is certainly important, but still poorly developed.
What do we have in .NET?
Let’s start with what we already have.
- Abodit NLP
- SharpNLP (looks dead)
- NLP for .NET (discontinued)
- NLTK from IronPython
- Antelope Framework (shareware)
This looks really bad: it is hard to find anything really useful. Actually, we have one more option, which is IKVM.NET. With IKVM.NET we should be able to use most Java-based NLP frameworks. Let’s try to import Stanford Parser into .NET.
IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:
- A Java Virtual Machine implemented in .NET
- A .NET implementation of the Java class libraries
- Tools that enable Java and .NET interoperability
Read more about what you can do with IKVM.NET.
About Stanford NLP
The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.
All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
IKVM .jar to .dll compilation
First of all, we need to download and install IKVM.NET. You can get it from SourceForge. The next step is to download Stanford Parser (the latest version at the time of writing is 2.0.4, released 2012-11-12). Now we need to compile stanford-parser.jar to a .NET assembly. You can do it with the following command:
    ikvmc.exe stanford-parser.jar
If you need a strong-named (signed) assembly, then you need two more steps: disassemble the DLL to IL, then reassemble it with your key.
    ildasm.exe /all /out=stanford-parser.il stanford-parser.dll
    ilasm.exe /dll /key=myKey.snk stanford-parser.il
No signed stanford-parser.dll is available on GitHub.
That’s all! Now we are ready to start playing with Stanford Parser. Here I want to show one of the standard examples (ParserDemo.fs); the second one is available on GitHub with the other sources.
    open java.io
    open edu.stanford.nlp.ling
    open edu.stanford.nlp.trees
    open edu.stanford.nlp.``process``   // `process` is a reserved word in F#, so it must be escaped
    open edu.stanford.nlp.parser.lexparser

    let demoAPI (lp:LexicalizedParser) =
        // This option shows parsing a list of correctly tokenized words
        let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
        let rawWords = Sentence.toCoreLabelList(sent)
        let parse = lp.apply(rawWords)
        parse.pennPrint()

        // This option shows loading and using an explicit tokenizer
        let sent2 = "This is another sentence."
        let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
        use sent2Reader = new StringReader(sent2)
        let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
        let parse = lp.apply(rawWords2)

        // Build and print the typed dependencies of the second parse tree
        let tlp = PennTreebankLanguagePack()
        let gsf = tlp.grammaticalStructureFactory()
        let gs = gsf.newGrammaticalStructure(parse)
        let tdl = gs.typedDependenciesCCprocessed()
        printfn "\n%O\n" tdl

        let tp = new TreePrint("penn,typedDependenciesCollapsed")
        tp.printTree(parse)

    // demoDP belongs to the second sample, which is not shown here (see the GitHub sources)
    let main fileName =
        let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
        match fileName with
        | Some(file) -> demoDP lp file
        | None -> demoAPI lp
What are we doing here? First of all, we load a LexicalizedParser from the englishPCFG.ser.gz model, which is trained on the Penn Treebank corpus. Then we parse two sentences. The first is built from already tokenized input (a string array in this sample). The second is tokenized from a raw string using PTBTokenizer. For the second parse tree we also build a grammatical structure with PennTreebankLanguagePack and print its typed dependencies. The resulting output is shown below.
    [|"1"|]
    Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... done [1.5 sec].
    (ROOT
      (S
        (NP (DT This))
        (VP (VBZ is)
          (NP (DT an) (JJ easy) (NN sentence)))
        (. .)))

    [nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), root(ROOT-0, sentence-4)]

    (ROOT
      (S
        (NP (DT This))
        (VP (VBZ is)
          (NP (DT another) (NN sentence)))
        (. .)))

    nsubj(sentence-4, This-1)
    cop(sentence-4, is-2)
    det(sentence-4, another-3)
    root(ROOT-0, sentence-4)
I want to mention once more that the full source code is available in the fsharp-stanford-nlp-samples GitHub repository. Feel free to use and extend it.
63 thoughts on “NLP: Stanford Parser with F# (.NET)”
It’s going to be finish of mine day, however before end I am reading this wonderful article to increase my knowledge.
Your work looks to be very promising. Unfortunately, I have not made time just yet to become familiar with F#. Do you have any pointers on working with the Stanford objects in c#?
Maybe a quick snippet showing construction of the parser and getting some simple POS?
You can port it to C# quite straightforwardly.
Will you please send me the complete code with packages?
I was just about to decompile your demo .dll’s to see the c# methods generated.
It’s mainly this construct that threw me for a loop:
let demoAPI (lp:LexicalizedParser)
I really need to start working with F#, as your original code seems much more elegant than the c# version.
Thank you very much for your time, now I can start experimenting with the parser in my c# project!
I’m also trying to reuse your work in a C# project, but I am having trouble building your project: IKVM.Fsharp.dll can’t be built because of some errors in “Collections.fs”. In fact, Visual Studio can’t interpret “open java.util” in this file. I assume this is normal, as IKVM is supposed to be the library that actually defines it, so I don’t get why this import is needed here.
Maybe I missed some step of the process, do you have any idea of where it could come from?
Please try NuGet Package https://nuget.org/packages/Stanford.NLP.Parser/ and try this sample https://sergeytihon.wordpress.com/2013/07/11/stanford-parser-is-available-on-nuget/
Well, the NuGet package is working fine. In fact, I was mainly interested in your NER implementation; the parser itself works fine if I import it in another project.
Please wait a bit, I will publish NER very soon. It may even happen today.
Oh, okay, thanks again for your great work 🙂
I tried the same with Stanford Segmenter (using C#). The main drawback is that the emitted files have no generics. For instance, I can’t write CRFClassifier with a type argument; instead I can only write plain CRFClassifier. Whenever I run my code I get a RuntimeException, and I guess it is related to this problem.
It is strange, because it works for me. I made a NuGet package at your request: https://www.nuget.org/packages/Stanford.NLP.Segmenter/ . You can find details of how it works in the post https://sergeytihon.wordpress.com/2013/09/09/stanford-word-segmenter-is-available-on-nuget/
Many thanks, it worked fine. I just had some mistakes with classifier flags. Thank you for your help.
Hello, I appreciate your sharing this – but I don’t see how to get your example-code to work. I installed the Nuget packages (all of them – there’re several) but how do you actually get a workable F# file? What exactly do you do to import the Stanford NLP code? I tried “open FSharp.NLP.Stanford.Parse”
What is that “lp.” in this line: let demoAPI (lp:LexicalizedParser) =
And then finally, my code is not recognizing PTBTokenizer (nor a lot of other things in this example).
Any pointers would be appreciated. How can I get your illustrative sample-code to run?
What do you mean by F# file? Are you trying to use it from *.fsx (an F# script file) or to compile *.fs?
If you need to compile your code, you can look at the full code sample on GitHub – https://github.com/sergey-tihon/fsharp-stanford-nlp-samples/tree/master/fsharp-stanford-nlp-samples/StanfordParser.Samples . If you need to do it from fsx, you need to load the required assemblies in FSI (#r “…”).
Thank you for responding Sergey. Sorry – I meant an .fs file that I want to compile into my Visual Studio solution (of which most is C#). Yes – I see the samples now. Which leads to the next question, if you don’t mind: How to get it to build? I downloaded it from that Github project as a zip file, unzipped it, and loaded the solution-file into VS 2013. I get 10 errors, possibly related to 204 warnings such as: “Could not located the assembly “IKVM.OpenJDK….
I’m thinking there is probably an important setup step that I’m missing. I do see under “How to use it”, these instructions: “Download models from..” (but I’m not seeing how to inform the Visual Studio project how to know where those models get placed), and “Extract models from ‘stanford-parse-3.2.0-models.jar (just unzip it).” and again, no indication of how to inform how to locate those.
1) “Could not located the assembly IKVM.OpenJDK….” This means that you should restore the NuGet dependencies: right-click on the solution, click on ‘Manage NuGet packages’, then click on `Restore`.
2) Find the lines of code that mention the path to ‘englishPCFG.ser.gz’. This file is actually packed inside `stanford-parser-3.3.0-models.jar`. Update the path to the correct one (where you extracted it).
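For the second point, a small helper can locate the model file wherever the archive was unpacked, instead of hard-coding a relative path. This is only a sketch in F#: `findModel` and the `C:\models` folder are hypothetical names that do not come from the samples, and it assumes the models jar has already been unzipped somewhere under that folder.

```fsharp
open System.IO

/// Searches `modelsRoot` recursively for a model file by name.
/// Returns the full path of the first match, or None if nothing is found.
/// NOTE: hypothetical helper, not part of the Stanford.NLP.NET samples.
let findModel (modelsRoot: string) (fileName: string) : string option =
    if Directory.Exists modelsRoot then
        Directory.EnumerateFiles(modelsRoot, fileName, SearchOption.AllDirectories)
        |> Seq.tryHead
        |> Option.map Path.GetFullPath
    else
        None

// Usage: `C:\models` is an assumed extraction folder; replace it with your own.
match findModel @"C:\models" "englishPCFG.ser.gz" with
| Some path -> printfn "Model found at %s" path
| None -> printfn "englishPCFG.ser.gz not found; check where the models jar was extracted"
```

The path returned in the `Some` case is what would be passed to `LexicalizedParser.loadModel` in place of the relative path used in the samples.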
Hi Sergey – thank you for trying to help. I don’t see a “Restore” option. Right-clicking on the solution in VS2013, I see the option “Manage NuGet Packages for Solution…”. Within the resulting dialog, I checked out “Installed packages”, and “All”, and “Online”, and “Updates”. For “Installed packages”, I do see one package, named “IKVM.NET”, and it has a “Manage” button. Clicking on that brings up a “Select Projects” dialog, with all of the several projects already selected. I do see that when I bring up the source file Collections.fs, on the line with “open java.util”, “java” is underlined in red, as are the words “Iterator” and “ArrayList”.
And in the IKVM.FSharp project, looking in References – there’s a whole mess of references not found – all starting with “IKVM.” Looks like something is not in the right place?
Sorry to be a whiner. I’m rather excited at the prospect of exploring this! Thanks for your advice,
It is very strange… Could you try to re-install IKVM.NET from NuGet (remove and install again)? It looks like the simplest way …
Done. Now I get 75 errors. Does “The namespace or module ‘edu’ is not defined” sound familiar? Which version of Visual Studio are you using?
VS2013 or VS2012. It is not important.
The ‘edu’ error looks like you do not reference the Stanford.NLP.Parser NuGet package.
Evidently, the NuGet packaging is what is not working. I downloaded the IKVM.NET bit of software separately and uncompressed it, and that does have the DLL files that this solution complained about missing. So I added a reference to ikvm-7.2.4630.5/bin-x64/JVM.DLL, and now those references do show up within (for example) the StanfordNamedEntityRecognizerSamples project. It still gives an error when trying to run it, though, raising a FileLoadException, because IKVM.OpenJDK.Core, 7.3.4830.0, does not match the manifest. Is that perhaps because the manifest calls for a different version? Has anyone ever gotten this to run? I have a fresh virtual machine with VS 2010 to use, to try again from scratch. But a more explicit set of steps would probably help, so that I don’t waste another day trying every possible combination.
Hello, what is the actual sequence of steps required to get your project working? All I need help with, I think, is just to get to the point of having one working sample. I believe I can take it from there.
I used git clone https://github.com/sergey-tihon/fsharp-stanford-nlp-samples.git
to get your repository onto a fresh virtual machine (vm), with Windows 7 x64, and Visual Studio 2010 Ultimate.
I see that created a folder fsharp-stanford-nlp-samples, and there is a Visual Studio (VS) solution within that (which, by-the-way, I’m not able to open with VS 2010 – I had to shift over to another VM that I’d installed VS 2012 on). So I opened that solution file, tried to build.. 8 errors. You have to check “Allow NuGet to download missing packages during build.” from Tools/Options/Package Manager. Tried to build again: 7 Errors.
Could not resolve this reference. Could not locate the assembly ‘IKVM.OpenJDK.SwingAWT, ..
Ok – trying now to use NuGet to bring in dependencies. Opening “Manage NuGet Packages”, I do a search for “Stanford”, and see six different packages.
I wonder – which of these needs to be installed? What is the minimum needed, to start? I tried getting just Stanford.NLP.Parser. Building the solution now yields 107 Warnings, 10 Errors.
So then I tried installing all six of those packages, checking the checkboxes to ensure they were installed for every project.
Now a build of the solution yields: 107 Warnings, 9 Errors. The first warning is the same as shown above.
I am thinking that, perhaps, it could be useful to have some steps explicitly laid out for people to use this. Unless (not unlikely) I am totally missing something obvious?
Thank you for your help Sergey,
Hi, I think that I have partially reproduced this case. You should not reference all available packages. They may conflict with each other (the same types in the same namespaces). First of all, decide which one you need and then reference it from NuGet (read more about the packages on the Stanford NLP website http://nlp.stanford.edu/software/index.shtml).
CoreNLP should be an umbrella project. Almost all available features should be inside.
Is there any link or a tutorial like this to getting started to incorporate this parser in Java?
Originally, it is a Java parser. Instructions are available on the original site – http://www-nlp.stanford.edu/software/lex-parser.shtml
Thank you Sergey !
Can you please tell me how exactly I should make a dll from the .jar file? I am unable to do that.
I have used your line of code
After downloading, I have two different folders one is ikvmbin-7.2.4630.5 and second is stanford-parser-2012-11-12
Hi, you should not do it by yourself. You can download recompiled version from NuGet https://www.nuget.org/packages/Stanford.NLP.Parser/ .
Up-to-date samples are available here http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordParser.html
Where can i find english.all.3class.distsim.crf.ser.gz file ??
It’s throwing a ‘TypeInitializationException’…
I am referencing code from here..
Here it is http://www-nlp.stanford.edu/software/stanford-parser-full-2013-11-12.zip
Big blue button on http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordParser.html page.
I found the link below where they have used some txt files for state,names etc..
So, my question how to include these files and use it in C#.Net code
All files are packed in the zip archive (that is referenced from the page http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordCoreNLP.html ). You need to download the zip, unpack it and use the files from inside. There are two options: temporarily change the current directory, or manually specify paths to all required files https://github.com/sergey-tihon/Stanford.NLP.NET/blob/master/tests/Stanford.NLP.CoreNLP.FSharp.Tests/CoreNLP.fs
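The first option, temporarily changing the current directory, could be sketched in F# roughly as follows. `withCurrentDirectory` is a hypothetical helper name, not something from the library or the linked tests, and the temp folder only stands in for wherever the zip was unpacked.

```fsharp
open System.IO

/// Runs `action` with the process working directory set to `dir`,
/// then restores the previous working directory, even if `action` throws.
/// NOTE: hypothetical helper, not part of Stanford.NLP.NET.
let withCurrentDirectory (dir: string) (action: unit -> 'T) : 'T =
    let previous = Directory.GetCurrentDirectory()
    Directory.SetCurrentDirectory dir
    try
        action ()
    finally
        Directory.SetCurrentDirectory previous

// Usage: the temp folder is a stand-in for the folder with the unpacked models.
let modelsDir = Path.GetTempPath()
let seen = withCurrentDirectory modelsDir (fun () -> Directory.GetCurrentDirectory())
printfn "Inside the action the working directory was: %s" seen
```

Code that loads models by relative path would go inside the function passed to `withCurrentDirectory`. Keep in mind that the working directory is process-wide, so this approach is not safe when several threads rely on relative paths at the same time; specifying the paths explicitly avoids that.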
Thanks for the help. Your inputs are always helpful.
But, Wanted to ask one thing, can I make resume parser using Stanford Library?
If yes, from which point should i start. Because exact way I am not getting it.
Have seen all of the examples/demos for this library.
Can you please help me on that.
Hi, It depends on multiple things:
– What is your goal? What do you want to extract from your resume?
– What is the format of your resume? Is it structured?
1) What is your goal? What do you want to extract from your resume?
— My aim is to extract candidate information from resume and store in the database. Want to extract almost every single information Like
( First name,last name,email id,mobile#,projects,experience,personal info, academic records, awards, achievements,skill,qualifications etc etc.).
Currently I am concentrating Only on doc,docx, text files.
So this information will be useful while searching a suitable candidate for a job.
2) What is the format of your resume? Is it structured?
It is Unstructured, every candidate will have different type of resume.
I guess you could try RegEx; you can use Expresso initially to test your regular expressions. I’m suggesting this because a resume usually has label strings like “Name” before the candidate writes their name, and likewise for other details.
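As a rough illustration of that idea, here is a minimal F# sketch that pulls a labelled name and an email address out of plain text with .NET regular expressions. The patterns, the `tryExtract` helper and the sample text are all assumptions made up for this example; real resumes will need far more patterns and will still miss unlabelled fields.

```fsharp
open System.Text.RegularExpressions

// A deliberately loose email pattern; an assumption for illustration only.
let emailPattern = Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

// Matches a line such as "Name: Jane Doe" (case-insensitive, line by line).
let labelledName = Regex(@"(?im)^\s*Name\s*[:\-]\s*(.+?)\s*$")

/// Returns the requested capture group of the first match, if any.
/// NOTE: hypothetical helper written for this example.
let tryExtract (pattern: Regex) (groupIndex: int) (text: string) : string option =
    let m = pattern.Match text
    if m.Success then Some m.Groups.[groupIndex].Value else None

// Usage with a made-up resume fragment.
let sample = "Name: Jane Doe\nEmail: jane.doe@example.com"
printfn "Name:  %A" (tryExtract labelledName 1 sample)
printfn "Email: %A" (tryExtract emailPattern 0 sample)
```

Label-based patterns like these only cover the fields that candidates happen to label; for free-form sections a named entity recognizer is likely a better fit.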
I’m using the Stanford Dependency Parser to resolve dependencies in one of my projects. I have the following problem, and I hope you can help me.
When I analyze dependencies in a review text, it works great when the sentence is short, but for long sentences it does not give all the required dependencies. For example, when I try to find the dependencies in the following sentence,
“The Navigation is better.” there is dependency nsubj that groups “Navigation” and “better”, telling me the review regarding navigation is positive.
But when review sentence is bigger like
“Navigation system is better then the Jeeps and as good as my husbands Audi A-8 system.”
I don’t get any dependency relations grouping Navigation with better and Navigation with good. I tried using both basic and collapsed dependencies. I went through Stanford Dependencies Manual , but couldn’t figure out much that will help here. I just want whatever the aspect user is talking about should be grouped with its adjective and adverb.
I’m trying with CCprocessed dependencies…
Well, there is an update: I tried all the dependency methods available in stanford.nlp.net, viz. typedDependenciesCCprocessed(true), typedDependenciesCollapsed(true), typedDependencies(true), typedDependenciesCollapsedTree() and allTypedDependencies().
Hello, please ask this question on SO http://stackoverflow.com/questions/tagged/stanford-nlp
I am not able to find the file stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz.
Please help me.
It’s very urgent.
Models are inside the `*models.jar` in this zip: http://nlp.stanford.edu/software/stanford-parser-full-2014-06-16.zip
I have followed your instructions up to “IKVM .jar to .dll compilation”. I think I was successful up to there. I then created a F# project in Visual Studio 2013. I put your code into it. VS does not have a definition of LexicalizedParser. I understand C++ and C# but I do not understand F#. I assume we must add a reference but I do not know what to reference. Is there more to the F# program that does the equivalent of a “using” in C#? Am I correct that the F# program needs a little bit more such as that?
I also used the tangiblesoftwaresolutions.com converter to convert the stanford-parser ParserDemo.java sample to C# but obviously that also needs a reference.
I apologize for not being able to figure this out, but if you can help me to understand what to reference then I will appreciate it.
I have seen your samples in your “Stanford Parser is available on NuGet for F# and C#” but the C# sample source also does not show what to reference and such. If that question is easily answered when I install what you have from that article then I should do that. Are the answers there?
Okay I installed Stanford.NLP.Parser using the Package Manager Console. In the C# program (converted from the stanford-parser ParserDemo.java sample to C#) I managed to get:
And that seems to work except there is one error that is outside the scope of here. So I tried using the following for your F# sample here:
However VS says that “process” is reserved.
Hi, could you please have a look at the C# sample here: http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordParser.html ? Does it help somehow?
I copied the C# code you have here, but the line “var gs = gsf.newGrammaticalStructure(tree);” is causing an error:
“A first chance exception of type ‘edu.stanford.nlp.trees.tregex.TregexParser.LookaheadSuccess’ occurred in stanford-corenlp-3.5.2.dll”
Hi, try this one https://github.com/sergey-tihon/Stanford.NLP.NET/issues/19#issuecomment-109420786
Unfortunately, I’m still getting the same error. It’s actually a stack overflow error, but the output window is printing “A first chance exception of type ‘edu.stanford.nlp.trees.tregex.TregexParser.LookaheadSuccess'” until the overflow occurs
Actually, this sample https://github.com/sergey-tihon/Stanford.NLP.NET/blob/master/samples/Stanford.NLP.Parser.CSharp/Program.cs works on my machine. Which NuGet package did you reference?
My code is almost identical to that example but does not work. I believe I have the most recent package from NuGet (3.5.2). I downloaded by typing “Install-Package Stanford.NLP.Parser” in the PM console as instructed.
Sorry, no. It should work with latest nuget and latest model from Stanford site.