NuGet dependency visualizer with F# and Graphviz

The script for this article is available as a public Gist.

For a long time I have been interested in what is going on on NuGet. I think that the NuGet UI misses one important piece of information – which packages depend on the current package. This information is very useful for package authors and can also help a user find packages that provide a more sophisticated implementation.

Another interesting thing is to see the big picture and answer some global questions:

  • What is inside the technology I use? What are the dependencies of the packages that I use? What are the dependencies of those dependencies, and so on?
  • What is built on top of the packages that I maintain? What beautiful applications of my ideas can others find? What is the cost of releasing a broken package? Who can be harmed?

I believe there are many other answers we can find from this high-level view.

Some time ago I found that the NuGet team provides the NuGet.Core package, which has all the API required to communicate with NuGet. The API is not really fast if you are going to download information about all versions of all NuGet packages 😉 But the NuGet team is working on a new v3 API that is going to be much faster than the current v2. For the current research I downloaded info about all packages and all their versions into my FSI session, to be able to run different types of analysis and create visualizations without further communication with NuGet. This operation is slow enough: it took about an hour the last time I ran it, but it really depends on the NuGet workload and your internet connection.
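For illustration, here is a rough sketch of that download step. It uses the NuGet.Core v2 repository API (PackageRepositoryFactory / GetPackages); IsLatestVersion and DependencySets are assumed here to be the IPackage members that carry the needed metadata, and the real script in the Gist is more elaborate.

// A rough sketch: pull package metadata into the FSI session once,
// then analyze it offline without talking to NuGet again.
#r "../packages/nuget.core.2.8.2/lib/net40-Client/Nuget.Core.dll"

open NuGet

let repository =
    PackageRepositoryFactory.Default.CreateRepository "https://nuget.org/api/v2"

// Enumerating the whole gallery is the slow part: it pages through every package.
let allPackages =
    repository.GetPackages()
    |> Seq.filter (fun p -> p.IsLatestVersion)
    |> Seq.map (fun p ->
        let deps =
            p.DependencySets
            |> Seq.collect (fun ds -> ds.Dependencies |> Seq.map (fun d -> d.Id))
            |> Seq.distinct
            |> List.ofSeq
        p.Id, deps)
    |> List.ofSeq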

The second thing is visualization of the result. Here I want to say thank you to Scott Wlaschin for his great script type-dependency-graph.fsx (which was built for the ‘Cycles and modularity in the wild‘ analysis). I took his GraphViz module and slightly modified it to allow printing colorful graphs. To use it you need to download GraphViz from the official site.

That is all the tooling we need, so we are ready to describe the structure of the analysis: the script extracts three subsets of NuGet packages and visualizes them, in different colors, together with all their dependencies (a minimal sketch of this classification and of the colored graph output follows the note below).

  1. Packages that are in the scope of the analysis (green on the graphs)
  2. Packages that we depend on (grey on the graphs) – a package is in Set 2 if there exists a dependency path from any package in Set 1 to it.
  3. Packages that depend on us (blue on the graphs) – a package is in Set 3 if there exists a dependency path from it to any package in Set 1.

NOTE: In this analysis I ignored package versions and considered only the latest version of each package and the dependencies of that latest version. If you need a more accurate analysis, you should adjust the script a bit.
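To make the classification concrete, here is a minimal, self-contained sketch of the three sets and of the colored DOT output. The package names and the dependency map are illustrative placeholders, not the actual Gist code, and rendering the .dot file assumes GraphViz is installed.

// Illustrative dependency map: packageId -> direct dependency ids.
let dependsOn : Map<string, string list> =
    Map.ofList
        [ "MyLib",        [ "FSharp.Core" ]
          "MyLib.Extras", [ "MyLib" ]
          "FSharp.Core",  [ ] ]

// Set 1: packages in the scope of the analysis (green).
let scope = set [ "MyLib" ]

// Transitive closure of a step function over a seed set.
let closure (step: string -> string list) seed =
    let rec loop frontier acc =
        if Set.isEmpty frontier then acc
        else
            let next = (frontier |> Seq.collect step |> Set.ofSeq) - acc
            loop next (acc + next)
    loop seed seed

let directDeps p = defaultArg (dependsOn.TryFind p) []
let directDependants p =
    [ for KeyValue(pkg, deps) in dependsOn do
        if List.exists ((=) p) deps then yield pkg ]

// Set 2: everything reachable from Set 1 (grey). Set 3: everything that can reach Set 1 (blue).
let weDependOn = closure directDeps scope - scope
let dependants = closure directDependants scope - scope

let color p =
    if Set.contains p scope then "green"
    elif Set.contains p weDependOn then "grey"
    else "lightblue"

// Emit the DOT file; render it with: dot -Tsvg graph.dot -o graph.svg
let dot =
    seq { yield "digraph deps {"
          for p in Set.unionMany [ scope; weDependOn; dependants ] do
              yield sprintf "  \"%s\" [style=filled, fillcolor=%s];" p (color p)
          for KeyValue(pkg, deps) in dependsOn do
              for d in deps do yield sprintf "  \"%s\" -> \"%s\";" pkg d
          yield "}" }
System.IO.File.WriteAllText("graph.dot", String.concat "\n" dot)

Rendering graph.dot with dot -Tsvg produces the kind of colored pictures shown below.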

Running this script, I made some interesting observations:

FsPickler is already highly used by other packages!
FSharp.Compiler.Service is already deeply incorporated in tooling!
Too many different FSharp.Core packages on NuGet adopted by different tools.
FSharpx is still alive! 😉
Roslyn ohhhh Roslyn

 

F# Ecosystem is HUGE! (see the full SVG version here; sorry, but without FunScript ;))

I hope that you’ll also find this script useful and discover something interesting on NuGet.

 

F# Kung Fu #4: Avoid using relative paths in #r directives

Today, Vladimir Makarov faced a quite interesting ‘bug’ (very unexpected behavior) in FSI. The initial goal was quite simple – count the number of NuGet packages that have “ASP.NET” in the title. As a result, a script was created that works perfectly in compiled form but crashes in FSI. Here it is:

#r "../packages/nuget.core.2.8.2/lib/net40-Client/Nuget.Core.dll"
#r "System.Xml.Linq.dll"

let repository =
    NuGet.PackageRepositoryFactory.Default.CreateRepository
        "https://nuget.org/api/v2"

let aspnet = query {
    for p in repository.GetPackages() do
    where (p.Title.Contains "ASP.NET")
    count
}

When we run this in FSI, we get an exception:

System.ArgumentException: Incorrect instance type
Parameter name: obj
at Microsoft.FSharp.Quotations.PatternsModule.mkInstanceMethodCall(FSharpExpr obj, MethodInfo minfo, FSharpList`1 args)
at Microsoft.FSharp.Quotations.ExprShapeModule.RebuildShapeCombination(Object shape, FSharpList`1 arguments)
at Microsoft.FSharp.Primitives.Basics.List.map[T,TResult](FSharpFunc`2 mapping, FSharpList`1 x)
at Microsoft.FSharp.Linq.QueryModule.walk@933-1[a](FSharpFunc`2 f, FSharpExpr p)
at Microsoft.FSharp.Linq.QueryModule.EvalNonNestedInner(CanEliminate canElim, FSharpExpr queryProducingSequence)
at Microsoft.FSharp.Linq.QueryModule.EvalNonNestedOuter(CanEliminate canElim, FSharpExpr tm)
at Microsoft.FSharp.Linq.QueryModule.clo@1741-1.Microsoft-FSharp-Linq-ForwardDeclarations-IQueryMethods-Execute[a,b](FSharpExpr`1 )
at <StartupCode$FSI_0002>.$FSI_0002.main@() in D:\Downloads\test.fsx:line 4

When we looked more carefully at the FSI output, we saw this:

nuget.core

WAT? FSI lies to us?! The first message says that the correct version of the DLL was referenced, but then FSI loads a completely different, old version installed with ASP.NET. Why? Let’s check what #r actually does…

#r means to reference by DLL path, focusing on the name. This means that FSI will use the file name first, looking in the system-wide search path, and only then try to use the string after #r as a directory-relative hint.
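To see which physical assembly FSI actually bound to, you can list the assemblies loaded into the current AppDomain. This is a small diagnostic sketch of my own (not part of the original script); run it in the same FSI session right after the #r line:

// List every loaded assembly named "NuGet.Core" with its version and file location.
System.AppDomain.CurrentDomain.GetAssemblies()
|> Seq.filter (fun a -> a.GetName().Name = "NuGet.Core")
|> Seq.iter (fun a -> printfn "%O -> %s" (a.GetName().Version) a.Location)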

So that means #r is not a reliable way to reference assemblies. You can get into a situation where your script depends on the environment: assemblies in the GAC, installed software (like the version of ASP.NET) and so on. To avoid this, it is better to explicitly specify an assembly search path (#I) and then reference the assembly:

#I "../packages/nuget.core.2.8.2/lib/net40-Client"
#r "Nuget.Core.dll"
#r "System.Xml.Linq.dll"

let repository =
    NuGet.PackageRepositoryFactory.Default.CreateRepository
        "https://nuget.org/api/v2"

let aspnet = query {
    for p in repository.GetPackages() do
    where (p.Title.Contains "ASP.NET")
    count
}

Thanks to Vladimir Makarov for the interesting challenge, and be careful in your scripts.

Stanford CoreNLP is available on NuGet for F#/C# devs

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.


Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP integrates all Stanford NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system, and provides model files for analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.

Stanford CoreNLP is here and available on NuGet. It is probably the most powerful package of the whole Stanford NLP Group software collection. Please read the usage overview on the Stanford CoreNLP home page to understand what it can do, how you can configure an annotation pipeline, what steps are available for you, what models you need to have and so on.

I want to say thank you to Anonymous 😉 and @OneFrameLink for their contributions and for stimulating me to finish this work.

Please follow the next steps to get started:

Before using Stanford CoreNLP, we need to define and specify the annotation pipeline. For example, annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref.

The next thing we need to do is create the StanfordCoreNLP pipeline. But to instantiate a pipeline, we need to specify all required properties, or at least the paths to all models used by the pipeline that are listed in the annotators string. Before starting with the samples, let’s define some helpers that will be used across all source code pieces: jarRoot is the path to the folder where we extracted the files from stanford-corenlp-3.2.0-models.jar; modelsRoot is the path to the folder with all model files; ‘!’ is an overloaded operator that converts a model name into a relative path to the model file.

let (@@) a b = System.IO.Path.Combine(a,b)
let jarRoot = __SOURCE_DIRECTORY__ @@ @"..\..\temp\stanford-corenlp-full-2013-06-20\stanford-corenlp-3.2.0-models\"
let modelsRoot = jarRoot @@ @"edu\stanford\nlp\models\"
let (!) path = modelsRoot @@ path

Now we are ready to instantiate the pipeline, but we need to do a small trick. The pipeline is configured to use the default model files (for simplicity), and all paths are specified relative to the root of stanford-corenlp-3.2.0-models.jar. To make things easier, we can temporarily change the current directory to jarRoot, instantiate the pipeline, and then change the current directory back. This trick helps us dramatically decrease the number of lines of code.

let props = Properties()
props.setProperty("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref") |> ignore
props.setProperty("sutime.binders","0") |> ignore

let curDir = System.Environment.CurrentDirectory
System.IO.Directory.SetCurrentDirectory(jarRoot)
let pipeline = StanfordCoreNLP(props)
System.IO.Directory.SetCurrentDirectory(curDir)

However, you do not have to do it this way; you can configure all models manually. The number of properties (especially paths to models) that you need to specify depends on the annotators value. Let’s assume for a moment that we are in the Java world and we want to configure our pipeline in a custom way. Especially for this case, stanford-corenlp-3.2.0-models.jar contains StanfordCoreNLP.properties (you can find it in the folder with the extracted files), where you can specify new property values outside the code. Most of the properties that we need for configuration are already mentioned in this file, and you can easily understand what is what. But it is not enough to get it working; you also need to look into the source code of Stanford CoreNLP. By the way, some days ago Stanford moved the CoreNLP source code to GitHub – now it is much easier to browse it. Default paths to the models are specified in the DefaultPaths.java file, property keys are listed in the Constants.java file, and information about which path matches which property name is contained in Dictionaries.java. Thus, you are able to dive deeper into the pipeline configuration and do whatever you want. For lazy people, I already have a working sample.

let props = Properties()
let (<==) key value = props.setProperty(key, value) |> ignore
"annotators"    <== "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
"pos.model"     <== ! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger"
"ner.model"     <== ! @"ner\english.all.3class.distsim.crf.ser.gz"
"parse.model"   <== ! @"lexparser\englishPCFG.ser.gz"

"dcoref.demonym"            <== ! @"dcoref\demonyms.txt"
"dcoref.states"             <== ! @"dcoref\state-abbreviations.txt"
"dcoref.animate"            <== ! @"dcoref\animate.unigrams.txt"
"dcoref.inanimate"          <== ! @"dcoref\inanimate.unigrams.txt"
"dcoref.male"               <== ! @"dcoref\male.unigrams.txt"
"dcoref.neutral"            <== ! @"dcoref\neutral.unigrams.txt"
"dcoref.female"             <== ! @"dcoref\female.unigrams.txt"
"dcoref.plural"             <== ! @"dcoref\plural.unigrams.txt"
"dcoref.singular"           <== ! @"dcoref\singular.unigrams.txt"
"dcoref.countries"          <== ! @"dcoref\countries"
"dcoref.extra.gender"       <== ! @"dcoref\namegender.combine.txt"
"dcoref.states.provinces"   <== ! @"dcoref\statesandprovinces"
"dcoref.singleton.predictor"<== ! @"dcoref\singleton.predictor.ser"

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
"sutime.rules"      <== sutimeRules
"sutime.binders"    <== "0"

let pipeline = StanfordCoreNLP(props)

As you see, this option is much longer and harder to do. I recommend using the first one, especially if you do not need to change the default configuration.

And now the fun part. Everything else is pretty easy: we create an annotation from the text, pass it through the pipeline and interpret the results.

let text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

let annotation = Annotation(text)
pipeline.annotate(annotation)
use stream = new ByteArrayOutputStream()
pipeline.prettyPrint(annotation, new PrintWriter(stream))
printfn "%O" (stream.toString())

Certainly, you can extract all processing results from the annotated text.

let customAnnotationPrint (annotation:Annotation) =
    printfn "-------------"
    printfn "Custom print:"
    printfn "-------------"
    let sentences = annotation.get(CoreAnnotations.SentencesAnnotation().getClass()) :?> java.util.ArrayList
    for sentence in sentences |> Seq.cast<CoreMap> do
        printfn "\n\nSentence : '%O'" sentence

        let tokens = sentence.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.ArrayList
        for token in (tokens |> Seq.cast<CoreLabel>) do
            let word = token.get(CoreAnnotations.TextAnnotation().getClass())
            let pos  = token.get(CoreAnnotations.PartOfSpeechAnnotation().getClass())
            let ner  = token.get(CoreAnnotations.NamedEntityTagAnnotation().getClass())
            printfn "%O \t[pos=%O; ner=%O]" word pos ner

        printfn "\nTree:"
        let tree = sentence.get(TreeCoreAnnotations.TreeAnnotation().getClass()) :?> Tree
        use stream = new ByteArrayOutputStream()
        tree.pennPrint(new PrintWriter(stream))
        printfn "The first sentence parsed is:\n %O" (stream.toString())

        printfn "\nDependencies:"
        let deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation().getClass()) :?> SemanticGraph
        for edge in deps.edgeListSorted().toArray() |> Seq.cast<SemanticGraphEdge> do
            let gov = edge.getGovernor()
            let dep = edge.getDependent()
            printfn "%O(%s-%d,%s-%d)"
                (edge.getRelation())
                (gov.word()) (gov.index())
                (dep.word()) (dep.index())

The full code sample is available on GitHub; if you run it, you will see the following result:

Sentence #1 (9 tokens):
Kosgi Santosh sent an email to Stanford University.
[Text=Kosgi CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Kosgi NamedEntityTag=PERSON] [Text=Santosh CharacterOffsetBegin=6 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Santosh NamedEntityTag=PERSON] [Text=sent CharacterOffsetBegin=14 CharacterOffsetEnd=18 PartOfSpeech=VBD Lemma=send NamedEntityTag=O] [Text=an CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=email CharacterOffsetBegin=22 CharacterOffsetEnd=27 PartOfSpeech=NN Lemma=email NamedEntityTag=O] [Text=to CharacterOffsetBegin=28 CharacterOffsetEnd=30 PartOfSpeech=TO Lemma=to NamedEntityTag=O] [Text=Stanford CharacterOffsetBegin=31 CharacterOffsetEnd=39 PartOfSpeech=NNP Lemma=Stanford NamedEntityTag=ORGANIZATION] [Text=University CharacterOffsetBegin=40 CharacterOffsetEnd=50 PartOfSpeech=NNP Lemma=University NamedEntityTag=ORGANIZATION] [Text=. CharacterOffsetBegin=50 CharacterOffsetEnd=51 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (NNP Kosgi) (NNP Santosh))
(VP (VBD sent)
(NP (DT an) (NN email))
(PP (TO to)
(NP (NNP Stanford) (NNP University))))
(. .)))

nn(Santosh-2, Kosgi-1)
nsubj(sent-3, Santosh-2)
root(ROOT-0, sent-3)
det(email-5, an-4)
dobj(sent-3, email-5)
nn(University-8, Stanford-7)
prep_to(sent-3, University-8)

Sentence #2 (7 tokens):
He didn’t get a reply.
[Text=He CharacterOffsetBegin=52 CharacterOffsetEnd=54 PartOfSpeech=PRP Lemma=he NamedEntityTag=O] [Text=did CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD Lemma=do NamedEntityTag=O] [Text=n’t CharacterOffsetBegin=58 CharacterOffsetEnd=61 PartOfSpeech=RB Lemma=not NamedEntityTag=O] [Text=get CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=VB Lemma=get NamedEntityTag=O] [Text=a CharacterOffsetBegin=66 CharacterOffsetEnd=67 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=reply CharacterOffsetBegin=68 CharacterOffsetEnd=73 PartOfSpeech=NN Lemma=reply NamedEntityTag=O] [Text=. CharacterOffsetBegin=73 CharacterOffsetEnd=74 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (PRP He))
(VP (VBD did) (RB n’t)
(VP (VB get)
(NP (DT a) (NN reply))))
(. .)))

nsubj(get-4, He-1)
aux(get-4, did-2)
neg(get-4, n’t-3)
root(ROOT-0, get-4)
det(reply-6, a-5)
dobj(get-4, reply-6)

Coreference set:
(2,1,[1,2)) -> (1,2,[1,3)), that is: “He” -> “Kosgi Santosh”

C# Sample

C# samples are also available on GitHub.

Stanford Temporal Tagger (SUTime)


SUTime is a library for recognizing and normalizing time expressions. SUTime is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility.

There is one more useful thing that we can do with CoreNLP – time extraction. The way we use CoreNLP here is pretty similar to the previous sample. First, we create an annotation pipeline and add all the required annotators to it. (Notice that this sample also uses the operator defined at the beginning of the post.)

let pipeline = AnnotationPipeline()
pipeline.addAnnotator(PTBTokenizerAnnotator(false))
pipeline.addAnnotator(WordsToSentencesAnnotator(false))

let tagger = MaxentTagger(! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger")
pipeline.addAnnotator(POSTaggerAnnotator(tagger))

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
let props = Properties()
props.setProperty("sutime.rules", sutimeRules ) |> ignore
props.setProperty("sutime.binders", "0") |> ignore
pipeline.addAnnotator(TimeAnnotator("sutime", props))

Now we are ready to annotate something. This part is almost identical to the previous sample.

let text = "Three interesting dates are 18 Feb 1997, the 20th of july and 4 days from today."
let annotation = Annotation(text)
annotation.set(CoreAnnotations.DocDateAnnotation().getClass(), "2013-07-14") |> ignore
pipeline.annotate(annotation)

And finally, we need to interpret the annotation results.

printfn "%O\n" (annotation.get(CoreAnnotations.TextAnnotation().getClass()))
let timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations().getClass()) :?> java.util.ArrayList
for cm in timexAnnsAll |> Seq.cast<CoreMap> do
    let tokens = cm.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.List
    let first = tokens.get(0)
    let last = tokens.get(tokens.size() - 1)
    let time = cm.get(TimeExpression.Annotation().getClass()) :?> TimeExpression
    printfn "%A [from char offset '%A' to '%A'] --> %A"
        cm first last (time.getTemporal())

The full code sample is available on GitHub; if you run it, you will see the following result:

18 Feb 1997 [from char offset '18' to '1997'] --> 1997-2-18
the 20th of july [from char offset 'the' to 'July'] --> XXXX-7-20
4 days from today [from char offset '4' to 'today'] --> THIS P1D OFFSET P4D

C# Sample

C# samples are also available on GitHub.

Conclusion

This is a pretty awesome library. I hope you enjoy it. Try it out right now!

There are some other more specific Stanford packages that are already available on NuGet:

Stanford Word Segmenter is available on NuGet

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.


Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

One more tool from the Stanford NLP software package became available on NuGet today: the Stanford Word Segmenter. This is the fourth Stanford NuGet package published by me; the previous ones were the “Stanford Parser“, “Stanford Named Entity Recognizer (NER)“ and “Stanford Log-linear Part-Of-Speech Tagger“. Please follow the next steps to get started:

F# Sample

For more details see source code on GitHub.

open java.util
open edu.stanford.nlp.ie.crf

[<EntryPoint>]
let main argv =
    if (argv.Length <> 1) then
        printf "usage: StanfordSegmenter.Csharp.Samples.exe filename"
    else
        let props = Properties()
        props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data") |> ignore
        props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz") |> ignore
        props.setProperty("testFile", argv.[0]) |> ignore
        props.setProperty("inputEncoding", "UTF-8") |> ignore
        props.setProperty("sighanPostProcessing", "true") |> ignore

        let segmenter = CRFClassifier(props)
        segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props)
        segmenter.classifyAndWriteAnswers(argv.[0])
    0

C# Sample

For more details see source code on GitHub.

using java.util;
using edu.stanford.nlp.ie.crf;

namespace StanfordSegmenter.Csharp.Samples
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                System.Console.WriteLine("usage: StanfordSegmenter.Csharp.Samples.exe filename");
                return;
            }

            var props = new Properties();
            props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data");
            props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz");
            props.setProperty("testFile", args[0]);
            props.setProperty("inputEncoding", "UTF-8");
            props.setProperty("sighanPostProcessing", "true");

            var segmenter = new CRFClassifier(props);
            segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props);
            segmenter.classifyAndWriteAnswers(args[0]);
        }
    }
}

let runFAKE = Download >> Unzip >> IKVMCompile >> Sign >> NuGet

This post is about one more FAKE use case. It is not a usual one, but I hope it is a useful script.

The problem I have faced is the recompilation of Stanford NLP products to .NET using IKVM.NET. I am sick of doing it manually. I have posted instructions on how to do it, but I think that not many people have tried. I believe that I can automate it end to end, from downloading the *.jar files to building the NuGet packages. Of course, I have chosen FAKE for this task (thanks to Steffen Forkmann for help with building NuGet packages).

The build scenario is the following (a rough sketch of how these steps could be wired together as FAKE targets follows the list):

  1. Download the zip archive with the *.jar files and trained models from the Stanford NLP site (they can be large, up to 200Mb as for the Stanford Parser, and I do not want to store all this stuff in my repository).
  2. Download the IKVM.NET compiler as a zip archive. (It is not distributed with the NuGet package and is not referenced from the IKVM.NET site. It is really tricky to find it for the first time.)
  3. Unzip all downloaded archives.
  4. Carefully recompile all required *.jar files considering all references.
  5. Sign all compiled assemblies to be able to deploy them to the GAC if needed.
  6. Compile NuGet package.
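Conceptually, these steps chain into a single pipeline, which is exactly what the post title hints at. Below is a hypothetical sketch of how they could be wired together as FAKE targets; the target names and empty bodies are placeholders, and the real implementations of the individual steps are shown in the rest of this post.

// Hypothetical wiring of the build steps as FAKE targets (FAKE 2 syntax);
// the bodies are placeholders, the actual helpers appear later in the post.
Target "Download"    (fun _ -> ())   // fetch the *.jar archives and the IKVM.NET compiler
Target "Unzip"       (fun _ -> ())   // extract all downloaded archives
Target "IKVMCompile" (fun _ -> ())   // recompile the *.jar files into .NET assemblies
Target "Sign"        (fun _ -> ())   // re-sign the compiled assemblies with a strong-name key
Target "NuGet"       (fun _ -> ())   // pack everything into a NuGet package

"Download"
    ==> "Unzip"
    ==> "IKVMCompile"
    ==> "Sign"
    ==> "NuGet"

RunTargetOrDefault "NuGet"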

Steps 1–5 are not covered by FAKE tasks out of the box, and I needed to implement them myself. Since I wanted to use F# 3.0 features and .NET 4.5 capabilities (like System.IO.Compression.FileSystem.ZipFile for unzipping), I chose the pre-release version of FAKE 2 that uses the .NET 4 runtime. The pre-release version of FAKE can be restored from NuGet as follows:

"nuget.exe" "install" "FAKE" "-Pre" "-OutputDirectory" "..\build" "-ExcludeVersion"

Download manager

Requirements: For sure, I do not want to download files from the Internet during each build. Before downloading files, I want to check their presence on the file system; if they are missing, then start downloading. During downloading, I want to see the progress status to be sure that everything works. The code that does it:

#r "System.IO.Compression.FileSystem.dll"
let downloadDir = @".\Download\"

let restoreFile url =
    let downloadFile file url =
        printfn "Downloading file '%s' to '%s'..." url file
        let BUFFER_SIZE = 16*1024
        use outputFileStream = File.Create(file, BUFFER_SIZE)
        let req = System.Net.WebRequest.Create(url)
        use response = req.GetResponse()
        use responseStream = response.GetResponseStream()
        let printStep = 100L*1024L
        let buffer = Array.create<byte> BUFFER_SIZE 0uy
        let rec download downloadedBytes =
            let bytesRead = responseStream.Read(buffer, 0, BUFFER_SIZE)
            outputFileStream.Write(buffer, 0, bytesRead)
            if (downloadedBytes/printStep <> (downloadedBytes-int64(bytesRead))/printStep)
                then printfn "\tDownloaded '%d' bytes" downloadedBytes
            if (bytesRead > 0) then download (downloadedBytes + int64(bytesRead))
        download 0L
    let file = downloadDir @@ System.IO.Path.GetFileName(url)
    if (not <| File.Exists(file))
        then url |> downloadFile file
    file
let unZipTo toDir file =
    printfn "Unzipping file '%s' to '%s'" file toDir
    Compression.ZipFile.ExtractToDirectory(file, toDir)
let restoreFolderFromUrl folder url =
    if not <| Directory.Exists folder
        then url |> restoreFile |> unZipTo (folder @@ @"..\")

let restoreFolderFromFile folder zipFile =
    if not <| Directory.Exists folder
        then zipFile |> unZipTo (folder @@ @"..\")

IKVM.NET Compiler

The compiler should be able to rebuild any number of *.jar files with predefined dependencies and sign the resulting *.dll files if required.

let ikvmc =
    restoreFolderFromUrl @".\temp\ikvm-7.3.4830.0" "http://www.frijters.net/ikvmbin-7.3.4830.0.zip"
    @".\temp\ikvm-7.3.4830.0\bin\ikvmc.exe"
let ildasm = @"c:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\x64\ildasm.exe"
let ilasm = @"c:\Windows\Microsoft.NET\Framework64\v2.0.50727\ilasm.exe"
type IKVMcTask(jar:string) =
    member val JarFile = jar
    member val Version = "" with get, set
    member val Dependencies = List.empty<IKVMcTask> with get, set
let timeOut = TimeSpan.FromSeconds(120.0)
let IKVMCompile workingDirectory keyFile tasks =
    let getNewFileName newExtension (fileName:string) =
        Path.GetFileName(fileName).Replace(Path.GetExtension(fileName), newExtension)
    let startProcess fileName args =
        let result =
            ExecProcess
                (fun info ->
                    info.FileName <- fileName
                    info.WorkingDirectory <- FullName workingDirectory
                    info.Arguments <- args)
                timeOut
        if result<> 0 then
            failwithf "Process '%s' failed with exit code '%d'" fileName result
    let newKeyFile =
        let file = workingDirectory @@ (Path.GetFileName(keyFile))
        File.Copy(keyFile, file, true)
        Path.GetFileName(file)
    let rec compile (task:IKVMcTask) =
        let getIKVMCommandLineArgs() =
            let sb = Text.StringBuilder()
            task.Dependencies |> Seq.iter
               (fun x ->
                   compile x
                   x.JarFile |> getNewFileName ".dll" |> bprintf sb " -r:%s")
            if not <| String.IsNullOrEmpty(task.Version)
                then task.Version |> bprintf sb " -version:%s"
            bprintf sb " %s -out:%s"
                (task.JarFile |> getNewFileName ".jar")
                (task.JarFile |> getNewFileName ".dll")
            sb.ToString()
        File.Copy(task.JarFile, workingDirectory @@ (Path.GetFileName(task.JarFile)) ,true)
        startProcess ikvmc (getIKVMCommandLineArgs())

        if (File.Exists(keyFile)) then
            let dllFile = task.JarFile |> getNewFileName ".dll"
            let ilFile = task.JarFile |> getNewFileName ".il"
            startProcess ildasm (sprintf " /all /out=%s %s" ilFile dllFile)
            File.Delete(dllFile)
            startProcess ilasm (sprintf " /dll /key=%s %s" (newKeyFile) ilFile)
    tasks |> Seq.iter compile

Results

Using these helper functions, build scripts come out pretty straightforward and easy. For example, the recompilation of the Stanford Parser looks as follows:

Target "RunIKVMCompiler" (fun _ ->
    restoreFolderFromUrl
        @".\temp\stanford-parser-full-2013-06-20"
        "http://nlp.stanford.edu/software/stanford-parser-full-2013-06-20.zip"
    restoreFolderFromFile
        @".\temp\stanford-parser-full-2013-06-20\edu"
        @".\temp\stanford-parser-full-2013-06-20\stanford-parser-3.2.0-models.jar"

    [IKVMcTask(@"temp\stanford-parser-full-2013-06-20\stanford-parser.jar",
        Version=version,
        Dependencies =
            [IKVMcTask(@"temp\stanford-parser-full-2013-06-20\ejml-0.19-nogui.jar",
                       Version="0.19.0.0")])]
    |> IKVMCompile ikvmDir @".\Stanford.NLP.snk"
)

All source code is available on GitHub.

Stanford Log-linear Part-Of-Speech Tagger is available on NuGet

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

There is one more tool that has become available on NuGet today: the Stanford Log-linear Part-Of-Speech Tagger. This is the third Stanford NuGet package published by me; the previous ones were the “Stanford Parser“ and “Stanford Named Entity Recognizer (NER)“. I have already posted about this tool with guidance on how to recompile it and use it from F# (see “NLP: Stanford POS Tagger with F# (.NET)“). Please follow the next steps to get started:

F# Sample

For more details see source code on GitHub.

let model = @"..\..\..\..\temp\stanford-postagger-2013-06-20\models\wsj-0-18-bidirectional-nodistsim.tagger"

let tagReader (reader:Reader) =
    let tagger = MaxentTagger(model)
    MaxentTagger.tokenizeText(reader)
    |> Iterable.toSeq
    |> Seq.iter (fun sentence ->
        let tSentence = tagger.tagSentence(sentence :?> List)
        printfn "%O" (Sentence.listToString(tSentence, false))
    )

let tagFile (fileName:string) =
    tagReader (new BufferedReader(new FileReader(fileName)))

let tagText (text:string) =
    tagReader (new StringReader(text))

C# Sample

For more details see source code on GitHub.

public static class TaggerDemo
{
    public const string Model =
        @"..\..\..\..\temp\stanford-postagger-2013-06-20\models\wsj-0-18-bidirectional-nodistsim.tagger";

    private static void TagReader(Reader reader)
    {
        var tagger = new MaxentTagger(Model);
        foreach (List sentence in MaxentTagger.tokenizeText(reader).toArray())
        {
             var tSentence = tagger.tagSentence(sentence);
             System.Console.WriteLine(Sentence.listToString(tSentence, false));
        }
    }

    public static void TagFile (string fileName)
    {
        TagReader(new BufferedReader(new FileReader(fileName)));
    }

    public static void TagText(string text)
    {
        TagReader(new StringReader(text));
    }
}

As a result of both samples you will see the same output. For example, if you start the program with these parameters:

1 text "A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads 
text in some language and assigns parts of speech to each word (and other token), 
such as noun, verb, adjective, etc., although generally computational 
applications use more fine-grained POS tags like 'noun-plural'."

Then you will see the following on your screen:

A/DT Part-Of-Speech/NNP Tagger/NNP -LRB-/-LRB- POS/NNP Tagger/NNP -RRB-/-RRB- 
is/VBZ a/DT piece/NN of/IN software/NN that/WDT reads/VBZ text/NN in/IN some/DT 
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO each/DT word/NN 
-LRB-/-LRB- and/CC other/JJ token/JJ -RRB-/-RRB- ,/, such/JJ as/IN noun/JJ ,/, 
verb/JJ ,/, adjective/JJ ,/, etc./FW ,/, although/IN generally/RB computational/JJ 
applications/NNS use/VBP more/RBR fine-grained/JJ POS/NNP tags/NNS like/IN `/`` 
noun-plural/JJ '/'' ./.

Stanford Named Entity Recognizer (NER) is available on NuGet

Update (2017, July 24): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

One more tool from the Stanford NLP product line became available on NuGet today. It is the second library that I have recompiled and published to NuGet: the first one was the “Stanford Parser“, and this one is the Stanford Named Entity Recognizer (NER). I have already posted about this tool with guidance on how to recompile it and use it from F# (see “NLP: Stanford Named Entity Recognizer with F# (.NET)“). There are some other interesting things happening; NER is kind of a hot topic. I recently saw a question about C# NER on CodeProject, and Flo asked me about NER in the comments of another post. So, I am happy to make it more widely available. The flow of use is as follows:

F# Sample

F# sample is pretty much the same as in ”NLP: Stanford Named Entity Recognizer with F# (.NET)” post. For more details see source code on GitHub.

let main file =
    let classifier =
        CRFClassifier.getClassifierNoExceptions(
             @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz")
    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes.  For the file, it shows both how to run NER on a String
    // and how to run it on a whole file.  For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.
    match file with
    | Some(fileName) ->
        let fileContents = File.ReadAllText(fileName)
        classifier.classify(fileContents)
        |> Iterable.toSeq
        |> Seq.cast<java.util.List>
        |> Seq.iter (fun sentence ->
            sentence
            |> Iterable.toSeq
            |> Seq.cast<CoreLabel>
            |> Seq.iter (fun word ->
                 printf "%s/%O " (word.word()) (word.get(CoreAnnotations.AnswerAnnotation().getClass()))
            )
            printfn ""
        )
    | None ->
        let s1 = "Good afternoon Rajat Raina, how are you today?"
        let s2 = "I go to school at Stanford University, which is located in California."
        printfn "%s\n" (classifier.classifyToString(s1))
        printfn "%s\n" (classifier.classifyWithInlineXML(s2))
        printfn "%s\n" (classifier.classifyToString(s2, "xml", true));
        classifier.classify(s2)
        |> Iterable.toSeq
        |> Seq.iteri (fun i coreLabel ->
            printfn "%d\n:%O\n" i coreLabel
        )

C# Sample

C# version is quite similar. For more details see source code on GitHub.

class Program
{
    public static CRFClassifier Classifier =
        CRFClassifier.getClassifierNoExceptions(
             @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz");

    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes.  For the file, it shows both how to run NER on a String
    // and how to run it on a whole file.  For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            var fileContent = File.ReadAllText(args[0]);
            foreach (List sentence in Classifier.classify(fileContent).toArray())
            {
                foreach (CoreLabel word in sentence.toArray())
                {
                    Console.Write( "{0}/{1} ", word.word(), word.get(new CoreAnnotations.AnswerAnnotation().getClass()));
                }
                Console.WriteLine();
            }
        } else
        {
            const string S1 = "Good afternoon Rajat Raina, how are you today?";
            const string S2 = "I go to school at Stanford University, which is located in California.";
            Console.WriteLine("{0}\n", Classifier.classifyToString(S1));
            Console.WriteLine("{0}\n", Classifier.classifyWithInlineXML(S2));
            Console.WriteLine("{0}\n", Classifier.classifyToString(S2, "xml", true));

            var classification = Classifier.classify(S2).toArray();
            for (var i = 0; i < classification.Length; i++)
            {
                Console.WriteLine("{0}\n:{1}\n", i, classification[i]);
            }
        }
    }
}

As a result of both samples you will see the following output:

Don/PERSON Syme/PERSON is/O an/O Australian/O computer/O scientist/O and/O a/O 
Principal/O Researcher/O at/O Microsoft/ORGANIZATION Research/ORGANIZATION ,/O 
Cambridge/LOCATION ,/O U.K./LOCATION ./O He/O is/O the/O designer/O and/O 
architect/O of/O the/O F/O #/O programming/O language/O ,/O described/O by/O 
a/O reporter/O as/O being/O regarded/O as/O ``/O the/O most/O original/O new/O 
face/O in/O computer/O languages/O since/O Bjarne/PERSON Stroustrup/PERSON 
developed/O C/O +/O +/O in/O the/O early/O 1980s/O ./O
Earlier/O ,/O Syme/PERSON created/O generics/O in/O the/O ./O NET/O Common/O 
Language/O Runtime/O ,/O including/O the/O initial/O design/O of/O generics/O 
for/O the/O C/O #/O programming/O language/O ,/O along/O with/O others/O 
including/O Andrew/PERSON Kennedy/PERSON and/O later/O Anders/PERSON 
Hejlsberg/PERSON ./O Kennedy/PERSON ,/O Syme/PERSON and/O Yu/PERSON also/O 
formalized/O this/O widely/O used/O system/O ./O
He/O holds/O a/O Ph.D./O from/O the/O University/ORGANIZATION of/ORGANIZATION 
Cambridge/ORGANIZATION ,/O and/O is/O a/O member/O of/O the/O WG2/O .8/O 
working/O group/O on/O functional/O programming/O ./O He/O is/O a/O co-author/O 
of/O the/O book/O Expert/O F/O #/O 2.0/O ./O
In/O the/O past/O he/O also/O worked/O on/O formal/O specification/O ,/O 
interactive/O proof/O ,/O automated/O verification/O and/O proof/O description/O 
languages/O ./O

Stanford Parser is available on NuGet for F# and C#

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

I have already written a small series of posts about porting Stanford NLP products to .NET using IKVM.NET. The first one was about the Stanford Parser, “NLP: Stanford Parser with F# (.NET)“; it shows how to recompile the parser and use it from F#. Recently I wrote one more post, “FSharp.NLP.Stanford.Parser available on NuGet“, that announced an already recompiled version of the Stanford Parser, included in a NuGet package with some helper functionality for F# devs.

As I see it, it is still not as simple as it should be. I sometimes see questions from C# guys about different NLP tasks with answers pointing to my “The Stanford Natural Language Processing Samples, in F#” repository (like this one). Probably, it is not so easy to find the latest version of the IKVM.NET compiler (it is not included in the IKVM.NET NuGet package) and to quickly rebuild the Stanford Parser from scratch for the first time.

I have decided to create a NuGet package with a clean port of the Stanford Parser to .NET, with strongly signed assemblies and without dependencies on F#. My primary goal has been to find a clear, simple and intuitive way to try NLP magic from .NET for all NLP lovers. Now it is simpler than ever:

F# Sample

F# sample is not much different from one mentioned in “NLP: Stanford Parser with F# (.NET)” post. For more details see source code on GitHub.

let demoDP (lp:LexicalizedParser) (fileName:string) =
    // This option shows loading and sentence-segment and tokenizing
    // a file using DocumentPreprocessor
    let tlp = PennTreebankLanguagePack();
    let gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    DocumentPreprocessor(fileName)
    |> Iterable.toSeq
    |> Seq.cast<List>
    |> Seq.iter (fun sentence ->
        let parse = lp.apply(sentence);
        parse.pennPrint();

        let gs = gsf.newGrammaticalStructure(parse);
        let tdl = gs.typedDependenciesCCprocessed(true);
        printfn "\n%O\n" tdl
    )

let demoAPI (lp:LexicalizedParser) =
    // This option shows parsing a list of correctly tokenized words
    let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
    let rawWords = Sentence.toCoreLabelList(sent)
    let parse = lp.apply(rawWords)
    parse.pennPrint()

    // This option shows loading and using an explicit tokenizer
    let sent2 = "This is another sentence."
    let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
    use sent2Reader = new StringReader(sent2)
    let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
    let parse = lp.apply(rawWords2)

    let tlp = PennTreebankLanguagePack()
    let gsf = tlp.grammaticalStructureFactory()
    let gs = gsf.newGrammaticalStructure(parse)
    let tdl = gs.typedDependenciesCCprocessed()
    printfn "\n%O\n" tdl

    let tp = new TreePrint("penn,typedDependenciesCollapsed")
    tp.printTree(parse)

let main fileName =
    let lp = LexicalizedParser.loadModel(@"...\englishPCFG.ser.gz")
    match fileName with
    | Some(file) -> demoDP lp file
    | None -> demoAPI lp

C# Sample

C# version is quite similar. For more details see source code on GitHub.

public static class ParserDemo
{
    public static void DemoDP(LexicalizedParser lp, string fileName)
    {
        // This option shows loading and sentence-segment and tokenizing
        // a file using DocumentPreprocessor
        var tlp = new PennTreebankLanguagePack();
        var gsf = tlp.grammaticalStructureFactory();
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        foreach (List sentence in new DocumentPreprocessor(fileName))
        {
            var parse = lp.apply(sentence);
            parse.pennPrint();

            var gs = gsf.newGrammaticalStructure(parse);
            var tdl = gs.typedDependenciesCCprocessed(true);
            System.Console.WriteLine("\n{0}\n", tdl);
        }
    }

    public static void DemoAPI(LexicalizedParser lp)
    {
        // This option shows parsing a list of correctly tokenized words
        var sent = new[] { "This", "is", "an", "easy", "sentence", "." };
        var rawWords = Sentence.toCoreLabelList(sent);
        var parse = lp.apply(rawWords);
        parse.pennPrint();

        // This option shows loading and using an explicit tokenizer
        const string Sent2 = "This is another sentence.";
        var tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        var sent2Reader = new StringReader(Sent2);
        var rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize();
        parse = lp.apply(rawWords2);

        var tlp = new PennTreebankLanguagePack();
        var gsf = tlp.grammaticalStructureFactory();
        var gs = gsf.newGrammaticalStructure(parse);
        var tdl = gs.typedDependenciesCCprocessed();
        System.Console.WriteLine("\n{0}\n", tdl);

        var tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }

    public static void Start(string fileName)
    {
         var lp = LexicalizedParser.loadModel(Program.ParserModel);
         if (!String.IsNullOrEmpty(fileName))
              DemoDP(lp, fileName);
         else
              DemoAPI(lp);
    }
}

As a result of both samples you will see the following output:

Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\
stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... 
done [1.5 sec].
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT an) (JJ easy) (NN sentence)))
 (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), 
root(ROOT-0, sentence-4)]
(ROOT
 (S
 (NP (DT This))
 (VP (VBZ is)
 (NP (DT another) (NN sentence)))
 (. .)))
nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)

FSharp.NLP.Stanford.Parser available on NuGet

There are two pieces of good news for all F# NLP lovers.

News #1: Using The Stanford Parser from F# is easier than it has ever been.

From now on, the latest version (v3.2.0.0) of The Stanford Parser is available on NuGet.

All you need to do is:

  1. Install-Package FSharp.NLP.Stanford.Parser
  2. Download models from The Stanford NLP Group site.
  3. Extract the models from stanford-parser-3.2.0-models.jar (just unzip it)
  4. You are ready to start.

If you need examples, please look at my previous post ‘NLP: Stanford Parser with F# (.NET)‘.

News #2: FSharp.NLP.Stanford.Parser is the first ever attempt to build a strongly typed, self-descriptive Penn Treebank II tag set model using the power of F# Discriminated Unions and Active Patterns.
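To give a flavor of this idea, here is an illustrative fragment (simplified names and only a handful of tags, not the library’s actual type definitions) showing how a discriminated union plus active patterns makes the tag set strongly typed and self-descriptive:

// Illustrative only: the real FSharp.NLP.Stanford.Parser model covers the full
// Penn Treebank II tag set; names and groupings here are simplified for the example.
type PennTreebankTag =
    | NN | NNS | NNP | NNPS               // nouns
    | VB | VBD | VBG | VBN | VBP | VBZ    // verbs
    | JJ | JJR | JJS                      // adjectives

let (|Noun|Verb|Adjective|) tag =
    match tag with
    | NN | NNS | NNP | NNPS -> Noun
    | VB | VBD | VBG | VBN | VBP | VBZ -> Verb
    | JJ | JJR | JJS -> Adjective

// Pattern matching over tag groups instead of raw tag strings.
let describe tag =
    match tag with
    | Noun -> "noun"
    | Verb -> "verb"
    | Adjective -> "adjective"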


Enjoy it and feel free to share your feedback.