F# Kung Fu #2: Custom Numeric Literals.

You probably know that F# has a set of predefined numeric literal suffixes that let you clarify the meaning of numbers. For example, the number 32 is interpreted as int by default. If you add the 'L' suffix – 32L – you get an int64 value. If you add the 'I' suffix – 32I – you get a System.Numerics.BigInteger value.

But F# has an extension point here: you can define a custom interpretation for the suffixes Q, R, Z, I, N and G. The trick is to declare an F# module with a special name and define converter functions in it (for example, NumericLiteralZ for the Z literal).

module NumericLiteralZ =
    let FromZero() = 0
    let FromOne() = 1
    let FromInt32 n =
        let rec sumDigits n acc =
            if (n=0) then acc
            else sumDigits (n/10) (acc+n%10)
        sumDigits n 0
    //let FromInt64 n = ...
    //let FromString s = ...

let x = 11111Z
//val x : int = 5
let y = 123Z
//val y : int = 6

Have fun, but be careful.

Update: Note that you cannot use integer literals with the suffixes Q, R, Z, I, N, G as constants in pattern matching.
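A quick illustration (a minimal sketch reusing the NumericLiteralZ module above): a literal with a custom suffix is not accepted as a pattern, but a when guard works.

// This does not compile: literals with custom suffixes are not valid patterns
//   match x with
//   | 123Z -> "six"
//   | _    -> "other"

// A possible workaround: compare the value inside a 'when' guard instead
let describe x =
    match x with
    | n when n = 123Z -> "six"   // 123Z evaluates to 6 (digit sum of 123)
    | _ -> "other"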

NDepend for F# code or FSharp.Compiler.Service ‘code review’.

ndepend_logo

I had the opportunity to play with NDepend and could not resist trying it on F# code. It is interesting to see how analysis rules designed for languages like C# apply to functional-first F#.

When I downloaded NDepend from the official site, I expected a regular MSI installer with a wizard that walks me through all the steps and does everything for me. Instead, I got a zip archive with binaries that I had to unzip manually, and then run the executables that integrate NDepend with my Visual Studio 2013 and Reflector. The installation guide is simple and detailed enough, and I had no problems following it step by step and configuring everything. Nevertheless, the installation process feels rather unusual these days.

When I launched the standalone version of NDepend (VisualNDepend) and opened the first project I found, I got stuck! There was so much information on my screen that I could not understand where to look. My screen looked like a spaceship control panel and I was afraid to touch it ;). I decided to go back to the Getting Started page and watch the introduction video. Luckily, NDepend looks pretty well documented.

Now I needed to choose an F# project to do “a code review” on. I wanted to find something cool, large and complicated for clarity. And here was my first problem… The first condition is ‘cool’, but most open source F# projects are cool, so this constraint did not help me narrow the choices. The next condition is ‘large’, but we do not have really huge F# projects. Conciseness is one of the main F# advantages (F# dramatically decreases code size and complexity by design). Even the F# compiler is not that big; it is much smaller than my usual C# project. The last condition is ‘complexity’; here the situation is similar to ‘size’: F# really helps to keep complexity at a manageable level. Anyway, I needed to choose one…

Finally, I chose the FSharp.Compiler.Service project for analysis (it is a relatively new project). I had no idea what was inside, so it would be more interesting to explore the source code in such an unusual (for me) way. It is an extremely cool project, probably a step forward for F# and a fundamental improvement that will open a lot of new doors. (You can read more about what it is for and what it can do on the official F# Compiler Services site.) The project should be large and complicated enough, because it is a brand new extended version of the powerful compiler.

ndepend_homeX

Joking aside! Let’s dig deeper into the code.

ndepend-dependencies

In the dependency graph picture, we see the primary assembly FSharp.Compiler.Service (orange in the picture). This assembly depends on a minimal set of assemblies from the .NET Framework (marked blue). We also see that the GitHub project contains sample projects built on top of the compiler service (marked green). So the project structure is simple and quite clear. We can visualize the same dependencies as a dependency matrix:

ndepend-dependencies-matrix

In this matrix, we see more quantitative data about the dependencies. The numbers in the cells show how many members of one assembly are used by another. Using these numbers, we can draw conclusions like “the UntypedTree sample uses more functionality from FSharp.Compiler.Service than the others” or “FsiExe is the only sample that has a Windows Forms user interface”.

NDepend also provides one crazily interesting report – the Treemap Metric View. This report builds a tree of namespaces where the size of each node depends on the number of LOC in that namespace. Such plots show where all the complexity is concentrated. To extract more useful information from this plot, you need some understanding of the code.

VisualNDependView

Finally, the most intriguing part, the analysis dashboard:

NDependDashboard

Based on these stats, we see that FSharp.Compiler.Service contains more than 84,000 lines of code, which is really a lot for an F# project; the average method complexity is 3, which is pretty nice. NDepend also found violations of 12 critical rules; let’s dig into what they are.

NDepend_violatedRules

Unfortunately, NDepend does not support navigation to method declarations for compiled source code in languages other than C#, and this complicates exploring F# code.

NDependSourceDeclaration

To avoid this error (for methods), you need to open an instance of Visual Studio with your solution, and NDepend will navigate you directly to where you wish.

Let’s go through the violated critical rules:

Code Quality:

Methods with too many parameters – critical

This rule is violated when a method has more than 8 parameters. I am going to agree with NDepend here – the F# compiler source code does have this sin. I do not know the exact reason for it, but it is probably reasonable for compiler/parser source code.

Methods too complex – critical

This rule is violated when methods have ILCyclomaticComplexity > 40 and ILNestingDepth > 4. As far as I can see, this happens mostly because NDepend does not understand functions defined inside other functions (which C# does not support). Most of the code that violates this rule is pretty readable: yes, the functionality is wrapped in one large method, but inside it is split into small, handy, readable functions, as in the sketch below.
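A tiny illustration (my own sketch, not taken from the compiler source): for NDepend, the outer function below is a single .NET method, even though the logic is split into small nested helpers.

let formatReport items =
    // nested helpers are compiled into the body of formatReport
    let formatItem (name, value) = sprintf "%s = %d" name value
    let header = sprintf "Total items: %d" (List.length items)
    let body = items |> List.map formatItem |> String.concat "\n"
    header + "\n" + body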

Types too big – critical

This rule is violated when a type contains more than 500 lines of code. This story is mostly not about F# either. F# compiles modules to .NET classes, and you are allowed to have modules as large as you need. Modules are more like namespaces than classes, so the 500 LOC constraint is not really applicable to them.

Object Oriented Design

Do not hide base class methods

The rule is self-explanatory. But in this case we should not pay attention to it, because these 3 violations happen in the source code of ProvidedType, where this is part of the type provider magic.

Architecture and Layering

Avoid namespaces mutually dependent

I am not sure here, but it also looks like an issue related to F# modules. NDepend reasons about namespace dependencies without regard to the fact that the namespaces are divided into modules.

Dead Code

Potentially dead Types, Methods and Fields

Hmm… It really looks like there are some methods inside the compiler that were implemented but are neither used internally nor exposed to the external world. It is probably a secret weapon of F#, sketches of upcoming features.

Visibility

Constructors of abstract classes should be declared as protected or private

This issue is related to the use of F# discriminated unions. F# compiles a discriminated union into a class hierarchy whose root abstract class has a default parameterless constructor with default visibility (which is internal in F#).
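For example (a hypothetical union, not taken from the analyzed code base), a simple definition like the one below is compiled into an abstract root class with nested subclasses, which is what triggers the rule.

// Compiled roughly as: abstract class Shape with nested subclasses Circle and Square,
// where the Shape constructor gets the default (internal) visibility
type Shape =
    | Circle of float
    | Square of float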

Naming Conventions

Note that C# and F# have different design guidelines with slightly different naming conventions.

Avoid having different types with same name

This rule is also mostly violated by F# modules; it is a side effect of how F# modules are compiled.

Exception class name should be suffixed with ‘Exception’

The Exception suffix is rarely used in F# because the language has a special exception keyword for defining F# exceptions.
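For example (a small sketch of my own, not from the analyzed code base):

// Idiomatic F#: no 'Exception' suffix in the name
exception ParseFailed of string

let fail () = raise (ParseFailed "unexpected token")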

Interface name should begin with a ‘I’

F# compiles a type to a .NET interface when all of its members are abstract. Admittedly, we sometimes forget to add the I prefix.
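For example (again a hypothetical definition), the following type compiles to an interface even though its name does not start with I:

// Compiles to a .NET interface because every member is abstract
type Printable =
    abstract Print : unit -> string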

Conclusion

Finally, NDepend is a really nice tool. It has a certain barrier to entry that forces you to refer to the documentation, but it looks like a very powerful tool in skillful hands. It is absolutely invaluable in the C# world when you want to understand what the hell is going on in the code, but it is also applicable to F# for seeing the big picture.

NDepend is highly customizable. The default set of code verification rules targets C# source code, but you can modify the existing rules and/or create new ones designed for F#. Hopefully, one day such rules will become available in the default distribution and F# will be officially supported by the NDepend team.

P.S. I have tried only a basic feature set; you can read more about advanced features in the official documentation.

Twitter Pulse #fsharp 2013

F# Twitter Pulse 2013

I have previously published statistics about the #fsharp Twitter hashtag. This edition is based on the tweets that I collected throughout the whole of 2013. Every week, while working on F# Weekly, I downloaded all tweets with the hashtag #fsharp and pushed them into my local MongoDB. I managed to collect about 24,000 tweets from more than 3,000 Twitter users.

The first picture at the top of this page is based on the total Twitter activity around the #fsharp hashtag. The size of each Twitter name depends on the total number of that person’s tweets and retweets. The other charts are self-descriptive. Thank you guys for your effort, let’s make 2014 even better.

top30writerstop30retweeters

weeklyTwitterActivity

Update:

Michael Newton proposed an interesting idea: “Calculate a number that measures F# community love”. I calculated the number of retweets for each Twitter account and divided it by the number of unique tweets from that account (both with the #fsharp hashtag). Finally, I excluded accounts with fewer than 10 new tweets about F# in the year (a rough code sketch of the calculation follows the chart). Here it is:

rperp2
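For reference, here is a minimal sketch of that calculation. It assumes two hypothetical maps built from the collected tweets: originalsBy maps an account to its number of unique #fsharp tweets, and retweetsOf maps an account to the number of times it was retweeted.

let communityLove (originalsBy: Map<string,int>) (retweetsOf: Map<string,int>) =
    originalsBy
    |> Map.filter (fun _ originals -> originals >= 10)   // drop accounts with fewer than 10 new tweets
    |> Map.map (fun account originals ->
        let retweets = defaultArg (retweetsOf.TryFind account) 0
        float retweets / float originals)                // retweets per unique tweet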

Twitter Followers Map with RProvider

Today @oppenheimmd retweeted a nice tweet about building a Twitter followers map with R. Naturally, I decided to build my own map, and here it is:

twitterMap

The total number of followers on the map is smaller than on Twitter. I think this happens because not all people have specified a location in their account settings. Note that to be able to execute this script, you need to specify a location in your own Twitter account.

As you probably understood from the title, I built this picture using the RProvider instead of executing the existing R code directly. Actually, using twitterMap.R is pretty simple if you ignore the difficulties with Twitter authorization and SSL certificate validation (this part is a bit ugly).

For this demo we need two R packages, twitteR and RCurl (with all their dependencies). Please install them:

#I @"..\packages\RProvider.1.0.5\"
#load "RProvider.fsx"

//open RProvider.utils
//R.install_packages("twitteR")
//R.install_packages("RCurl")

open RDotNet
open RProvider
open RProvider.utils
open RProvider.``base``
open RProvider.twitteR
open RProvider.RCurl
open RProvider.ROAuth

I am a bit too lazy to fight with the RProvider syntax in some places. Actually, I do not even know whether it is possible to rewrite such R code using the RProvider or not… I decided to cheat a bit and define a function that takes an R expression as a string and evaluates it.

let eval (text:string) =
    namedParams ["text", box text] |> R.parse |> R.eval

You need a consumerKey and consumerSecret from your registered Twitter application. If you do not have them yet, please follow the steps from this article, which helps you register a new Twitter application. The following code is intended to authenticate you with Twitter:

let twitCred =
    namedParams [
        // TODO: insert your consumerKey and consumerSecret
        "consumerKey", box "xxxxxxxxxxxxxxxxxxxxxx"
        "consumerSecret", box "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        "requestURL", box "https://api.twitter.com/oauth/request_token"
        "accessURL", box "http://api.twitter.com/oauth/access_token"
        "authURL", box "http://api.twitter.com/oauth/authorize" ]
    |> R.OAuthFactory

R.download_file(url = "http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
R.assign("twitCred", twitCred)
eval """twitCred$handshake(cainfo="cacert.pem")"""

Here you need to do some manual work. These are the last authentication steps:

  1. Copy the URL from the FSI window and paste it into your browser
  2. Allow your Twitter app to access your account data
  3. Copy the authorization code from the browser
  4. Paste the code into FSI and press Enter

After that, you need to register your authorization data and set the SSL certificate to be used globally, which allows twitterMap.R to communicate with Twitter under your account.

R.registerTwitterOAuth(twitCred)
eval """options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))"""

The last step is to run the script and plot your own map:

R.source("http://biostat.jhsph.edu/~jleek/code/twitterMap.R")
// TODO: do not forget to specify your twitter login and increase nMax if you have more than 5000 followers
eval """twitterMap("your_login", fileName="d:\\TwitterMap.pdf", nMax=5000)"""

P.S. If you rewrite this script without eval please post a link in comments 😉

F# Kung Fu #1: Mastering F# Script references.

One of the most boring parts of working with F# script files is external references. We need to write a lot of directives with paths to the required files.
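For example, a typical script header might look like this (the paths and package versions are purely illustrative):

#I @"..\packages\FSharp.Data.2.0.0\lib\net40"   // add a search path for references
#r "FSharp.Data.dll"                            // reference an assembly
#load "Helpers.fs"                              // load another source file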

The new Visual Studio 2013 can help a bit here: a new ‘Send to Interactive‘ button is available. Now you can avoid typing an extra command to load references for interactive execution of a small part of your application. But what should we do if an F# script is our goal?

If you are a VS2012 user, a nice plugin from Tao Liu is available for you: “AddReferenceInFSI“. It is a really nice extension, but it is still not a silver bullet.

What do real gurus do in such a situation? Tomas Petricek has shared one typing trick. You can find it in his latest Channel 9 video “Understanding the World with F#”, starting from 4:40. What?!? How did he open a file picker inside the source code file, choose the file he wanted and insert a relative path to it directly into the code? Why do I always type these long and boring paths if it can be done so easily? Today the truth comes out!

Tomas was so kind as to reveal the secret of this trick:

The truth is not as magical as it seems at first – just use the power of your file editor. Let’s repeat all the steps once again to remember them better:

  1. Find the place where you want to insert the file path
  2. Press Ctrl+O, which opens the standard file picker. By default, VS opens the dialog in the directory where your current file is saved.
  3. Start typing a relative or absolute path to your file, BUT do not use the mouse – you can only use auto-complete in the file path edit box.
  4. When you have found the file, select the path to it (Ctrl+A)
  5. Copy it (Ctrl+C)
  6. Close the file picker (Esc)
  7. Paste the path into your script (Ctrl+V)

Do not type boring paths – do it like Tomas 😉

Update from Yan Cui: there is a useful script from Gustavo Guerra that you can load in every FSI session to save your time.

F# Neural Networks with FsLab

nn_preview

Neural networks are a very powerful tool, and at the same time it is not easy to use all of their power. Now we are one step closer to it from F# and .NET. We will delegate model training to R using the R Provider. We will also use Deedle (which was announced a few days ago) for handy data manipulation.

Prerequisites:

Learning from Data:

First of all, we need to load the required assemblies into our FSI session. This is pretty easy with FsLab because the package has a bootstrapping script.

#load "..\packages\FsLab.0.1.4\FsLab.fsx"

The next step is to download and install the missing R packages. For this demo, we need neuralnet for training the neural network model and prediction, and caret for data visualization.

open RProvider.utils
R.install_packages("MASS")
R.install_packages("pbkrtest")
R.install_packages("lattice")
R.install_packages("Matrix")
R.install_packages("mgcv")
R.install_packages("grid")
R.install_packages("neuralnet")
R.install_packages("caret")
R.install_packages("zoo")

Now we are ready to start. We need to open the namespaces and load a data set. For this demo, we have chosen the iris data set, which is a classic in lots of demos.

open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets
open RProvider.neuralnet
open RProvider.caret

let iris : Frame<int, string> = R.iris.GetValue()

To better understand what we are going to do, let’s plot this data set. First of all, split the data into two parts: features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and a target variable (Species). After that, plot the data along different dimensions (different colors represent different Species).

let features =
    iris
    |> Frame.filterCols (fun c _ -> c <> "Species")
    |> Frame.mapColValues (fun c -> c.As<double>())
let targets =
    R.as_factor(iris.Columns.["Species"])

R.featurePlot(x = features, y = targets, plot = "pairs")

nn_features

As you see, our task is not trivial – we have 3 classes instead of 2 (which is not the classic situation) and the classes are not clearly separable. Nevertheless, let’s try! First of all, we need to split our data into 2 parts – training and testing data sets (70% vs 30%). The first part will be fed to the neural network for learning, the second one will be used to measure model quality. Let’s also shuffle the data to be fair.

iris.ReplaceColumn("Species", targets.AsNumeric())
let range = [1..iris.RowCount]
let trainingIdxs : int[] = R.sample(range, iris.RowCount*7/10).GetValue()
let testingIdxs : int[] = R.setdiff(range, trainingIdxs).GetValue()
let trainingSet = iris.Rows.[trainingIdxs]
let testingSet = iris.Rows.[testingIdxs]

Now we are ready to train a neural network. All we need is to provide a formula (specifying what the input to our model is and what the output is) – “Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width” – provide a data set and specify the structure of the hidden layers. In the following example, we train a network with two layers of hidden nodes: the first layer with 3 nodes and the second with 2 nodes.

let nn =
    R.neuralnet(
        "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width",
        data = trainingSet, hidden = R.c(3,2),
        err_fct = "ce", linear_output = true)

// Plot the resulting neural network with coefficients
R.eval(R.parse(text="library(grid)"))
R.plot_nn nn

nn_network

Cool! How simple that was. To be able to measure the quality of the classification, we need to split our testing set into features and targets.

let testingFeatures =
    testingSet
    |> Frame.filterCols (fun c _ -> c <> "Species")
    |> Frame.mapColValues (fun c -> c.As<double>())
let testingTargets =
    testingSet.Columns.["Species"].As<int>().Values

To execute the neural network on new data (apply our classification), we call the R.compute method and pass the testing features to it.

let prediction =
    R.compute(nn, testingFeatures)
        .AsList().["net.result"].AsVector()
    |> Seq.cast<double>
    |> Seq.map (round >> int)

Finally, let’s compare the prediction results with the testing values:

let misclassified =
    Seq.zip prediction testingTargets
    |> Seq.filter (fun (a,b) -> a<>b)
    |> Seq.length

printfn "Misclassified irises '%d' of '%d'" misclassified (testingSet.RowCount)

If you execute all these steps one by one, you will see that there are only ~3 misclassifications out of 45 samples. Pretty good quality.

Full script:

#load "..\packages\FsLab.0.1.4\FsLab.fsx"

// You need to install the R packages below if you do not have them
open RProvider.utils
R.install_packages("MASS")
R.install_packages("pbkrtest")
R.install_packages("lattice")
R.install_packages("Matrix")
R.install_packages("mgcv")
R.install_packages("grid")
R.install_packages("neuralnet")
R.install_packages("caret")
R.install_packages("zoo")

open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets
open RProvider.neuralnet
open RProvider.caret

// Load data from R to Deedle frame
let iris : Frame<int, string> = R.iris.GetValue()

// Observe iris data set
let features =
    iris
    |> Frame.filterCols (fun c _ -> c <> "Species")
    |> Frame.mapColValues (fun c -> c.As<double>())
let targets =
    R.as_factor(iris.Columns.["Species"])

R.featurePlot(x = features, y = targets, plot = "pairs")

iris.ReplaceColumn("Species", targets.AsNumeric())
// Split data to training and testing sets (70% vs 30%)
let range = [1..iris.RowCount]
let trainingIdxs : int[] = R.sample(range, iris.RowCount*7/10).GetValue()
let testingIdxs : int[] = R.setdiff(range, trainingIdxs).GetValue()
let trainingSet = iris.Rows.[trainingIdxs]
let testingSet = iris.Rows.[testingIdxs]

// Train neural network
let nn =
    R.neuralnet(
        "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width",
        data = trainingSet, hidden = R.c(3,2),
        err_fct = "ce", linear_output = true)

// Plot the resulting neural network with coefficients
R.eval(R.parse(text="library(grid)"))
R.plot_nn nn

// Split testing set into features and targets
let testingFeatures =
    testingSet
    |> Frame.filterCols (fun c _ -> c <> "Species")
    |> Frame.mapColValues (fun c -> c.As<double>())
let testingTargets =
    testingSet.Columns.["Species"].As<int>().Values

// Predict `Species` for testingFeatures with neural network
let prediction =
    R.compute(nn, testingFeatures)
        .AsList().["net.result"].AsVector()
    |> Seq.cast<double>
    |> Seq.map (round >> int)

// Calculate number of misclassified irises
let misclassified =
    Seq.zip prediction testingTargets
    |> Seq.filter (fun (a,b) -> a<>b)
    |> Seq.length

printfn "Misclassified irises '%d' of '%d'" misclassified (testingSet.RowCount)

P.S.

Note: if you have problems with bootstrapping the RProvider and/or converting R data frames to Deedle data frames, verify that during the installation of the NuGet packages all assemblies were copied to the RProvider’s lib sub-folder (see the following picture).

deedle_rprovider

F# Interactive “branding”

The FSI console has a pretty small font size by default, which is really uncomfortable when sharing a screen through a projector: source code in FSI is always small and hard to read. I never thought (until today) that I could configure the font, colors, font size and so on. In fact, it is very easy to do:

  1. Click Tools -> Options.
  2. Select Environment -> Fonts and Colors.
  3. In the ‘Show settings for‘ drop-down, select ‘F# Interactive‘.
  4. Here it is – you can change whatever you want.
  5. That’s wonderful!

Stanford CoreNLP is available on NuGet for F#/C# devs

Update (January 3, 2014): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

nlp-logo-navbar

Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP integrates all Stanford NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system, and provides model files for the analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.

Stanford CoreNLP is now available on NuGet. It is probably the most powerful package of all The Stanford NLP Group’s software packages. Please read the usage overview on the Stanford CoreNLP home page to understand what it can do, how you can configure an annotation pipeline, what steps are available to you, what models you need, and so on.

I want to say thank you to Anonymous 😉 and @OneFrameLink for their contributions and for stimulating me to finish this work.

Please follow the next steps to get started:

Before using Stanford CoreNLP, we need to define and specify an annotation pipeline, for example annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref.

The next thing we need to do is to create a StanfordCoreNLP pipeline. But to instantiate a pipeline, we need to specify all the required properties, or at least the paths to all models used by the pipeline that are specified in the annotators string. Before starting with the samples, let’s define some helpers that will be used across all the code pieces: jarRoot is the path to the folder where we extracted the files from stanford-corenlp-3.2.0-models.jar; modelsRoot is the path to the folder with all the model files; ‘!’ is an overloaded operator that converts a model name into a relative path to the model file.

let (@@) a b = System.IO.Path.Combine(a,b)
let jarRoot = __SOURCE_DIRECTORY__ @@ @"..\..\temp\stanford-corenlp-full-2013-06-20\stanford-corenlp-3.2.0-models\"
let modelsRoot = jarRoot @@ @"edu\stanford\nlp\models\"
let (!) path = modelsRoot @@ path

Now we are ready to instantiate the pipeline, but we need to do a small trick. The pipeline is configured to use the default model files (for simplicity), and all paths are specified relative to the root of stanford-corenlp-3.2.0-models.jar. To make things easier, we can temporarily change the current directory to jarRoot, instantiate the pipeline, and then change the current directory back. This trick helps us dramatically decrease the number of code lines.

let props = Properties()
props.setProperty("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref") |> ignore
props.setProperty("sutime.binders","0") |> ignore

let curDir = System.Environment.CurrentDirectory
System.IO.Directory.SetCurrentDirectory(jarRoot)
let pipeline = StanfordCoreNLP(props)
System.IO.Directory.SetCurrentDirectory(curDir)

However, you do not have to do it this way; you can configure all the models manually. The number of properties (especially paths to models) that you need to specify depends on the annotators value. Let’s assume for a moment that we are in the Java world and we want to configure our pipeline in a custom way. Especially for this case, stanford-corenlp-3.2.0-models.jar contains StanfordCoreNLP.properties (you can find it in the folder with the extracted files), where you can specify new property values outside of code. Most of the properties that we need for configuration are already mentioned in this file, and you can easily understand what is what. But it is not enough to get it working; you also need to look into the source code of Stanford CoreNLP. By the way, a few days ago Stanford moved the CoreNLP source code to GitHub – now it is much easier to browse. The default paths to the models are specified in DefaultPaths.java, the property keys are listed in Constants.java, and the information about which path matches which property name is contained in Dictionaries.java. Thus, you are able to dive deeper into the pipeline configuration and do whatever you want. For lazy people, I already have a working sample.

let props = Properties()
let (<==) key value = props.setProperty(key, value) |> ignore
"annotators"    <== "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
"pos.model"     <== ! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger"
"ner.model"     <== ! @"ner\english.all.3class.distsim.crf.ser.gz"
"parse.model"   <== ! @"lexparser\englishPCFG.ser.gz"

"dcoref.demonym"            <== ! @"dcoref\demonyms.txt"
"dcoref.states"             <== ! @"dcoref\state-abbreviations.txt"
"dcoref.animate"            <== ! @"dcoref\animate.unigrams.txt"
"dcoref.inanimate"          <== ! @"dcoref\inanimate.unigrams.txt"
"dcoref.male"               <== ! @"dcoref\male.unigrams.txt"
"dcoref.neutral"            <== ! @"dcoref\neutral.unigrams.txt"
"dcoref.female"             <== ! @"dcoref\female.unigrams.txt"
"dcoref.plural"             <== ! @"dcoref\plural.unigrams.txt"
"dcoref.singular"           <== ! @"dcoref\singular.unigrams.txt"
"dcoref.countries"          <== ! @"dcoref\countries"
"dcoref.extra.gender"       <== ! @"dcoref\namegender.combine.txt"
"dcoref.states.provinces"   <== ! @"dcoref\statesandprovinces"
"dcoref.singleton.predictor"<== ! @"dcoref\singleton.predictor.ser"

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
"sutime.rules"      <== sutimeRules
"sutime.binders"    <== "0"

let pipeline = StanfordCoreNLP(props)

As you see, this option is much longer and harder to get right. I recommend using the first one, especially if you do not need to change the default configuration.

And now the fun part. Everything else is pretty easy: we create an annotation from the text, pass it through the pipeline and interpret the results.

let text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

let annotation = Annotation(text)
pipeline.annotate(annotation)
use stream = new ByteArrayOutputStream()
pipeline.prettyPrint(annotation, new PrintWriter(stream))
printfn "%O" (stream.toString())

Of course, you can also extract all the processing results from the annotated text yourself.

let customAnnotationPrint (annotation:Annotation) =
    printfn "-------------"
    printfn "Custom print:"
    printfn "-------------"
    let sentences = annotation.get(CoreAnnotations.SentencesAnnotation().getClass()) :?> java.util.ArrayList
    for sentence in sentences |> Seq.cast<CoreMap> do
        printfn "\n\nSentence : '%O'" sentence

        let tokens = sentence.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.ArrayList
        for token in (tokens |> Seq.cast<CoreLabel>) do
            let word = token.get(CoreAnnotations.TextAnnotation().getClass())
            let pos  = token.get(CoreAnnotations.PartOfSpeechAnnotation().getClass())
            let ner  = token.get(CoreAnnotations.NamedEntityTagAnnotation().getClass())
            printfn "%O \t[pos=%O; ner=%O]" word pos ner

        printfn "\nTree:"
        let tree = sentence.get(TreeCoreAnnotations.TreeAnnotation().getClass()) :?> Tree
        use stream = new ByteArrayOutputStream()
        tree.pennPrint(new PrintWriter(stream))
        printfn "The first sentence parsed is:\n %O" (stream.toString())

        printfn "\nDependencies:"
        let deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation().getClass()) :?> SemanticGraph
        for edge in deps.edgeListSorted().toArray() |> Seq.cast<SemanticGraphEdge> do
            let gov = edge.getGovernor()
            let dep = edge.getDependent()
            printfn "%O(%s-%d,%s-%d)"
                (edge.getRelation())
                (gov.word()) (gov.index())
                (dep.word()) (dep.index())

The full code sample is available on GitHub; if you run it, you will see the following result:

Sentence #1 (9 tokens):
Kosgi Santosh sent an email to Stanford University.
[Text=Kosgi CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Kosgi NamedEntityTag=PERSON] [Text=Santosh CharacterOffsetBegin=6 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Santosh NamedEntityTag=PERSON] [Text=sent CharacterOffsetBegin=14 CharacterOffsetEnd=18 PartOfSpeech=VBD Lemma=send NamedEntityTag=O] [Text=an CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=email CharacterOffsetBegin=22 CharacterOffsetEnd=27 PartOfSpeech=NN Lemma=email NamedEntityTag=O] [Text=to CharacterOffsetBegin=28 CharacterOffsetEnd=30 PartOfSpeech=TO Lemma=to NamedEntityTag=O] [Text=Stanford CharacterOffsetBegin=31 CharacterOffsetEnd=39 PartOfSpeech=NNP Lemma=Stanford NamedEntityTag=ORGANIZATION] [Text=University CharacterOffsetBegin=40 CharacterOffsetEnd=50 PartOfSpeech=NNP Lemma=University NamedEntityTag=ORGANIZATION] [Text=. CharacterOffsetBegin=50 CharacterOffsetEnd=51 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (NNP Kosgi) (NNP Santosh))
(VP (VBD sent)
(NP (DT an) (NN email))
(PP (TO to)
(NP (NNP Stanford) (NNP University))))
(. .)))

nn(Santosh-2, Kosgi-1)
nsubj(sent-3, Santosh-2)
root(ROOT-0, sent-3)
det(email-5, an-4)
dobj(sent-3, email-5)
nn(University-8, Stanford-7)
prep_to(sent-3, University-8)

Sentence #2 (7 tokens):
He didn’t get a reply.
[Text=He CharacterOffsetBegin=52 CharacterOffsetEnd=54 PartOfSpeech=PRP Lemma=he NamedEntityTag=O] [Text=did CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD Lemma=do NamedEntityTag=O] [Text=n’t CharacterOffsetBegin=58 CharacterOffsetEnd=61 PartOfSpeech=RB Lemma=not NamedEntityTag=O] [Text=get CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=VB Lemma=get NamedEntityTag=O] [Text=a CharacterOffsetBegin=66 CharacterOffsetEnd=67 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=reply CharacterOffsetBegin=68 CharacterOffsetEnd=73 PartOfSpeech=NN Lemma=reply NamedEntityTag=O] [Text=. CharacterOffsetBegin=73 CharacterOffsetEnd=74 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (PRP He))
(VP (VBD did) (RB n’t)
(VP (VB get)
(NP (DT a) (NN reply))))
(. .)))

nsubj(get-4, He-1)
aux(get-4, did-2)
neg(get-4, n’t-3)
root(ROOT-0, get-4)
det(reply-6, a-5)
dobj(get-4, reply-6)

Coreference set:
(2,1,[1,2)) -> (1,2,[1,3)), that is: “He” -> “Kosgi Santosh”

C# Sample

C# samples are also available on GitHub.

Stanford Temporal Tagger (SUTime)

nlp-logo-navbar

SUTime is a library for recognizing and normalizing time expressions. SUTime is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility.

There is one more useful thing that we can do with CoreNLP – time extraction. The way we use CoreNLP here is pretty similar to the previous sample. First, we create an annotation pipeline and add all the required annotators to it. (Notice that this sample also uses the ! operator defined at the beginning of the post.)

let pipeline = AnnotationPipeline()
pipeline.addAnnotator(PTBTokenizerAnnotator(false))
pipeline.addAnnotator(WordsToSentencesAnnotator(false))

let tagger = MaxentTagger(! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger")
pipeline.addAnnotator(POSTaggerAnnotator(tagger))

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
let props = Properties()
props.setProperty("sutime.rules", sutimeRules ) |> ignore
props.setProperty("sutime.binders", "0") |> ignore
pipeline.addAnnotator(TimeAnnotator("sutime", props))

Now we are ready to annotate something. This part is essentially the same as in the previous sample.

let text = "Three interesting dates are 18 Feb 1997, the 20th of july and 4 days from today."
let annotation = Annotation(text)
annotation.set(CoreAnnotations.DocDateAnnotation().getClass(), "2013-07-14") |> ignore
pipeline.annotate(annotation)

And finally, we need to interpret the annotation results.

printfn "%O\n" (annotation.get(CoreAnnotations.TextAnnotation().getClass()))
let timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations().getClass()) :?> java.util.ArrayList
for cm in timexAnnsAll |> Seq.cast<CoreMap> do
    let tokens = cm.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.List
    let first = tokens.get(0)
    let last = tokens.get(tokens.size() - 1)
    let time = cm.get(TimeExpression.Annotation().getClass()) :?> TimeExpression
    printfn "%A [from char offset '%A' to '%A'] --> %A"
        cm first last (time.getTemporal())

The full code sample is available on GitHub; if you run it, you will see the following result:

18 Feb 1997 [from char offset ’18’ to ‘1997’] –> 1997-2-18
the 20th of july [from char offset ‘the’ to ‘July’] –> XXXX-7-20
4 days from today [from char offset ‘4’ to ‘today’] –> THIS P1D OFFSET P4D

C# Sample

C# samples are also available on GitHub.

Conclusion

This is a pretty awesome library. I hope you enjoy it. Try it out right now!

There are some other more specific Stanford packages that are already available on NuGet: