Introducing Clippit, get your slides out of PPTX.

15/03/2020 | 07/04/2021 | Blog

TL;DR

Clippit is a .NET Standard 2.0 library that lets you easily and efficiently extract all slides from a PPTX presentation into one-slide presentations, or compose slides back together into one presentation.

Why?

PowerPoint is still the most popular way to present information. Sales and marketing people regularly produce new presentations, and when they work on a new one they often reuse slides from previous ones. Here is the gap: to compose a presentation you need slides that can be reused, but the result of your work is a whole presentation, and you are generally not interested in “slide management”.

One of my projects is an enterprise search solution that helps people find Office documents across different enterprise storage systems. One of its cool features is that we let users find the particular relevant slide from a presentation, rather than a huge pptx with something relevant on slide 57.

How it was done before

Back in the day, Microsoft PowerPoint allowed us to “Publish slides” (save each individual slide into a separate file). I am absolutely sure that this button was in PowerPoint 2013, and as far as I know it was removed from the PowerPoint 365 and 2019 versions.

While this feature was in the box, you could use Microsoft.Office.Interop.PowerPoint.dll to start an instance of PowerPoint and communicate with it using COM Interop.

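	using Microsoft.Office.Core;
	using Microsoft.Office.Interop.PowerPoint;

	public void PublishSlides(string sourceFileName, string resultDirectory)
	{
	    Application ppApp = null;
	    try
	    {
	        // Start PowerPoint with alerts suppressed
	        ppApp = new Application
	        {
	            DisplayAlerts = PpAlertLevel.ppAlertsNone
	        };

	        // Open(FileName, ReadOnly, Untitled, WithWindow)
	        const MsoTriState msoFalse = MsoTriState.msoFalse;
	        var presentation = ppApp.Presentations.Open(sourceFileName,
	            msoFalse, msoFalse, msoFalse);

	        // Save every slide as a separate file in resultDirectory
	        presentation.PublishSlides(resultDirectory, true, true);
	        presentation.Close();
	    }
	    finally
	    {
	        // Always shut PowerPoint down, even if publishing failed
	        ppApp?.Quit();
	    }
	}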


But server-side automation of Office has never been recommended by Microsoft. You could never reliably scale it, for example by starting multiple instances to work with different documents, because you have to think about the active window and any of them may stop responding to your commands. There is also no guarantee that File->Open will return control to your program: if the file is password-protected, for example, Office shows a popup asking for the password.

That was hard, but doable. PowerPoint guarantees that published slides are valid PowerPoint documents that users will be able to open and preview. So it was worth going through the ceremony of single-threaded automation with retries, timeouts and process kills (when you do not get control back).

Over time it became clear that this is a dead end. We either need to keep an old version of PowerPoint on some VM and never touch it, or find a better way to do it.

The History

A Windows-only solution that requires MS Office installed on the machine and COM interop is not something you expect from a modern .NET solution.

Ideally it should be a .NET Standard library on NuGet that is platform-agnostic and able to solve your task anywhere, as fast as possible. But there was nothing like that on NuGet a few months ago.

If you have ever worked with Office documents from C#, you know that there is the OpenXml library, which opens an Office document, deserializes its internals into an object model, lets you modify it and then saves it back. But the OpenXml API is low-level, and you need to know a lot about OpenXml internals to correctly extract slides with images, embeddings, layouts, masters and cross-references into a new presentation.

If you google more, you will find the “Open-Xml-PowerTools” project, developed by Microsoft since 2015, which has never been officially released on NuGet. Currently this project is archived by the OfficeDev team, and the most actively maintained fork belongs to EricWhiteDev (no NuGet.org feed at this time).

Open-Xml-PowerTools has a feature called PresentationBuilder that was created for a similar purpose: composing slide ranges from multiple presentations into one presentation. After playing with this library, I realized that it does a great job but does not fully satisfy my requirements:

Resource usage is not efficient: the same streams are opened multiple times and not always properly disposed.
The library is much slower than it could be with proper resource management and less GC pressure.
It generates slides with a total size much larger than the original presentation, because it copies all layouts when only one is needed.
It does not properly name image parts inside a slide, corrupts file extensions and does not support SVG.

It was a great starting point, but I realized that I could improve it: not only fix all the issues mentioned above and improve performance 6x, but also add support for new image types, properly extract slide titles and convert them into presentation titles, propagate the modification date and erase metadata that does not belong to the slide.

How it is done today

So today I am ready to present the new library, Clippit. Technically it is a fork of the most recent version of EricWhiteDev/Open-Xml-PowerTools, extended and improved for one particular use case: extracting slides from a presentation and composing them back efficiently.

All classes were moved to the Clippit namespace, so you can load it side-by-side with any version of Open-Xml-PowerTools if you already use it.

The library is already available on NuGet. It is written in C# 8.0 (with nullable reference types), compiled for .NET Standard 2.0 and tested with .NET Core 3.1. It works perfectly on macOS/Linux and is battle-tested on hundreds of real-world PowerPoint presentations.

The new API is quite simple and easy to use:

	var presentation = new PmlDocument(sourceFile);

	// Publish slides
	var slides = PresentationBuilder.PublishSlides(presentation).ToList();

	// Save slides into files
	foreach (var slide in slides)
	{
	    var targetPath = Path.Combine(targetDir, Path.GetFileName(slide.FileName));
	    slide.SaveAs(targetPath);
	}

	// Compose slides back into one presentation
	var sources = slides.Select(x => new SlideSource(x, keepMaster: true)).ToList();
	PresentationBuilder.BuildPresentation(sources)
	    .SaveAs(newFileName);


P.S. Stay tuned, there will be more OpenXml goodness.

HashiCorp Vault and TLS Certificate Authentication for .NET Applications (Comprehensive guide)

08/10/2019 | 07/04/2021 | Blog

HashiCorp Vault is a tool for secrets management, encryption as a service, and privileged access management. It is quite popular nowadays, especially if you run your own infrastructure or private cloud, or just cannot store your secrets using the Key Vault services provided by Azure/AWS/GCP.

I assume that you already have an instance of HashiCorp Vault up and running; otherwise you may install one using the official Installing Vault guide.

Why TLS certificate authentication?

Vault supports many Auth Methods. But what if you are still deploying your app on plain old Windows Server VMs or developing a SharePoint application (like I am 😝)?

The challenge in this case is that you have to authenticate in Vault in order to get a secret. This means that we need to choose an auth method that protects our auth secrets from IT staff who may happen to log in to the VM (or from malicious code that may find them on the file system).

TLS Certificate Auth is a good candidate, because we can install the certificate into the Windows certificate store, protect the private key (mark it as non-exportable) and even specify the list of service accounts allowed to use this certificate for authentication.

TLS certificate generation

I will be using the command line on macOS for certificate generation and Vault configuration, but you can certainly repeat the same steps on Windows.

For our needs we will use a self-signed certificate. You can generate one using OpenSSL. If you do not have OpenSSL installed, you can install it with Homebrew.

brew install openssl

First of all, we generate a private key (it is highly sensitive, do not share it):

openssl genrsa 2048 > vault_private.pem

Then we generate the public part, a self-signed certificate in .pem format (this .pem file will be uploaded to Vault for client validation during authentication):

openssl req -x509 -new -key vault_private.pem -out vault_public.pem -days 365

Answer all the questions properly; it will help you identify this certificate in the future (I’ve created a certificate that is valid for 365 days, but you should follow the security standards defined in your company).


Note: Common Name cannot be empty, otherwise you will not be able to use this certificate to retrieve the secret (Vault returns ‘missing name in alias’ error). Thank you Vadzim Makarchyk for this note.

The final step is to pack both parts into .pfx format (the .pfx file will be deployed into the Windows Server certificate store on all machines from which our code should have access to Vault):

openssl pkcs12 -export -in vault_public.pem -inkey vault_private.pem -out vault.pfx


Remember the password entered during *.pfx creation; you’re gonna need it every time you install the certificate on a Windows machine.

Vault configuration

In order to configure HashiCorp Vault we will use the Vault CLI, which can be installed with Homebrew on macOS.

brew install vault

Vault CLI uses environment variables for configuration. My Vault server is hosted on a different machine, so I need to provide the server URL.

VAULT_ADDR=https://my.server.com:8200

export VAULT_ADDR

I use the Enterprise version of Vault, which is shared by several teams; that is why I also specify a namespace (aka a folder for my secrets).

VAULT_NAMESPACE=dev/my-team

export VAULT_NAMESPACE

I am too lazy to properly set up certificates for Vault CLI, which is why I skip certificate validation (never do this in production 😉).

VAULT_SKIP_VERIFY=true

export VAULT_SKIP_VERIFY

We are almost ready to log in. The easiest option is to log in using the Web UI and then reuse the issued token in the terminal. Log in using your favorite browser, pass authentication and copy the token to the clipboard.


vault login s.fJTY5S51oIfXKnBAG3Qq5eWp.9GKyY

That is it! Token is saved into ~/.vault-token and CLI is ready to use!

Key/Value secret engine creation

Vault supports multiple Secret Engines, but for our demo we create simple Key/Value storage for secrets (for example to store logins and passwords)

vault secrets enable -path=kv kv

This command enables the key/value engine (V1) and mounts it under the name kv (the -path parameter).

NOTE: The kv secrets engine has two versions: kv and kv-v2. To enable versioned kv secrets engine, pass kv-v2 instead.

Engine is ready, but it is empty – let’s fix it.

vault write kv/my-secret value="s3c(eT"

This command effectively creates the my-secret secret inside the kv secret engine and stores one key/value pair in it: value="s3c(eT".

ACL Policy creation

The secret engine is secured: nobody (except you, the admin) has access to the secrets. We need to create a policy that defines what access we want to provide. Create a new file policy-file.hcl and put the following content inside.

path "kv/*" {
  capabilities = ["read", "list"]
}

This policy allows reading and listing all secrets inside the kv secret engine. All users with this policy will be able to read secrets from our engine. Read more about policies.

Write this policy to the server (and name it policy-name)

vault policy write policy-name policy-file.hcl

TLS Certificates – Auth Method

The last step is to assign this policy. But we want to assign it to all clients authenticated in Vault using TLS certificate created by us earlier.

First of all, we need to enable certificate authentication in our namespace

vault auth enable cert

and register the certificate in Vault (name it app), assign policy-name to it and upload the public part of the generated key (vault_public.pem)

vault write auth/cert/certs/app policies=policy-name certificate=@vault_public.pem

That is it! Vault is configured and waiting for first connection.

TLS certificate deployment

A TLS certificate lets us control access in two steps: we deploy it only to the set of machines that should have access to Vault, and then specify which accounts (on these machines) may use it for authentication.

If you are lucky and your deployment is automated, you can add one more step to your deployment process that ensures the certificate is provisioned on all target machines. Octopus Deploy is one such tool; it provides a built-in template for certificate provisioning. (BTW, it is free for small teams starting from Sept 2, 2019.)

On the screenshot you see the step that imports the certificate on all target machines with the tag SharePoint (in my case) into the My/Personal store of the LocalMachine certificate store, marks the private key as non-exportable and grants access to the private key to 2 service accounts.

If your deployment is not automated, you may script the same steps using PowerShell and run it on all machines.

	# Import certificate to local machine personal folder
	$root = Set-Location -PassThru $PSScriptRoot

	$cert = Get-ChildItem -Path $root | where {$_.Extension -like "*.pfx"}

	$PlainTextPass = Read-Host -Prompt "Type .pfx password for '$cert' certificate"
	$pfxpass = $PlainTextPass | ConvertTo-SecureString -AsPlainText -Force

	# Note: omit -Exportable if the private key should stay non-exportable
	$cert = $cert | Import-PfxCertificate -CertStoreLocation Cert:\LocalMachine\My -Exportable -Password $pfxpass
	Write-Host "Certificate is imported"


	# Grant permission to selected accounts on the private key file in the MachineKeys folder
	$fileName = $cert.PrivateKey.CspKeyContainerInfo.UniqueKeyContainerName
	$path = "$env:ALLUSERSPROFILE\Microsoft\Crypto\RSA\MachineKeys\$fileName"

	function SetPermissions([string[]]$accountNames)
	{
	    $acl = Get-Acl -Path $path

	    # Add the new user and preserve all current permissions: SetAccessRuleProtection(False, X)
	    # Add the new user and remove all inherited permissions: SetAccessRuleProtection(True, False)
	    # Add the new user and convert all inherited permissions to explicit permissions: SetAccessRuleProtection(True, True)
	    $acl.SetAccessRuleProtection($True, $False)

	    foreach ($accountName in $accountNames) {
	        $rule = New-Object System.Security.AccessControl.FileSystemAccessRule($accountName, "FullControl", "Allow")
	        $acl.AddAccessRule($rule)
	    }
	    Set-Acl -Path $path -AclObject $acl

	    Write-Host "Access to certificate is granted for $accountNames"
	}

	SetPermissions(@(
	    "me@sergeytihon.com",
	    "you@sergeytihon.com"
	))


If you are brave, you can even click through it manually! 🙈

Double-click the vault.pfx file and choose the LocalMachine store location.
Click Next, Next, type the password used during *.pfx creation and click Next again.
Choose the Personal certificate store.
Click Next, Finish, OK: your certificate is in the store!
Run mmc (Microsoft Management Console) from the Start menu.
File -> Add/Remove Snap-in …
Certificates, Add, Computer account and click Next & OK.
Find our certificate and click Manage Private Keys…
On this screen you can manage the list of accounts that will be able to use this certificate for authentication on the current machine.

.NET client application

Vault is ready, the machine is ready (the service account / current user is allowed to use the certificate from the LocalMachine/Personal store). Only a few lines of code separate us from success 😊.

I will use the VaultSharp NuGet package. It is more or less up to date, it supports the namespaces feature, and starting from the next release the usage of namespaces will become even more intuitive.

	using System;
	using System.Collections.Generic;
	using System.Data;
	using System.Linq;
	using System.Threading.Tasks;
	using System.Security.Cryptography.X509Certificates;
	using VaultSharp;
	using VaultSharp.V1.AuthMethods.Cert;

	namespace SergeyTihon.App.Configuration
	{
	    public class VaultSecretProvider
	    {
	        public VaultSecretProvider(string vaultUrl, string vaultNamespace, string certificateThumbprint)
	        {
	            var clientCertificate = GetCertificate(certificateThumbprint);
	            var authMethod = new CertAuthMethodInfo(clientCertificate);

	            _vaultClient = new VaultClient(new VaultClientSettings(vaultUrl, authMethod)
	            {
	                // Add the namespace header to every request sent to Vault.
	                BeforeApiRequestAction = (httpClient, httpRequestMessage) =>
	                {
	                    httpRequestMessage.Headers.Add("X-Vault-Namespace", vaultNamespace);
	                }
	            });
	        }

	        public static X509Certificate2 GetCertificate(string certThumbprint)
	        {
	            var store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
	            store.Open(OpenFlags.ReadOnly);
	            var certCollection = store.Certificates;

	            // Find unexpired certificates.
	            var currentCerts = certCollection.Find(X509FindType.FindByTimeValid, DateTime.Now, false);

	            // From the collection of unexpired certificates, find the ones with the correct thumbprint.
	            var signingCert = currentCerts.Find(X509FindType.FindByThumbprint, certThumbprint, false);

	            // Return the most recently issued certificate that has the right thumbprint and is current.
	            var cert = signingCert.OfType<X509Certificate2>().OrderByDescending(c => c.NotBefore).FirstOrDefault();
	            store.Close();

	            if (cert is null)
	            {
	                throw new DataException($"Cannot find valid certificate with thumbprint {certThumbprint}");
	            }
	            return cert;
	        }

	        private readonly VaultClient _vaultClient;

	        public async Task<Dictionary<string, object>> GetValue(string path, string mountPoint)
	        {
	            var secret = await _vaultClient.V1.Secrets.KeyValue.V1.ReadSecretAsync(path, mountPoint);
	            return secret.Data;
	        }
	    }
	}


VaultSecretProvider finds the X509 certificate in StoreName.My / StoreLocation.LocalMachine, then creates a CertAuthMethodInfo from that certificate and a VaultClient that adds the X-Vault-Namespace header with the vaultNamespace name to each request.

Using the configured instance of VaultClient, we can request our secret from Vault with _vaultClient.V1.Secrets.KeyValue.V1.ReadSecretAsync(path, mountPoint), specifying the path to the secret and the mountPoint (the name of the secret engine).

We are ready to call it and receive secrets:

	var secret = await new VaultSecretProvider(
	    "https://my.server.com:8200", // VAULT_ADDR
	    "dev/my-team", // VAULT_NAMESPACE
	    "877501d5a018e9344088fd5c89580f6b095f5326" // vault.pfx certificate thumbprint
	).GetValue("my-secret", "kv"); // path to the secret: vault write kv/my-secret value="s3c(eT"

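The call above needs the certificate thumbprint, and one way to get it is to read it straight from the vault.pfx file. Below is a minimal sketch using only the base class library; ThumbprintHelper and ReadThumbprint are hypothetical names used for illustration and are not part of VaultSharp.

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

// Hypothetical helper, not part of VaultSharp.
public static class ThumbprintHelper
{
    // Loads a .pfx file and returns the certificate thumbprint as a hex string.
    public static string ReadThumbprint(string pfxPath, string password)
    {
        // On recent .NET versions this constructor is marked obsolete in favor of
        // X509CertificateLoader; it is used here because it also works on .NET Framework.
        using (var certificate = new X509Certificate2(pfxPath, password))
        {
            return certificate.Thumbprint;
        }
    }
}
```

For example, ThumbprintHelper.ReadThumbprint("vault.pfx", password) returns the value to pass as certificateThumbprint, where password is the one entered during *.pfx creation.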

Conclusion

Wow, this became a long read, but I hope it was a good one.

TLS certificate authentication in Vault is a good option for apps that use the full .NET Framework and run inside Windows Server VMs.

Just do not forget to renew/replace certificates regularly.

Be better WPF / MvvmLight developer in 2018

16/04/2018 | 07/04/2021 | Blog

It is 2018, the time of .NET Core, x-plat, clouds, microservices, blockchain and shine of JavaScript. But, there are guys, like me, who still maintain and sometimes develop classic .NET desktop applications for Windows.

I am not a WPF expert, but I spent a couple of days reviewing, testing and fixing one of our desktop apps, and I definitely learned a couple of new tips & tricks that are worth sharing with other non-experts.

Part #1: General Tips

Tip #1.1: Choose right library/framework

It so happened that we use MvvmLight. The library is lightweight and has existed for almost 10 years; the MVVM pattern is well-known and lets us keep the solution code reasonably well-structured.

But this is definitely not the only choice. There are many other libraries and frameworks with different purposes that may suit you better, especially if you do green-field development. So choose carefully.

Tip #1.2: Distribute using Squirrel.Windows

Installers have always been hard, and a seamless auto-update process is even harder. But today we have a solution that works for simple user-oriented applications that don’t do crazy things during installation. This solution is Squirrel.Windows, an installation and update framework for Windows desktop apps, designed for C# apps.

It’s definitely worth learning it once and using it for all the apps that you develop.

Tip #1.3: Think about monitoring

Aggregated analytics from users’ machines are priceless for successful apps. There is plenty of data that can help you deliver better apps:

Crash reports
Application version distribution
User count / active users
Performance / integrations tracking
Custom events / logs

It is not always possible to collect all kinds of data from user’s machine, but do it if you can. There are a couple of services that may help you, like Application Insights, HockeyApp and others.

Tip #1.4: Use the full power of IDE

Learn the tools that MS baked for you and use them.

Tip #1.5: Debug Data Binding Issues

When data bindings do not play nice, you can debug them. It is not super intuitive, but there are ways to step into the binding process to better figure out what is actually going on. Check this nice article from Mike Woelmer: How To Debug Data Binding Issues in WPF.

Part #2: MVVM Light – Code Tips

C# quickly evolves over time, more and more features become available to us. It is not always obvious how to use new async code with an old API.

Tip #2.1: “New” INotifyPropertyChanged syntax

I think almost every WPF developer knows how to implement the INotifyPropertyChanged interface:

using System.ComponentModel;

public class MyViewModel : INotifyPropertyChanged
{
    private string _isBusy;
    public event PropertyChangedEventHandler PropertyChanged;

    public MyViewModel() {}

    public string IsBusy
    {
        get { return _isBusy; }
        set
        {
            _isBusy = value;
            OnPropertyChanged("IsBusy");
        }
    }

    // Raise PropertyChanged only when somebody is subscribed
    protected void OnPropertyChanged(string name)
    {
        if (PropertyChanged == null)
            return;
        PropertyChanged(this,
            new PropertyChangedEventArgs(name));
    }
}

Using MVVM Light you can make it way shorter (this syntax has probably existed for a while, but I discovered it only recently):

using GalaSoft.MvvmLight;

public class MyViewModel : ViewModelBase
{
    public MyViewModel() {}

    private string _isBusy;
    public string IsBusy
    {
        get => _isBusy;
        set { Set(() => IsBusy, ref _isBusy, value); }
    }
}

All property change events are raised under the hood of the Set method. Also, Set returns true when the value has changed, so you can use it to perform additional actions on property change.

private string _isBusy;
public string IsBusy
{
    get => _isBusy;
    set {
        if (Set(() => IsBusy, ref _isBusy, value)) {
        // Do whatever you need on update
        }
    }
}

Update from Chris Jobson:

There are overloads that allow us to omit first argument – propertyExpression. In this case [CallerMemberName] will be used as the property name, so the code will be even shorter. Not bad for 2018 =)

private string _isBusy;
public string IsBusy
{
    get => _isBusy;
    set => Set(ref _isBusy, value);
}
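To make the mechanics less magical, here is a minimal sketch of how such a Set helper can be built with only the base class library. ObservableBase, DemoViewModel and SetDemo are hypothetical names used for illustration; this is not MVVM Light's actual implementation, only the same idea: compare, assign, raise PropertyChanged with the caller's name and report whether anything changed.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.CompilerServices;

// Hypothetical base class illustrating the idea behind ViewModelBase.Set.
public abstract class ObservableBase : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    // Returns true only when the stored value actually changed.
    protected bool Set<T>(ref T field, T value, [CallerMemberName] string propertyName = null)
    {
        if (EqualityComparer<T>.Default.Equals(field, value))
            return false; // same value: no assignment, no event

        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        return true;
    }
}

public class DemoViewModel : ObservableBase
{
    private bool _isBusy;

    public bool IsBusy
    {
        get => _isBusy;
        set => Set(ref _isBusy, value); // the compiler passes "IsBusy" as propertyName
    }
}

public static class SetDemo
{
    // Returns the names of the properties for which PropertyChanged was raised.
    public static List<string> Run()
    {
        var raised = new List<string>();
        var vm = new DemoViewModel();
        vm.PropertyChanged += (sender, e) => raised.Add(e.PropertyName);

        vm.IsBusy = true; // value changes: event raised
        vm.IsBusy = true; // same value: nothing raised
        return raised;
    }
}
```

The equality check is what makes the bool return value useful: assigning the same value twice raises the event only once.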

Tip #2.2: Async to Action glue

C# async was designed to stay compatible with older APIs that consume an Action or another delegate. This is also one of the reasons why async void exists in the language, although we should always use async Task in our own code.

The following two assignments are valid:

Action task = async () => await Task.Yield();
Func<Task> task2 = async () => await Task.Yield();

Read “Do async lambdas return Tasks?” to better understand what’s actually going on here. It means that you can pass your async method as Action to RelayCommand.

new RelayCommand(async() => await Download());

TBH, you should use it like this (explanation in the next tip)

new RelayCommand(async() => await Download(), keepTargetAlive:true);
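The practical difference between the two delegate types is easy to miss, so here is a small sketch that uses only the base class library. AsyncLambdaDemo is a hypothetical name; the point is that an async lambda stored as an Action becomes async void and gives the caller nothing to await, while the same lambda stored as a Func<Task> hands back a Task.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical demo class, for illustration only.
public static class AsyncLambdaDemo
{
    public static async Task<string> RunAsync()
    {
        var gate = new TaskCompletionSource<bool>();
        var log = "";

        // Assigned to Action, the lambda compiles to async void:
        // the caller gets nothing back and cannot await completion.
        Action fireAndForget = async () => { await gate.Task; };

        // Assigned to Func<Task>, the same shape of lambda returns a Task.
        Func<Task> awaitable = async () => { await gate.Task; log += "done;"; };

        fireAndForget();            // returns at the first incomplete await
        Task pending = awaitable(); // also returns early, but we keep the Task
        log += pending.IsCompleted ? "completed-early;" : "pending;";

        gate.SetResult(true);       // let both lambdas continue
        await pending;              // only possible for the Func<Task> version
        log += "awaited";
        return log;
    }
}
```

This is why passing an async lambda to an Action-based API such as RelayCommand works, but the command itself never learns when (or whether) the work finished.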

Tip #2.3: Do not use lambdas with RelayCommand

Using lambdas as a parameter for RelayCommand is a bad idea unless you know what can go wrong and use the latest version of MvvmLight.

Actually, I have spent almost 2 days of my life figuring out why, at some point, several buttons in our application stopped working, even though all commands defined in the ViewModel are read-only and assigned once in the constructor.

We had simple commands that do some trivial actions on click, so the developer decided to use a lambda in the command declaration to save space and simplify the code.

new RelayCommand(() => IsBusy = true);

The code looks simple and correct, but under the hood RelayCommand stores only a weak reference to the delegate, and any GC cycle can collect the local lambda function. So at some point (after the next GC cycle) RelayCommand may not find the delegate to call, and nothing will happen after the click. For a deeper analysis of this behavior, you can read “RelayCommands and WeakFuncs”.
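The effect can be reproduced without WPF. The sketch below is a simplified stand-in, not MvvmLight's actual code (its WeakAction is more involved); WeakCommand and WeakCommandDemo are hypothetical names. It only assumes what is described above: the command holds a weak reference to the delegate unless it is asked to keep the target alive.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a command that keeps only a weak reference to its delegate.
public sealed class WeakCommand
{
    private readonly WeakReference<Action> _weak;
    private readonly Action _strong; // set only when keepTargetAlive is true

    public WeakCommand(Action execute, bool keepTargetAlive = false)
    {
        _weak = new WeakReference<Action>(execute);
        if (keepTargetAlive) _strong = execute;
    }

    // Returns false when the delegate has already been collected.
    public bool TryExecute()
    {
        if (_weak.TryGetTarget(out var action))
        {
            action();
            return true;
        }
        return false;
    }
}

public static class WeakCommandDemo
{
    // Built in a separate, non-inlined method so that no local variable
    // of the caller keeps the lambda alive.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakCommand Create(bool keepTargetAlive)
    {
        var clicks = 0;
        return new WeakCommand(() => clicks++, keepTargetAlive);
    }

    public static bool ExecuteAfterGc(bool keepTargetAlive)
    {
        var command = Create(keepTargetAlive);
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return command.TryExecute();
    }
}
```

With keepTargetAlive set to true the command always finds its delegate. With false, the delegate is typically gone after the collection, although the exact timing depends on the runtime and build configuration, which is exactly what makes this kind of bug so hard to track down.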

At the time of writing this post, the issues in the MvvmLight library were fixed (Using RelayCommand and Messenger (and WeakAction) with closures) and released in version 5.4.1. But the fix is not applied by default.

If you really want to use lambdas with RelayCommand & Messenger, you should manually set keepTargetAlive:true (false by default), but it is probably better not to use them at all.

new RelayCommand(() => IsBusy = true, keepTargetAlive:true);

P.S. It is worth mentioning that Laurent Bugnion has a course on Pluralsight, “MVVM Light Toolkit Fundamentals”, that provides a detailed MVVM Light overview.

ASP.NET MVC with Simple Windows Authorization

25/01/2017 | 25/02/2021 | Tips and Tricks

A lot of enterprises use Active Directory (AD) to manage user accounts and Security Groups to manage access to resources.

So (I think) it is a common task: you want to create some internal resource that provides certain functionality for your team, but you do not want to expose your data outside. We can easily enable Windows authentication, but usually we also need to add authorization (limit access to certain groups).

The task is simple, but for some reason it is hard to find a manual for it. The steps are as follows:

Enable Windows authentication in web.config
Add WindowsTokenRoleProvider that transforms all Security Groups to ASP.NET Roles
Configure Authorization rules based on roles
Disable anonymous authentication for IIS Express.

Changes in Web.config:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  ...
  <system.web>
    ...
    <authentication mode="Windows" />
    <authorization>
      <allow roles="DOMAIN\MyTeam" />
      <deny users="*"/>
    </authorization>
    <roleManager cacheRolesInCookie="false" defaultProvider="WindowsProvider" enabled="true">
      <providers>
        <clear />
        <add name="WindowsProvider" type="System.Web.Security.WindowsTokenRoleProvider" applicationName="/" />
      </providers>
    </roleManager>
  </system.web>
  ...
</configuration>

Changes in project file:

<Project ToolsVersion="12.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <Configuration Condition=" '$(Configuration)' == '' ">Debug</Configuration>
    <Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform>
    ...
    <TargetFrameworkVersion>v4.6.1</TargetFrameworkVersion>
    <UseIISExpress>true</UseIISExpress>
    <IISExpressSSLPort />
    <IISExpressAnonymousAuthentication>disabled</IISExpressAnonymousAuthentication>
    <IISExpressWindowsAuthentication>enabled</IISExpressWindowsAuthentication>
    <IISExpressUseClassicPipelineMode />
    <UseGlobalApplicationHostFile />
    ...
  </PropertyGroup>
  ...

P.S. You can also use security groups to restrict access to Controllers/Views based on roles (AuthorizeAttribute).
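As a rough illustration of that P.S., the sketch below applies AuthorizeAttribute from System.Web.Mvc at controller and action level. ReportsController and the DOMAIN\MyTeamAdmins group are made-up names; DOMAIN\MyTeam is the group from the Web.config sample, and the roles are resolved by the WindowsTokenRoleProvider configured there.

```csharp
using System.Web.Mvc;

// Hypothetical controller: every action requires membership in DOMAIN\MyTeam.
[Authorize(Roles = @"DOMAIN\MyTeam")]
public class ReportsController : Controller
{
    public ActionResult Index()
    {
        return View();
    }

    // A narrower rule for a single action; DOMAIN\MyTeamAdmins is a made-up group.
    // Both the controller-level and the action-level rule must pass.
    [Authorize(Roles = @"DOMAIN\MyTeamAdmins")]
    public ActionResult Settings()
    {
        return View();
    }
}
```

This needs a reference to ASP.NET MVC, so it only compiles inside the web project itself.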

Why I wish C# never got async/await

01/06/2014 | 28/10/2015 | F#

Absolutely the same feelings

Stanford CoreNLP is available on NuGet for F#/C# devs

26/10/2013 | 25/02/2021 | F#, Machine Learning and NLP

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP integrates all Stanford NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system, and provides model files for analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.

Stanford CoreNLP is here and available on NuGet. It is probably the most powerful of all The Stanford NLP Group software packages. Please read the usage overview on the Stanford CoreNLP home page to understand what it can do, how you can configure an annotation pipeline, what steps are available to you, what models you need to have and so on.

I want to say thank you to Anonymous 😉 and @OneFrameLink for their contribution and stimulating me to finish this work.

Please follow these steps to get started:

Install-Package Stanford.NLP.CoreNLP
Download models from The Stanford NLP Group site.
Extract models from stanford-corenlp-3.2.0-models.jar and remember new folder location. (Unzip archive)
You are ready to start.

Before using Stanford CoreNLP, we need to define and specify annotation pipeline. For example, annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref.

The next thing we need to do is to create a StanfordCoreNLP pipeline. But to instantiate a pipeline, we need to specify all required properties, or at least the paths to all models used by the pipeline that are specified in the annotators string. Before starting with the samples, let’s define some helpers that will be used across all source code pieces: jarRoot is the path to the folder where we extracted the files from stanford-corenlp-3.2.0-models.jar; modelsRoot is the path to the folder with all model files; ‘!’ is an overloaded operator that converts a model name to the path of the model file.

let (@@) a b = System.IO.Path.Combine(a,b)
let jarRoot = __SOURCE_DIRECTORY__ @@ @"..\..\temp\stanford-corenlp-full-2013-06-20\stanford-corenlp-3.2.0-models\"
let modelsRoot = jarRoot @@ @"edu\stanford\nlp\models\"
let (!) path = modelsRoot @@ path

Now we are ready to instantiate the pipeline, but we need to do a small trick. The pipeline is configured to use default model files (for simplicity) and all paths are specified relative to the root of stanford-corenlp-3.2.0-models.jar. To make things easier, we can temporarily change the current directory to jarRoot, instantiate the pipeline and then change the current directory back. This trick dramatically decreases the number of code lines.

let props = Properties()
props.setProperty("annotators","tokenize, ssplit, pos, lemma, ner, parse, dcoref") |> ignore
props.setProperty("sutime.binders","0") |> ignore

let curDir = System.Environment.CurrentDirectory
System.IO.Directory.SetCurrentDirectory(jarRoot)
let pipeline = StanfordCoreNLP(props)
System.IO.Directory.SetCurrentDirectory(curDir)
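For C# readers, the same current-directory trick can be wrapped into a small disposable scope so that the directory is restored even if pipeline creation throws. This is a minimal sketch using only the base class library; WorkingDirectoryScope and WorkingDirectoryDemo are hypothetical names and not part of the Stanford.NLP.CoreNLP package. Keep in mind that the current directory is process-wide, so this is not safe to run concurrently with other code that depends on it.

```csharp
using System;
using System.IO;

// Hypothetical helper: switches the current directory and restores it on Dispose.
public sealed class WorkingDirectoryScope : IDisposable
{
    private readonly string _previous;

    public WorkingDirectoryScope(string path)
    {
        _previous = Directory.GetCurrentDirectory();
        Directory.SetCurrentDirectory(path);
    }

    public void Dispose() => Directory.SetCurrentDirectory(_previous);
}

public static class WorkingDirectoryDemo
{
    // Runs 'create' with 'root' as the current directory, then switches back,
    // even if 'create' throws.
    public static T CreateIn<T>(string root, Func<T> create)
    {
        using (new WorkingDirectoryScope(root))
        {
            return create();
        }
    }
}
```

With the package referenced, usage would look like WorkingDirectoryDemo.CreateIn(jarRoot, () => new StanfordCoreNLP(props)).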

However, you do not have to rely on the current-directory trick. You can configure all models manually. The number of properties (especially paths to models) that you need to specify depends on the annotators value. Let’s assume for a moment that we are in the Java world and we want to configure our pipeline in a custom way. Especially for this case, stanford-corenlp-3.2.0-models.jar contains StanfordCoreNLP.properties (you can find it in the folder with the extracted files), where you can specify new property values outside of code. Most of the properties that we need for configuration are already mentioned in this file, and you can easily understand what is what. But that is not enough to get it to work; you also need to look into the source code of Stanford CoreNLP. By the way, some days ago Stanford moved the CoreNLP source code to GitHub, so now it is much easier to browse. Default paths to the models are specified in the DefaultPaths.java file, property keys are listed in the Constants.java file, and information about which path matches which property name is contained in Dictionaries.java. Thus, you are able to dive deeper into pipeline configuration and do whatever you want. For lazy people, I already have a working sample.

let props = Properties()
let (<==) key value = props.setProperty(key, value) |> ignore
"annotators"    <== "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
"pos.model"     <== ! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger"
"ner.model"     <== ! @"ner\english.all.3class.distsim.crf.ser.gz"
"parse.model"   <== ! @"lexparser\englishPCFG.ser.gz"

"dcoref.demonym"            <== ! @"dcoref\demonyms.txt"
"dcoref.states"             <== ! @"dcoref\state-abbreviations.txt"
"dcoref.animate"            <== ! @"dcoref\animate.unigrams.txt"
"dcoref.inanimate"          <== ! @"dcoref\inanimate.unigrams.txt"
"dcoref.male"               <== ! @"dcoref\male.unigrams.txt"
"dcoref.neutral"            <== ! @"dcoref\neutral.unigrams.txt"
"dcoref.female"             <== ! @"dcoref\female.unigrams.txt"
"dcoref.plural"             <== ! @"dcoref\plural.unigrams.txt"
"dcoref.singular"           <== ! @"dcoref\singular.unigrams.txt"
"dcoref.countries"          <== ! @"dcoref\countries"
"dcoref.extra.gender"       <== ! @"dcoref\namegender.combine.txt"
"dcoref.states.provinces"   <== ! @"dcoref\statesandprovinces"
"dcoref.singleton.predictor"<== ! @"dcoref\singleton.predictor.ser"

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
"sutime.rules"      <== sutimeRules
"sutime.binders"    <== "0"

let pipeline = StanfordCoreNLP(props)

As you can see, this option is much longer and harder to get right. I recommend using the first one, especially if you do not need to change the default configuration.
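The manual configuration relies on a ! prefix operator that turns a model-relative path into a full one. Its definition is not shown in this part of the post, so the following is only a sketch of what such an operator might look like. The modelsRoot value is an assumption about where the models jar was extracted, not a path taken from the original sample.

```fsharp
open System.IO

// Assumption: the models jar was extracted into this folder.
// Adjust it to your own layout.
let modelsRoot =
    Path.Combine("stanford-corenlp-3.2.0-models", "edu", "stanford", "nlp", "models")

// Prefix operator that resolves a model-relative path against modelsRoot.
// (A sketch; the operator used in the post may be defined differently.)
let (!) (relativePath: string) = Path.Combine(modelsRoot, relativePath)

let posModel = ! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger"
printfn "%s" posModel
```

Note that the backslashes in the relative paths are directory separators on Windows only; on other platforms they would have to be replaced with forward slashes.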

And now the fun part. Everything else is pretty easy: we create an annotation from the text, pass it through the pipeline and interpret the results.

let text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";

let annotation = Annotation(text)
pipeline.annotate(annotation)
use stream = new ByteArrayOutputStream()
pipeline.prettyPrint(annotation, new PrintWriter(stream))
printfn "%O" (stream.toString())

Of course, you can also extract all the processing results from the annotated text.

let customAnnotationPrint (annotation:Annotation) =
    printfn "-------------"
    printfn "Custom print:"
    printfn "-------------"
    let sentences = annotation.get(CoreAnnotations.SentencesAnnotation().getClass()) :?> java.util.ArrayList
    for sentence in sentences |> Seq.cast<CoreMap> do
        printfn "\n\nSentence : '%O'" sentence

        // Everything below uses `sentence`, so it must stay inside the loop body.
        let tokens = sentence.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.ArrayList
        for token in (tokens |> Seq.cast<CoreLabel>) do
            let word = token.get(CoreAnnotations.TextAnnotation().getClass())
            let pos  = token.get(CoreAnnotations.PartOfSpeechAnnotation().getClass())
            let ner  = token.get(CoreAnnotations.NamedEntityTagAnnotation().getClass())
            printfn "%O \t[pos=%O; ner=%O]" word pos ner

        printfn "\nTree:"
        let tree = sentence.get(TreeCoreAnnotations.TreeAnnotation().getClass()) :?> Tree
        use stream = new ByteArrayOutputStream()
        tree.pennPrint(new PrintWriter(stream))
        printfn "The first sentence parsed is:\n %O" (stream.toString())

        printfn "\nDependencies:"
        let deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation().getClass()) :?> SemanticGraph
        for edge in deps.edgeListSorted().toArray() |> Seq.cast<SemanticGraphEdge> do
            let gov = edge.getGovernor()
            let dep = edge.getDependent()
            printfn "%O(%s-%d,%s-%d)"
                (edge.getRelation())
                (gov.word()) (gov.index())
                (dep.word()) (dep.index())

The full code sample is available on GitHub. If you run it, you will see the following result:

Sentence #1 (9 tokens):
Kosgi Santosh sent an email to Stanford University.
[Text=Kosgi CharacterOffsetBegin=0 CharacterOffsetEnd=5 PartOfSpeech=NNP Lemma=Kosgi NamedEntityTag=PERSON] [Text=Santosh CharacterOffsetBegin=6 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Santosh NamedEntityTag=PERSON] [Text=sent CharacterOffsetBegin=14 CharacterOffsetEnd=18 PartOfSpeech=VBD Lemma=send NamedEntityTag=O] [Text=an CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=email CharacterOffsetBegin=22 CharacterOffsetEnd=27 PartOfSpeech=NN Lemma=email NamedEntityTag=O] [Text=to CharacterOffsetBegin=28 CharacterOffsetEnd=30 PartOfSpeech=TO Lemma=to NamedEntityTag=O] [Text=Stanford CharacterOffsetBegin=31 CharacterOffsetEnd=39 PartOfSpeech=NNP Lemma=Stanford NamedEntityTag=ORGANIZATION] [Text=University CharacterOffsetBegin=40 CharacterOffsetEnd=50 PartOfSpeech=NNP Lemma=University NamedEntityTag=ORGANIZATION] [Text=. CharacterOffsetBegin=50 CharacterOffsetEnd=51 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (NNP Kosgi) (NNP Santosh))
(VP (VBD sent)
(NP (DT an) (NN email))
(PP (TO to)
(NP (NNP Stanford) (NNP University))))
(. .)))

nn(Santosh-2, Kosgi-1)
nsubj(sent-3, Santosh-2)
root(ROOT-0, sent-3)
det(email-5, an-4)
dobj(sent-3, email-5)
nn(University-8, Stanford-7)
prep_to(sent-3, University-8)

Sentence #2 (7 tokens):
He didn’t get a reply.
[Text=He CharacterOffsetBegin=52 CharacterOffsetEnd=54 PartOfSpeech=PRP Lemma=he NamedEntityTag=O] [Text=did CharacterOffsetBegin=55 CharacterOffsetEnd=58 PartOfSpeech=VBD Lemma=do NamedEntityTag=O] [Text=n’t CharacterOffsetBegin=58 CharacterOffsetEnd=61 PartOfSpeech=RB Lemma=not NamedEntityTag=O] [Text=get CharacterOffsetBegin=62 CharacterOffsetEnd=65 PartOfSpeech=VB Lemma=get NamedEntityTag=O] [Text=a CharacterOffsetBegin=66 CharacterOffsetEnd=67 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=reply CharacterOffsetBegin=68 CharacterOffsetEnd=73 PartOfSpeech=NN Lemma=reply NamedEntityTag=O] [Text=. CharacterOffsetBegin=73 CharacterOffsetEnd=74 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP (PRP He))
(VP (VBD did) (RB n’t)
(VP (VB get)
(NP (DT a) (NN reply))))
(. .)))

nsubj(get-4, He-1)
aux(get-4, did-2)
neg(get-4, n’t-3)
root(ROOT-0, get-4)
det(reply-6, a-5)
dobj(get-4, reply-6)

Coreference set:
(2,1,[1,2)) -> (1,2,[1,3)), that is: “He” -> “Kosgi Santosh”
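The coreference line is terse, so here is a small sketch of how it can be read. It assumes that each mention is printed as (sentence number, head token index, [first token, end token)), with 1-based indices and an exclusive end; that reading matches the output above but is not taken from the CoreNLP documentation. Mention, parseMentions and mentionText are hypothetical names introduced for this illustration.

```fsharp
open System.Text.RegularExpressions

// Assumed layout of one mention: (sentence, head, [start, end)).
type Mention = { Sentence: int; Head: int; Start: int; EndExclusive: int }

let mentionPattern = Regex(@"\((\d+),(\d+),\[(\d+),(\d+)\)\)")

/// Extracts every mention found in a line of the "Coreference set" output.
let parseMentions (line: string) : Mention list =
    mentionPattern.Matches(line)
    |> Seq.cast<Match>
    |> Seq.map (fun m ->
        let group (i: int) = int m.Groups.[i].Value
        { Sentence = group 1; Head = group 2; Start = group 3; EndExclusive = group 4 })
    |> List.ofSeq

/// Looks up the words of a mention in tokenized sentences (1-based indices).
let mentionText (sentences: string[][]) (m: Mention) : string =
    sentences.[m.Sentence - 1].[m.Start - 1 .. m.EndExclusive - 2]
    |> String.concat " "

// Tokens copied from the two sentences in the output above.
let sentences =
    [| [| "Kosgi"; "Santosh"; "sent"; "an"; "email"; "to"; "Stanford"; "University"; "." |]
       [| "He"; "did"; "n't"; "get"; "a"; "reply"; "." |] |]

parseMentions "(2,1,[1,2)) -> (1,2,[1,3))"
|> List.iter (fun m -> printfn "%A -> \"%s\"" m (mentionText sentences m))
```

Under that assumption, the first mention is token 1 of sentence 2 ("He") and the second covers tokens 1 and 2 of sentence 1 ("Kosgi Santosh").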

C# Sample

C# samples are also available on GitHub.

Stanford Temporal Tagger (SUTime)

SUTime is a library for recognizing and normalizing time expressions. SUTime is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility.

There is one more useful thing that we can do with CoreNLP: time extraction. The way we use CoreNLP here is pretty similar to the previous sample. First, we create an annotation pipeline and add all the required annotators to it. (Notice that this sample also uses the ! operator defined at the beginning of the post.)

let pipeline = AnnotationPipeline()
pipeline.addAnnotator(PTBTokenizerAnnotator(false))
pipeline.addAnnotator(WordsToSentencesAnnotator(false))

let tagger = MaxentTagger(! @"pos-tagger\english-bidirectional\english-bidirectional-distsim.tagger")
pipeline.addAnnotator(POSTaggerAnnotator(tagger))

let sutimeRules =
    [| ! @"sutime\defs.sutime.txt";
       ! @"sutime\english.holidays.sutime.txt";
       ! @"sutime\english.sutime.txt" |]
    |> String.concat ","
let props = Properties()
props.setProperty("sutime.rules", sutimeRules ) |> ignore
props.setProperty("sutime.binders", "0") |> ignore
pipeline.addAnnotator(TimeAnnotator("sutime", props))

Now we are ready to annotate something. This part is the same as in the previous sample.

let text = "Three interesting dates are 18 Feb 1997, the 20th of july and 4 days from today."
let annotation = Annotation(text)
annotation.set(CoreAnnotations.DocDateAnnotation().getClass(), "2013-07-14") |> ignore
pipeline.annotate(annotation)

And finally, we need to interpret the annotation results.

printfn "%O\n" (annotation.get(CoreAnnotations.TextAnnotation().getClass()))
let timexAnnsAll = annotation.get(TimeAnnotations.TimexAnnotations().getClass()) :?> java.util.ArrayList
for cm in timexAnnsAll |> Seq.cast<CoreMap> do
    let tokens = cm.get(CoreAnnotations.TokensAnnotation().getClass()) :?> java.util.List
    let first = tokens.get(0)
    let last = tokens.get(tokens.size() - 1)
    let time = cm.get(TimeExpression.Annotation().getClass()) :?> TimeExpression
    printfn "%A [from char offset '%A' to '%A'] --> %A"
        cm first last (time.getTemporal())

The full code sample is available on GitHub. If you run it, you will see the following result:

18 Feb 1997 [from char offset '18' to '1997'] --> 1997-2-18
the 20th of july [from char offset 'the' to 'July'] --> XXXX-7-20
4 days from today [from char offset '4' to 'today'] --> THIS P1D OFFSET P4D
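The last expression is relative, and in this output it is reported as an offset (THIS P1D OFFSET P4D) rather than as a calendar date. Assuming this means "the current day, shifted by the ISO 8601 duration P4D (four days)", it can be resolved by hand against the document date that was set on the annotation (2013-07-14). The sketch below is plain .NET date arithmetic for illustration only; it does not call the SUTime API.

```fsharp
open System
open System.Globalization

// Document date set on the annotation in the sample above.
let docDate = DateTime(2013, 7, 14)

// P4D is the ISO 8601 duration for four days (assumed meaning of the offset).
let offset = TimeSpan.FromDays(4.0)

let resolved = docDate + offset
printfn "4 days from today = %s" (resolved.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
```

With these inputs the resolved date is 2013-07-18.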

C# Sample

C# samples are also available on GitHub.

Conclusion

This is a pretty awesome library. I hope you enjoy it. Try it out right now!

There are some other more specific Stanford packages that are already available on NuGet:

Stanford Word Segmenter is available on NuGet

09/09/2013 25/02/2021 F#, Machine Learning and NLP 2 Comments

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

One more tool from the Stanford NLP software package became available on NuGet today: the Stanford Word Segmenter. This is the fourth Stanford NuGet package that I have published; the previous ones were the “Stanford Parser”, the “Stanford Named Entity Recognizer (NER)” and the “Stanford Log-linear Part-Of-Speech Tagger”. Please follow these steps to get started:

Install-Package Stanford.NLP.Segmenter
Download models from The Stanford NLP Group site.
Extract models from the ‘data’ folder.
You are ready to start.

F# Sample

For more details see source code on GitHub.

open java.util
open edu.stanford.nlp.ie.crf

[<EntryPoint>]
let main argv =
    if (argv.Length <> 1) then
        printf "usage: StanfordSegmenter.Csharp.Samples.exe filename"
    else
        let props = Properties();
        props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data") |> ignore
        props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz") |> ignore
        props.setProperty("testFile", argv.[0]) |> ignore
        props.setProperty("inputEncoding", "UTF-8") |> ignore
        props.setProperty("sighanPostProcessing", "true") |> ignore

        let segmenter = CRFClassifier(props)
        segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props)
        segmenter.classifyAndWriteAnswers(argv.[0])
    0

C# Sample

For more details see source code on GitHub.

using java.util;
using edu.stanford.nlp.ie.crf;

namespace StanfordSegmenter.Csharp.Samples
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                System.Console.WriteLine("usage: StanfordSegmenter.Csharp.Samples.exe filename");
                return;
            }

            var props = new Properties();
            props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data");
            props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz");
            props.setProperty("testFile", args[0]);
            props.setProperty("inputEncoding", "UTF-8");
            props.setProperty("sighanPostProcessing", "true");

            var segmenter = new CRFClassifier(props);
            segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props);
            segmenter.classifyAndWriteAnswers(args[0]);
        }
    }
}

Stanford Log-linear Part-Of-Speech Tagger is available on NuGet

14/07/2013 25/02/2021 F#, Machine Learning and NLP 35 Comments

Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.

One more tool has become available on NuGet today: the Stanford Log-linear Part-Of-Speech Tagger. This is the third Stanford NuGet package that I have published; the previous ones were the “Stanford Parser” and the “Stanford Named Entity Recognizer (NER)”. I have already posted about this tool, with guidance on how to recompile it and use it from F# (see “NLP: Stanford POS Tagger with F# (.NET)”). Please follow these steps to get started:

Install-Package Stanford.NLP.POSTagger
Download models from The Stanford NLP Group site.
Extract models from the ‘models’ folder.
You are ready to start.

F# Sample

For more details see source code on GitHub.

let model = @"..\..\..\..\temp\stanford-postagger-2013-06-20\models\wsj-0-18-bidirectional-nodistsim.tagger"

let tagReader (reader:Reader) =
    let tagger = MaxentTagger(model)
    MaxentTagger.tokenizeText(reader)
    |> Iterable.toSeq
    |> Seq.iter (fun sentence ->
        let tSentence = tagger.tagSentence(sentence :?> List)
        printfn "%O" (Sentence.listToString(tSentence, false))
    )

let tagFile (fileName:string) =
    tagReader (new BufferedReader(new FileReader(fileName)))

let tagText (text:string) =
    tagReader (new StringReader(text))

C# Sample

For more details see source code on GitHub.

public static class TaggerDemo
{
    public const string Model =
        @"..\..\..\..\temp\stanford-postagger-2013-06-20\models\wsj-0-18-bidirectional-nodistsim.tagger";

    private static void TagReader(Reader reader)
    {
        var tagger = new MaxentTagger(Model);
        foreach (List sentence in MaxentTagger.tokenizeText(reader).toArray())
        {
             var tSentence = tagger.tagSentence(sentence);
             System.Console.WriteLine(Sentence.listToString(tSentence, false));
        }
    }

    public static void TagFile (string fileName)
    {
        TagReader(new BufferedReader(new FileReader(fileName)));
    }

    public static void TagText(string text)
    {
        TagReader(new StringReader(text));
    }
}

Both samples produce the same output. For example, if you start the program with these parameters:

1 text "A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads 
text in some language and assigns parts of speech to each word (and other token), 
such as noun, verb, adjective, etc., although generally computational 
applications use more fine-grained POS tags like 'noun-plural'."

Then you will see the following on your screen:

A/DT Part-Of-Speech/NNP Tagger/NNP -LRB-/-LRB- POS/NNP Tagger/NNP -RRB-/-RRB- 
is/VBZ a/DT piece/NN of/IN software/NN that/WDT reads/VBZ text/NN in/IN some/DT 
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO each/DT word/NN 
-LRB-/-LRB- and/CC other/JJ token/JJ -RRB-/-RRB- ,/, such/JJ as/IN noun/JJ ,/, 
verb/JJ ,/, adjective/JJ ,/, etc./FW ,/, although/IN generally/RB computational/JJ 
applications/NNS use/VBP more/RBR fine-grained/JJ POS/NNP tags/NNS like/IN `/`` 
noun-plural/JJ '/'' ./.
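The tagger output is a flat string of word/TAG tokens, so post-processing it takes only a few lines. A minimal sketch follows, with parseTagged as a hypothetical helper name. It splits each token on its last slash, which keeps tokens such as -LRB-/-LRB- and ,/, intact.

```fsharp
open System

/// Splits tagger output such as "A/DT piece/NN" into (word, tag) pairs.
/// Tokens without a slash are skipped.
let parseTagged (output: string) : (string * string) list =
    output.Split([| ' '; '\t'; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun token ->
        match token.LastIndexOf('/') with
        | -1 -> None
        | i -> Some (token.Substring(0, i), token.Substring(i + 1)))
    |> List.ofArray

// A fragment of the output shown above.
let tagged = parseTagged "A/DT Part-Of-Speech/NNP Tagger/NNP -LRB-/-LRB- is/VBZ a/DT piece/NN of/IN software/NN ,/, ./."

// Keep only the nouns: Penn Treebank noun tags start with "NN".
let nouns =
    tagged
    |> List.filter (fun (_, tag) -> tag.StartsWith("NN"))
    |> List.map fst

printfn "%A" nouns
```

For this fragment the nouns are Part-Of-Speech, Tagger, piece and software.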

Stanford Named Entity Recognizer (NER) is available on NuGet

12/07/2013 25/02/2021 F#, Machine Learning and NLP 18 Comments

Update (2017, July 24): Links and/or samples in this post might be outdated. The latest version of the samples is available on the new Stanford.NLP.NET site.

One more tool from the Stanford NLP product line became available on NuGet today. It is the second library that was recompiled and published to NuGet. The first one was the “Stanford Parser”; the second one is the Stanford Named Entity Recognizer (NER). I have already posted about this tool, with guidance on how to recompile it and use it from F# (see “NLP: Stanford Named Entity Recognizer with F# (.NET)”). Some other interesting things have happened too, and NER is kind of a hot topic: I recently saw a question about C# NER on CodeProject, and Flo asked me about NER in a comment on another post. So, I am happy to make it more widely available. The flow of use is as follows:

Install-Package Stanford.NLP.NER
Download models from The Stanford NLP Group site.
Extract models from the ‘classifiers’ folder.
You are ready to start.

F# Sample

The F# sample is pretty much the same as in the “NLP: Stanford Named Entity Recognizer with F# (.NET)” post. For more details see the source code on GitHub.

let main file =
    let classifier =
        CRFClassifier.getClassifierNoExceptions(
             @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz")
    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes.  For the file, it shows both how to run NER on a String
    // and how to run it on a whole file.  For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.
    match file with
    | Some(fileName) ->
        let fileContents = File.ReadAllText(fileName)
        classifier.classify(fileContents)
        |> Iterable.toSeq
        |> Seq.cast<java.util.List>
        |> Seq.iter (fun sentence ->
            sentence
            |> Iterable.toSeq
            |> Seq.cast<CoreLabel>
            |> Seq.iter (fun word ->
                 printf "%s/%O " (word.word()) (word.get(CoreAnnotations.AnswerAnnotation().getClass()))
            )
            printfn ""
        )
    | None ->
        let s1 = "Good afternoon Rajat Raina, how are you today?"
        let s2 = "I go to school at Stanford University, which is located in California."
        printfn "%s\n" (classifier.classifyToString(s1))
        printfn "%s\n" (classifier.classifyWithInlineXML(s2))
        printfn "%s\n" (classifier.classifyToString(s2, "xml", true));
        classifier.classify(s2)
        |> Iterable.toSeq
        |> Seq.iteri (fun i coreLabel ->
            printfn "%d\n:%O\n" i coreLabel
        )

C# Sample

The C# version is quite similar. For more details see the source code on GitHub.

class Program
{
    public static CRFClassifier Classifier =
        CRFClassifier.getClassifierNoExceptions(
             @"..\..\..\..\temp\stanford-ner-2013-06-20\classifiers\english.all.3class.distsim.crf.ser.gz");

    // For either a file to annotate or for the hardcoded text example,
    // this demo file shows two ways to process the output, for teaching
    // purposes.  For the file, it shows both how to run NER on a String
    // and how to run it on a whole file.  For the hard-coded String,
    // it shows how to run it on a single sentence, and how to do this
    // and produce an inline XML output format.

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            var fileContent = File.ReadAllText(args[0]);
            foreach (List sentence in Classifier.classify(fileContent).toArray())
            {
                foreach (CoreLabel word in sentence.toArray())
                {
                    Console.Write( "{0}/{1} ", word.word(), word.get(new CoreAnnotations.AnswerAnnotation().getClass()));
                }
                Console.WriteLine();
            }
        } else
        {
            const string S1 = "Good afternoon Rajat Raina, how are you today?";
            const string S2 = "I go to school at Stanford University, which is located in California.";
            Console.WriteLine("{0}\n", Classifier.classifyToString(S1));
            Console.WriteLine("{0}\n", Classifier.classifyWithInlineXML(S2));
            Console.WriteLine("{0}\n", Classifier.classifyToString(S2, "xml", true));

            var classification = Classifier.classify(S2).toArray();
            for (var i = 0; i < classification.Length; i++)
            {
                Console.WriteLine("{0}\n:{1}\n", i, classification[i]);
            }
        }
    }
}

As a result of both samples you will see the following output:

Don/PERSON Syme/PERSON is/O an/O Australian/O computer/O scientist/O and/O a/O 
Principal/O Researcher/O at/O Microsoft/ORGANIZATION Research/ORGANIZATION ,/O 
Cambridge/LOCATION ,/O U.K./LOCATION ./O He/O is/O the/O designer/O and/O 
architect/O of/O the/O F/O #/O programming/O language/O ,/O described/O by/O 
a/O reporter/O as/O being/O regarded/O as/O ``/O the/O most/O original/O new/O 
face/O in/O computer/O languages/O since/O Bjarne/PERSON Stroustrup/PERSON 
developed/O C/O +/O +/O in/O the/O early/O 1980s/O ./O
Earlier/O ,/O Syme/PERSON created/O generics/O in/O the/O ./O NET/O Common/O 
Language/O Runtime/O ,/O including/O the/O initial/O design/O of/O generics/O 
for/O the/O C/O #/O programming/O language/O ,/O along/O with/O others/O 
including/O Andrew/PERSON Kennedy/PERSON and/O later/O Anders/PERSON 
Hejlsberg/PERSON ./O Kennedy/PERSON ,/O Syme/PERSON and/O Yu/PERSON also/O 
formalized/O this/O widely/O used/O system/O ./O
He/O holds/O a/O Ph.D./O from/O the/O University/ORGANIZATION of/ORGANIZATION 
Cambridge/ORGANIZATION ,/O and/O is/O a/O member/O of/O the/O WG2/O .8/O 
working/O group/O on/O functional/O programming/O ./O He/O is/O a/O co-author/O 
of/O the/O book/O Expert/O F/O #/O 2.0/O ./O
In/O the/O past/O he/O also/O worked/O on/O formal/O specification/O ,/O 
interactive/O proof/O ,/O automated/O verification/O and/O proof/O description/O 
languages/O ./O
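In this output every token carries its own label, so a multi-word name such as Don/PERSON Syme/PERSON still has to be stitched back together. A minimal sketch of one way to do that is shown below, assuming that consecutive tokens with the same non-O label belong to one entity. extractEntities is a hypothetical helper, not part of the Stanford NER API.

```fsharp
open System

/// Groups consecutive tokens that share a non-"O" label into (label, text) pairs.
let extractEntities (output: string) : (string * string) list =
    // Split "word/LABEL" on the last slash so that words containing '/' survive.
    let parse (token: string) =
        match token.LastIndexOf('/') with
        | -1 -> None
        | i -> Some (token.Substring(0, i), token.Substring(i + 1))
    // Move the entity that is being built (if any) to the accumulated results.
    let flush current acc =
        match current with
        | Some (label, words) -> (label, String.concat " " (List.rev words)) :: acc
        | None -> acc
    let folder (current, acc) (word, label) =
        match current with
        | Some (l, words) when l = label -> (Some (l, word :: words), acc)
        | _ when label = "O" -> (None, flush current acc)
        | _ -> (Some (label, [ word ]), flush current acc)
    let (last, acc) =
        output.Split([| ' '; '\t'; '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.choose parse
        |> Array.fold folder (None, [])
    flush last acc |> List.rev

// A fragment of the output shown above.
let entities =
    extractEntities "Don/PERSON Syme/PERSON is/O a/O Principal/O Researcher/O at/O Microsoft/ORGANIZATION Research/ORGANIZATION ,/O Cambridge/LOCATION ,/O U.K./LOCATION ./O"

entities |> List.iter (fun (label, text) -> printfn "%s: %s" label text)
```

A limitation of this simple grouping is that two different entities of the same type that directly follow each other, with no O token in between, would be merged into one.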