Update (2014, January 3): Links and/or samples in this post might be outdated. The latest version of samples are available on new Stanford.NLP.NET site.
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.
One more tool from Stanford NLP Software Package become ready on NuGet today. It is a Stanford Word Segmenter. This is a fourth one Stanford NuGet package published by me, previous ones were a “Stanford Parser“, “Stanford Named Entity Recognizer (NER)” and “Stanford Log-linear Part-Of-Speech Tagger“. Please follow next steps to get started:
- Install-Package Stanford.NLP.Segmenter
- Download models from The Stanford NLP Group site.
- Extract models from ’data‘ folder.
- You are ready to start.
F# Sample
For more details see source code on GitHub.
open java.util
open edu.stanford.nlp.ie.crf
[<EntryPoint>]
let main argv =
if (argv.Length <> 1) then
printf "usage: StanfordSegmenter.Csharp.Samples.exe filename"
else
let props = Properties();
props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data") |> ignore
props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz") |> ignore
props.setProperty("testFile", argv.[0]) |> ignore
props.setProperty("inputEncoding", "UTF-8") |> ignore
props.setProperty("sighanPostProcessing", "true") |> ignore
let segmenter = CRFClassifier(props)
segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props)
segmenter.classifyAndWriteAnswers(argv.[0])
0
C# Sample
For more details see source code on GitHub.
using java.util;
using edu.stanford.nlp.ie.crf;
namespace StanfordSegmenter.Csharp.Samples
{
class Program
{
static void Main(string[] args)
{
if (args.Length != 1)
{
System.Console.WriteLine("usage: StanfordSegmenter.Csharp.Samples.exe filename");
return;
}
var props = new Properties();
props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data");
props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz");
props.setProperty("testFile", args[0]);
props.setProperty("inputEncoding", "UTF-8");
props.setProperty("sighanPostProcessing", "true");
var segmenter = new CRFClassifier(props);
segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props);
segmenter.classifyAndWriteAnswers(args[0]);
}
}
}
Discover more from Sergey Tihon's Blog
Subscribe to get the latest posts sent to your email.

2 thoughts on “Stanford Word Segmenter is available on NuGet”