Update (2014, January 3): Links and/or samples in this post might be outdated. The latest versions of the samples are available on the new Stanford.NLP.NET site.
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.
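To make the English case described above more concrete, here is a minimal sketch of punctuation splitting and possessive separation. It is not how the Stanford tools tokenize text. The `NaiveTokenizer` class and its regular expression are hypothetical names made up for this illustration, and the pattern handles only the straight apostrophe form of `'s`.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any Stanford package.
static class NaiveTokenizer
{
    // Alternatives are tried left to right at each position:
    //   1. a possessive suffix ('s) that ends at a word boundary,
    //   2. a run of letters or digits,
    //   3. a single character that is neither whitespace, a letter, nor a digit.
    private static readonly Regex TokenPattern =
        new Regex(@"'s\b|[\p{L}\p{N}]+|[^\s\p{L}\p{N}]", RegexOptions.IgnoreCase);

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(text))
        {
            tokens.Add(match.Value);
        }
        return tokens;
    }
}
```

Calling `NaiveTokenizer.Tokenize("John's parser works, mostly.")` splits the possessive and the punctuation into separate tokens: `John`, `'s`, `parser`, `works`, `,`, `mostly`, `.`. Text written without spaces between words, such as Chinese, cannot be handled by a pattern like this, which is why a trained segmenter is needed.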
One more tool from the Stanford NLP software package is available on NuGet as of today: the Stanford Word Segmenter. This is the fourth Stanford NuGet package I have published; the previous ones were the “Stanford Parser”, “Stanford Named Entity Recognizer (NER)” and “Stanford Log-linear Part-Of-Speech Tagger”. Follow these steps to get started:
- Install-Package Stanford.NLP.Segmenter
- Download the models from the Stanford NLP Group site.
- Extract the models from the 'data' folder.
- You are ready to start.
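Before running the samples, it can help to confirm that the extracted models are where the code expects them. The sketch below is an assumption-laden convenience, not part of the package: `ModelCheck` and `MissingModelFiles` are hypothetical names, and the two file names are simply the ones the samples in this post load from the data folder.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not part of Stanford.NLP.Segmenter.
static class ModelCheck
{
    // Returns the paths that the samples in this post expect but that do not exist.
    // An empty list means the data folder looks complete for these samples.
    public static List<string> MissingModelFiles(string dataDir)
    {
        var missing = new List<string>();
        if (!Directory.Exists(dataDir))
        {
            // Without the folder itself there is no point checking individual files.
            missing.Add(dataDir);
            return missing;
        }

        // File names taken from the samples: the serialized dictionary and the model.
        foreach (var name in new[] { "dict-chris6.ser.gz", "ctb.gz" })
        {
            var path = Path.Combine(dataDir, name);
            if (!File.Exists(path))
            {
                missing.Add(path);
            }
        }
        return missing;
    }
}
```

Passing the path of the extracted 'data' folder and printing any returned entries gives a clearer message than a model that fails to load at run time.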
F# Sample
For more details, see the source code on GitHub.

    open java.util
    open edu.stanford.nlp.ie.crf

    [<EntryPoint>]
    let main argv =
        if (argv.Length <> 1) then
            printf "usage: StanfordSegmenter.Csharp.Samples.exe filename"
        else
            let props = Properties()
            // Folder with the extracted segmenter data and the serialized dictionary
            props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data") |> ignore
            props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz") |> ignore
            // File to segment and its encoding
            props.setProperty("testFile", argv.[0]) |> ignore
            props.setProperty("inputEncoding", "UTF-8") |> ignore
            props.setProperty("sighanPostProcessing", "true") |> ignore

            // Load the model, then segment the input file
            let segmenter = CRFClassifier(props)
            segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props)
            segmenter.classifyAndWriteAnswers(argv.[0])
        0
C# Sample
For more details, see the source code on GitHub.

    using java.util;
    using edu.stanford.nlp.ie.crf;

    namespace StanfordSegmenter.Csharp.Samples
    {
        class Program
        {
            static void Main(string[] args)
            {
                if (args.Length != 1)
                {
                    System.Console.WriteLine("usage: StanfordSegmenter.Csharp.Samples.exe filename");
                    return;
                }

                var props = new Properties();
                // Folder with the extracted segmenter data and the serialized dictionary
                props.setProperty("sighanCorporaDict", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data");
                props.setProperty("serDictionary", @"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\dict-chris6.ser.gz");
                // File to segment and its encoding
                props.setProperty("testFile", args[0]);
                props.setProperty("inputEncoding", "UTF-8");
                props.setProperty("sighanPostProcessing", "true");

                // Load the model, then segment the input file
                var segmenter = new CRFClassifier(props);
                segmenter.loadClassifierNoExceptions(@"..\..\..\..\temp\stanford-segmenter-2013-06-20\data\ctb.gz", props);
                segmenter.classifyAndWriteAnswers(args[0]);
            }
        }
    }