ucto − Unicode Tokenizer
ucto [[options]] [input-file] [[output-file]]
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
−c configfile
read settings from a file
−d value
set debug mode to ’value’
−e value
set input encoding. (default UTF8)
−N value
set UTF8 output normalization. (default NFC)
−−filter=[YES|NO]
enable or disable filtering of special characters. (default YES) These special characters can be specified in the [FILTER] block of the configuration file.
−f
OBSOLETE. use --filter=NO
−L language
Automatically selects a configuration file by language code. The language code is generally a three-letter ISO 639-3 code. For example, ’fra’ will select the file tokconfig-fra from the installation directory.
−−detectlanguages=<lang1,lang2,..langn>
try to detect all the specified languages. The default language will be ’lang1’. (only useful for FoLiA output).
All language codes must be iso-639-3.
You can use the special language code ’und’. In that case there is NO default language, and text in any language that is NOT in the list will remain unanalyzed.
Warning: To be able to handle utterances of mixed language, Ucto uses a simple sentence splitter based on the markers ’.’ ’?’ and ’!’. This may occasionally lead to surprising results.
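The kind of surprise this warning refers to can be illustrated with a minimal sketch. The function below is NOT ucto's implementation; it is a hypothetical splitter (the name naive_split and the whitespace rule are assumptions) that breaks after ’.’, ’?’ or ’!’ and therefore also breaks after abbreviations:

```python
import re

# Assumption for illustration only: a sentence ends at '.', '?' or '!'
# when that marker is followed by whitespace. ucto's real rules may differ.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def naive_split(text):
    """Split text into sentences using only the three end markers."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


if __name__ == "__main__":
    # The abbreviation "Dr." is wrongly treated as a sentence of its own.
    for sentence in naive_split("Dr. Smith arrived. Was he late? No!"):
        print(sentence)
```

A splitter this simple cannot tell an abbreviation from a sentence end, which is one plausible source of the surprising results mentioned above.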
−l
Convert to all lowercase
−u
Convert to all uppercase
−n
Emit one sentence per line on output
−m
Assume one sentence per line on input
−−normalize=class1,class2,..,classn
map all occurrences of tokens with class1,...,classn to their generic names. For example, −−normalize=DATE will map all dates to the word {{DATE}}. Very useful for normalizing tokens such as URLs, dates, e-mail addresses and so on.
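The effect of this option can be sketched as follows. This is an illustration only, not ucto code: the function normalize_tokens, the (text, class) pair representation and the class names WORD and PUNCTUATION are assumptions; only DATE and the {{DATE}} output form come from the description above.

```python
def normalize_tokens(tokens, classes):
    """Replace each token whose class is listed by its generic name.

    tokens  -- list of (text, token_class) pairs (hypothetical representation)
    classes -- class names to normalize, e.g. ["DATE"]
    """
    wanted = set(classes)
    result = []
    for text, token_class in tokens:
        if token_class in wanted:
            # Generic name in the {{CLASS}} form described for --normalize
            result.append("{{" + token_class + "}}")
        else:
            result.append(text)
    return result


if __name__ == "__main__":
    sample = [("Due", "WORD"), ("2024-01-31", "DATE"), (".", "PUNCTUATION")]
    print(" ".join(normalize_tokens(sample, ["DATE"])))
```

Tokens whose class is not listed pass through unchanged, so the option can be limited to exactly the classes that should be collapsed.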
−T value or −−textredundancy=value
set text redundancy level for text nodes in FoLiA output:
’full’ - add text to all levels: <p> <s> <w> etc.
’minimal’ - don’t introduce text on higher levels, but retain what is already there.
’none’ - only introduce text on <w>, AND remove all text from higher levels
−−allow-word-correction
Allow ucto to tokenize inside FoLiA Word elements, creating FoLiA Corrections
−−ignore-tag-hints
Skip all tag=token hints from the FoLiA input. These hints can be used to signal text markup like subscript and superscript
−−add−tokens="file"
Add additional tokens to the [TOKENS] block of the default language. The file should contain one TOKEN per line.
−−passthru
Don’t tokenize, but perform input decoding and simple token role detection
−−filterpunct
remove most of the punctuation from the output. (not from abbreviations and embedded punctuation like John’s)
−P
Disable Paragraph Detection
−Q
Enable Quote Detection. (this is experimental and may lead to unexpected results)
−s <string>
Set End-of-sentence marker. (Default <utt>)
−V or −−version
Show version information
−v
set Verbose mode
−F
Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: −nPQvs) For files with an ’.xml’ extension, −F is the default.
−−inputclass="cls"
When tokenizing a FoLiA XML document, search for text nodes of class ’cls’. The default is "current".
−−outputclass="cls"
When tokenizing a FoLiA XML document, output the tokenized text in text nodes with ’cls’. The default is "current". It is recommended to have different classes for input and output.
−−textclass="cls"(obsolete)
use ’cls’ for input and output of text from FoLiA. (Equivalent to both −−inputclass=’cls’ and −−outputclass=’cls’)
This option is obsolete and NOT recommended. Please use the separate −−inputclass= and −−outputclass= options.
−−copyclass
when ucto is used on FoLiA with fully tokenized text in inputclass=’inputclass’, no text in textclass ’outputclass’ is produced. (A warning will be given). To circumvent this, add the −−copyclass option, which ensures that text will be emitted in that class.
−X
Output FoLiA XML. (this disables usage of most other options: −nPQvs)
−−id <DocId>
Use the specified Document ID for the FoLiA XML
−x <DocId> (obsolete)
Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: −nPQvs).
Obsolete. Use −X and −−id instead.
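As a usage illustration, the sketch below assembles an ucto command line from a few of the options described above (−L, −n, −l), following the synopsis order of options, input file, output file. The helper build_ucto_command and the file names are hypothetical; options are written with plain ASCII hyphens, as they would be typed in a shell. The command is only built and printed, not executed.

```python
def build_ucto_command(input_file, output_file=None, language=None,
                       sentence_per_line=False, lowercase=False):
    """Return an argument list for ucto (hypothetical helper)."""
    command = ["ucto"]
    if language is not None:
        command += ["-L", language]   # select configuration by language code
    if sentence_per_line:
        command.append("-n")          # one sentence per line on output
    if lowercase:
        command.append("-l")          # convert to all lowercase
    command.append(input_file)
    if output_file is not None:
        command.append(output_file)
    return command


if __name__ == "__main__":
    # File names are placeholders chosen for this example.
    print(" ".join(build_ucto_command("input.txt", "output.txt",
                                      language="fra",
                                      sentence_per_line=True)))
```

The resulting list could be passed to subprocess.run on a system where ucto is installed; that step is left out here so the sketch does not depend on a local installation.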
likely
Maarten van Gompel [email protected]
Ko van der Sloot [email protected]