Experiments show the following: only digits marked with $ are detected as money.

------sentences: [The drop last week unwound most of the prior week's jump,
suggesting employers were not laying off workers in response to tighter
fiscal policy, especially the $85 billion in across-the-board government
spending cuts that have dampened factory activity]
------tokenizing
------finding money
[[29..32) money]
[29..32) money
prepare model
------sentences: [buy milk $2]
------tokenizing
buy
milk
$
2
------finding money
[[2..4) money]
[2..4) money
------pos tagging
VB
NN
$
CD
------saving message to database
prepare model
------sentences: [buy milk usd 2]
------tokenizing
buy
milk
usd
2
------finding money
[]
------pos tagging
VB
NN
CD
------saving message to database
prepare model
------sentences: [Buy milk two Dollars]
------tokenizing
Buy
milk
two
Dollars
------finding money
[]

I have not noticed any difference between SimpleTokenizer and TokenizerME in
this case.
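
For reading the log: the spans are token index ranges, start inclusive and
end exclusive, so [2..4) over the tokens buy, milk, $, 2 covers "$ 2". A
minimal sketch of that reading in plain Java; SpanDemo and covered are
hypothetical names used only for illustration, they are not part of OpenNLP:

```java
import java.util.Arrays;

public class SpanDemo {
    // Hypothetical helper: joins tokens[start..end) with single spaces.
    // start is inclusive, end is exclusive, matching the "[2..4)" notation in the log.
    public static String covered(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Tokens as printed in the log for the sentence "buy milk $2".
        String[] tokens = {"buy", "milk", "$", "2"};
        System.out.println(covered(tokens, 2, 4)); // prints "$ 2"
    }
}
```

Read the same way, the empty list [] for "buy milk usd 2" means no token range was tagged as money.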


On Thu, May 23, 2013 at 5:00 PM, Jörn Kottmann <[email protected]> wrote:

> On 05/23/2013 02:56 PM, Яков Керанчук wrote:
>
>> Thanks for the suggestion about using my own model, I'll try it.
>>
>> I use standard en-token.bin model, text contains mixed upper-lower case
>> words.
>>
>
> For the English model you should use the SimpleTokenizer, the token output
> from the en-token.bin model is not compatible with the training data.
>
> Jörn
>



-- 
Best regards,
Yakov Keranchuk
+79263768032
