No, there are no differences in your samples. Try using uppercase USD
instead of usd.
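
If the input cannot be changed at the source, the tokens could be
normalized before they are passed to the name finder. The following is
only a rough sketch under that assumption: the class CurrencyCase, the
method upperCaseCurrencyCodes and the list of codes are made up for
illustration and are not part of OpenNLP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: upper-cases known currency
// codes so that a token like "usd" reaches the name finder as "USD".
class CurrencyCase {

    // Illustrative list only; extend it with the codes your text uses.
    private static final Set<String> CODES =
            new HashSet<>(Arrays.asList("USD", "EUR", "GBP"));

    static String[] upperCaseCurrencyCodes(String[] tokens) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String upper = tokens[i].toUpperCase(Locale.ROOT);
            // Replace only tokens that match a known code; keep the rest as-is.
            out[i] = CODES.contains(upper) ? upper : tokens[i];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"buy", "milk", "usd", "2"};
        // Prints [buy, milk, USD, 2]
        System.out.println(Arrays.toString(upperCaseCurrencyCodes(tokens)));
    }
}
```

Whether the model then marks "USD 2" as money still has to be checked
against the model itself; the sketch only changes the casing.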

The model was trained on English news text from the 90s; try giving it some (old) news
articles for testing.

Jörn

On 05/23/2013 03:16 PM, Яков Керанчук wrote:
Experiments show the following results: only digits marked with $ are detected as money

------sentences: [The drop last week unwound most of the prior week's jump,
suggesting employers were not laying off workers in response to tighter
fiscal policy, especially the $85 billion in across-the-board government
spending cuts that have dampened factory activity]
------tokenizing
------finding money
[[29..32) money]
[29..32) money
prepare model
------sentences: [buy milk $2]
------tokenizing
buy
milk
$
2
------finding money
[[2..4) money]
[2..4) money
------pos tagging
VB
NN
$
CD
------saving message to database
prepare model
------sentences: [buy milk usd 2]
------tokenizing
buy
milk
usd
2
------finding money
[]
------pos tagging
VB
NN
CD
------saving message to database
prepare model
------sentences: [Buy milk two Dollars]
------tokenizing
Buy
milk
two
Dollars
------finding money
[]

I have not noticed any difference between SimpleTokenizer and TokenizerME in
this case.


On Thu, May 23, 2013 at 5:00 PM, Jörn Kottmann wrote:

On 05/23/2013 02:56 PM, Яков Керанчук wrote:

Thanks for the suggestion to use my own model, I'll try it.

I use the standard en-token.bin model; the text contains words in mixed
upper and lower case.

For the English model you should use the SimpleTokenizer; the token output
from the en-token.bin model is not compatible with the training data.

Jörn



