Good day, and first of all, it's a pleasure to join you all. At my workplace an interesting dilemma has come up that we have been researching and discussing for the last few days, and I thought this would be a good place to extend that research.
Let me get straight to the heart of it. A few days ago a Chinese customer opened a ticket about getting no results for Chinese phrases such as 短袖V領上衣 and 條紋印花口袋T恤. As you can see, there are Latin letters hidden among the ideograms, and that is not a mistake: V領 stands for V-neck and T恤 stands for T-shirt. The problem is that our current pipeline of tokenizers and filters splits these into separate tokens such as T and 恤, so we never generate the correct matches and those terms end up unsearchable.

I have been looking through the source code and documentation of the classes involved in tokenization and found the rulefiles option. A glance at its specification suggests it is exactly what we need, but I would like to hear what you think. Also, has anyone run into this problem before? I would not be surprised if so. If you solved it this way, could you share the approach you followed, or the rulefile you used? Or any rulefile you know of that fits our needs?

Thank you all,

*Ricardo Soto Estévez* <[email protected]>
BACKEND ENGINEER
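P.S. In case a concrete example helps frame the question: if I have understood the documentation correctly, the rulefiles attribute of Solr's ICUTokenizerFactory takes script:rulefile pairs, so I imagine we would end up with something along these lines. This is only a sketch of what I have in mind, not a tested configuration; the fieldType name and the rulefile name are placeholders of mine, and the rulefile itself (an ICU RuleBasedBreakIterator rules file that would keep Latin runs attached to the adjacent Han characters) is exactly the part we still have to write:

```xml
<!-- Hypothetical sketch, not a working configuration: "text_zh" and
     "LatinHan.rbbi" are placeholder names. The rulefiles attribute maps an
     ISO 15924 script code (here Hani) to a custom ICU break-rules file
     applied to runs of that script. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Hani:LatinHan.rbbi"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One thing I am unsure about is whether per-script rulefiles even apply here, since the tokenizer may already have divided the text into separate Latin and Han script runs before the break rules run; confirming that is part of what I am asking.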
