Good day, and first of all, it's a pleasure to join you all. At my workplace an interesting dilemma has come up that we have been researching and discussing for the last few days, and I thought this would be a good place to extend that research.
Let me get straight to the heart of it. A few days ago a Chinese customer opened a ticket about getting no results for Chinese phrases such as 短袖V領上衣 and 條紋印花口袋T恤. As you can see, there are Latin letters hidden among the ideograms, and that is not a mistake: V領 stands for V-neck and T恤 stands for T-shirt. The problem is that our current pipeline of tokenizers and filters splits these into separate tokens such as T and 恤, so we never generate the correct matches and those terms end up unsearchable.

I have been looking through the source code and documentation of the classes involved in tokenization and found the rulefiles option. A glance at its specification suggests it is exactly what we need, but I would like to hear what you think. Also, has anyone run into this problem before? I would not be surprised if so. If you solved it this way, could you share the approach you followed, or the rulefile you used? Or any rulefile you know of that fits our needs?

Thank you all,

*Ricardo Soto Estévez* <[email protected]>
BACKEND ENGINEER
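P.S. In case a concrete example helps frame the question: if I have understood the documentation correctly, the rulefiles attribute of Solr's ICUTokenizerFactory takes script:rulefile pairs, so I imagine we would end up with something along these lines. This is only a sketch of what I have in mind, not a tested configuration; the fieldType name and the rulefile name are placeholders of mine, and the rulefile itself (an ICU RuleBasedBreakIterator rules file that would keep Latin runs attached to the adjacent Han characters) is exactly the part we still have to write:

```xml
<!-- Hypothetical sketch, not a working configuration: "text_zh" and
     "LatinHan.rbbi" are placeholder names. The rulefiles attribute maps an
     ISO 15924 script code (here Hani) to a custom ICU break-rules file
     applied to runs of that script. -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Hani:LatinHan.rbbi"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One thing I am unsure about is whether per-script rulefiles even apply here, since the tokenizer may already have divided the text into separate Latin and Han script runs before the break rules run; confirming that is part of what I am asking.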
