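# Turn a free-text query into a Sphinx boolean OR query.
def booleanize(query)
  query.gsub(/\W+/, ' ').split(/\s+/).join(' | ')
end

# Same as booleanize, but drops single-character terms.
def booleanize_V2(query)
  query.gsub(/\W+/, ' ').split(/\s+/).select { |s| s.split('').length > 1 }.join(' | ')
end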
We use Sphinx boolean matching to find out whether a question someone is
asking is similar to existing questions: we query in Sphinx's extended2 mode
with boolean operators and the bm25 rank mode. The question "Do you like
robots?" is converted (by the *booleanize* method) to "Do | you | like |
robots" and will match, say, "Are robots cool?" before it matches "Do you
like pizza?" Even though the latter shares a phrase and 3 words with the
asked question, "robots" is a rarer word, so bm25 ranks "Are robots cool?"
higher.
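For reference, here is a minimal, self-contained sketch of that conversion on the English example. It only repeats the *booleanize* method from the top of this message and prints the string that gets passed to Sphinx; nothing here touches Sphinx itself.

```ruby
# Same method as at the top of this message, repeated so the snippet
# runs on its own.
def booleanize(query)
  # Replace every run of non-word characters with a space, split on
  # whitespace, and OR the remaining terms together.
  query.gsub(/\W+/, ' ').split(/\s+/).join(' | ')
end

puts booleanize("Do you like robots?")
# prints: Do | you | like | robots
```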
This works great in English, but I noticed that Sphinx has trouble with
single accented characters:
*"Avez-vous le sentiment d'être \"drogué\" à certains aliments ? (ex.
sucreries, pâtisseries, etc.)"*
causes:
*ThinkingSphinx::SphinxError: index question_core,question_delta: syntax
error, unexpected '|' near ' à | certains | aliments | ex | sucreries |
pâtisseries | etc'*
It also has trouble with special characters:
*"¿ a partir de el 2011 en que fecha hay convocatorias nuevas ?"*
causes:
*ThinkingSphinx::SphinxError: index question_core,question_delta: syntax
error, unexpected '|' near '¿ | a | partir | de | el | 2011 | en | que |
fecha | hay | convocatorias | nuevas'*
So as a temporary hack I switched to *booleanize_V2*, which strips out
single-character terms. But now Korean has come along, and my hack is no
longer viable:
*Question: 스마트폰을 가지고 있나?*
*Sphinx Querying: '스마트폰을 | 가지고 | 있나'*
*Sphinx Sphinx Daemon returned error: index question_core,question_delta:
syntax error, unexpected '|' near '스마트폰을 | 가지고 | 있나'*
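To make it concrete why the single-character filter does not help here, this is a small sketch of just the length test from *booleanize_V2*, applied to terms that are already split. The helper name `drop_single_character_terms` is made up for illustration, and the sketch assumes a Ruby where `split('')` yields whole characters for UTF-8 strings (1.9 or later). Every Korean term is longer than one character, so the filter leaves the query unchanged.

```ruby
# Hypothetical helper: the same length test booleanize_V2 uses, applied
# to terms that have already been split on whitespace.
def drop_single_character_terms(terms)
  terms.select { |s| s.split('').length > 1 }
end

# The lone "a" from the Spanish example is dropped...
puts drop_single_character_terms(['a', 'partir', 'de']).join(' | ')

# ...but none of the Korean terms is a single character, so all three stay.
puts drop_single_character_terms(['스마트폰을', '가지고', '있나']).join(' | ')
```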
I've realized there are two problems here:
1. Why is Sphinx flipping out with the UTF-8 characters?
2. Is there a good way to determine word boundaries in multiple languages?
Does anyone know what's going on?
Thanks so much in advance for any help!
Best,
Aaron