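# Turn a free-text question into a Sphinx boolean OR query, e.g.
# "Do you like robots?" => "Do | you | like | robots"
def booleanize(query)
  query.gsub(/\W+/, ' ').split(/\s+/).join(' | ')
end

# Same as booleanize, but drops single-character tokens.
def booleanize_V2(query)
  query.gsub(/\W+/, ' ').split(/\s+/).select { |s| s.split('').length > 1 }.join(' | ')
end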

We use Sphinx boolean matching to figure out whether someone is asking a
question that is similar to existing questions: we query in Sphinx's
extended2 match mode with boolean operators and the bm25 rank mode. The
question "Do you like robots?" is converted (by the *booleanize* method) to
"Do | you | like | robots" and matches, say, "Are robots cool?" before it
matches "Do you like pizza?" Even though the latter shares a phrase and 3
words with the asked question, "robots" is the rarer word, so bm25 ranks
"Are robots cool?" higher.
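
To make the conversion concrete, here is a small standalone sketch of the English case described above. It only repeats *booleanize* so that it runs on its own; the sample question is the one from this message, and nothing here touches Sphinx itself.

```ruby
# Same logic as booleanize in this message, repeated so the snippet is
# self-contained.
def booleanize(query)
  query.gsub(/\W+/, ' ').split(/\s+/).join(' | ')
end

# The trailing "?" becomes a space, and split drops the trailing empty field.
puts booleanize("Do you like robots?")  # prints: Do | you | like | robots
```

Note that a leading non-word character behaves differently from a trailing one: split(/\s+/) keeps an empty first field, so booleanize("?robots") yields " | robots", with a dangling operator at the front.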

Now this works great in English, but I noticed that Sphinx was having
trouble with single, accented characters:

*"Avez-vous le sentiment d'être \"drogué\" à certains aliments ? (ex. 
sucreries, pâtisseries, etc.)"*

causes:

*ThinkingSphinx::SphinxError: index question_core,question_delta: syntax 
error, unexpected '|' near ' à | certains | aliments | ex | sucreries | 
pâtisseries | etc'*

It also has trouble with special characters:

*"¿ a partir de el 2011 en que fecha hay convocatorias nuevas ?"*

causes:

*ThinkingSphinx::SphinxError: index question_core,question_delta: syntax 
error, unexpected '|' near '¿ | a | partir | de | el | 2011 | en | que | 
fecha | hay | convocatorias | nuevas'*

So as a temporary hack, I switched to *booleanize_V2*, which strips out
single characters. But now Korean has come along, and my hack is no longer
viable:

*Question: 스마트폰을 가지고 있나?*

*Sphinx Querying: '스마트폰을 | 가지고 | 있나'*

*Sphinx Daemon returned error: index question_core,question_delta: 
syntax error, unexpected '|' near '스마트폰을 | 가지고 | 있나'*

I've realized there are two questions here:

   1. Why is Sphinx flipping out with the utf-8 characters?
   2. Is there a good way to determine word boundaries in multiple languages?
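
For the second question, one direction I could imagine is sketched below. It is only a sketch under assumptions: *booleanize_unicode* is a hypothetical name, it assumes Ruby 1.9 or later with UTF-8 strings, and it only changes how the query string is built on the Ruby side. It treats runs of Unicode letters, combining marks and digits as words, so it never emits an empty token, but it does nothing for languages that do not put spaces between words, and it does not change which characters Sphinx itself treats as indexable.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Thinking Sphinx): collect runs of
# Unicode letters (\p{L}), combining marks (\p{M}) and digits (\p{N})
# instead of splitting on \W, then join them with the OR operator.
def booleanize_unicode(query)
  query.scan(/[\p{L}\p{M}\p{N}]+/).join(' | ')
end

puts booleanize_unicode("¿ a partir de el 2011 ?")  # prints: a | partir | de | el | 2011
puts booleanize_unicode("스마트폰을 가지고 있나?")     # prints: 스마트폰을 | 가지고 | 있나
```

Unlike *booleanize_V2*, this keeps single-character words such as "a" and drops bare punctuation such as "¿". Whether Sphinx then accepts the resulting query is a separate matter that I have not verified.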
   
Does anyone know what's going on?

Thanks so much in advance for any help!

Best,

Aaron

