http://people.apache.org/~hossman/#solr-dev
Please Use "solr-u...@lucene" Not "solr-...@lucene"

Your question is better suited for the solr-u...@lucene mailing list ...
not the solr-...@lucene list. solr-dev is for discussing development of
the internals of the Solr application ... it is *not* the appropriate
place to ask questions about how to use Solr (or write Solr plugins) when
developing your own applications. Please resend your message to the
solr-user mailing list, where you are likely to get more/better responses
since that list also has a larger number of subscribers.

: Date: Sun, 5 Jul 2009 22:09:07 -0700
: From: Mark Bennett <mbenn...@ideaeng.com>
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: Different structure of standard generated query for CJK vs. Western
:     query
:
: (resending with ALL Asian characters removed from example, which apparently
: trips a filter)
: I'm getting phrase queries instead of implicit "OR" queries with Asian
: text. I first noticed it with the Dismax query handler, but it also happens
: with the Standard query.
:
: Of course Asian text is broken up into N-Gram pairs, I understand that. But
: after analysis (via the Web UI) the 2-character "words" still have spaces in
: between them, so I'd expect similar results to an English sentence which
: also has spaces.
:
: English: (default field title_en)
: User Query: I need help with my iPod
: Generates: title_en:i title_en:need title_en:help title_en:with title_en:my
: title_en:ipod
:
: Japanese: (default field title_cjk)
: User Query: iPodC1C2C3C4C5C6C7...
: Generates: PhraseQuery(title_cjk:"ipod C1C2 C2C3 C3C4 C4C5 C5C6 C6C7")
: The problem is the cjk phrase queries are too rigid, everything has to
: match. Although setting phrase slop helps with proximity, I don't think you
: can tell it to not require 100% of the bigrams to be present.
:
: What I'd like is just: title_cjk:ipod title_cjk:C1C2 title_cjk:C2C3
: title_cjk:C3C4 etc...
: The only theory I have so far, looking through the code and mailing list
: comments, this might have something to do with token offsets? Though the
: start of each token is 1 past the previous one, they do overlap by 1 char
: each time. I'm not sure that's it, nor what the logic would be. Bumping
: the increments from 1 to 3 or 4 would make them no longer overlap, if that's
: all there is to it.
:
: Ideally I'd like the cjk queries to be structured the same as the English
: ones. Also it'd be better if this could be done with just schema or config
: changes, though I realize that's not as likely.
:
: --
: Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
: Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

-Hoss
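
For anyone following the thread, the two query shapes contrasted in the quoted message can be mimicked with a small toy sketch. It is plain Python, not Lucene or Solr code: the function names (`toy_bigram_tokens`, `as_phrase_query`, `as_or_query`), the rough character ranges in `is_cjk`, and the sample text are all assumptions made for illustration, and the tokenizer only approximates the behaviour described above (Latin runs kept whole and lowercased, CJK runs split into overlapping bigrams).

```python
from itertools import groupby


def is_cjk(ch):
    # Rough assumption: only Hiragana/Katakana and the basic CJK ideograph block.
    return "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"


def char_class(ch):
    # Check CJK first, because str.isalnum() is also true for ideographs.
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "latin"
    return "other"


def toy_bigram_tokens(text):
    """Lowercase Latin runs; split CJK runs into overlapping 2-character tokens."""
    tokens = []
    for kind, group in groupby(text, key=char_class):
        run = "".join(group)
        if kind == "latin":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # Whitespace and punctuation ("other") only separate runs.
    return tokens


def as_phrase_query(field, tokens):
    # Shape reported in the message: every token must match, in order.
    return f'{field}:"{" ".join(tokens)}"'


def as_or_query(field, tokens):
    # Shape asked for in the message: one optional clause per token.
    return " ".join(f"{field}:{t}" for t in tokens)


if __name__ == "__main__":
    # Hypothetical sample: "iPod" followed by three Katakana characters.
    sample = "iPod\u30b1\u30fc\u30b9"
    tokens = toy_bigram_tokens(sample)
    print(as_phrase_query("title_cjk", tokens))
    print(as_or_query("title_cjk", tokens))
```

The first printed line corresponds to the rigid phrase form, the second to the per-bigram form that would let a document match on only some of the bigrams.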