Re: eDisMax, multiple language support and stopwords

2013-11-11 Thread Liu Bo
Happy to see that someone has a solution similar to ours.

We have a similar multi-language search feature, and we index content in
different languages into _fr and _en fields, as you have done.

At search time, however, we require a language code as a parameter to specify
the language the client wants to search in, which is normally decided by the
website visited, for example: qf=name description&language=en

Our search components then resolve the right fields to search on: name_en and
description_en.
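A minimal sketch of that field lookup, assuming the language code arrives as a
plain string and the per-language fields follow the name_<code> pattern. The
function names and the supported-language tuple are hypothetical and only
illustrate the idea; this is not the actual search component.

```python
def resolve_fields(base_fields, language, supported=("en", "fr")):
    """Map language-neutral field names to their per-language variants."""
    if language not in supported:
        raise ValueError(f"unsupported language: {language!r}")
    return [f"{name}_{language}" for name in base_fields]


def build_params(query, qf, language):
    """Build eDisMax request parameters that target a single language."""
    # qf arrives as a space-separated list, e.g. "name description"
    fields = resolve_fields(qf.split(), language)
    return {"defType": "edismax", "q": query, "qf": " ".join(fields)}


params = build_params("oscar wilde", "name description", "en")
print(params["qf"])  # name_en description_en
```

Because only one language's fields end up in qf, every query term passes
through a single stopword list, so the clause count stays consistent.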

We used to support searching across all languages and removed that later, since
the site tells the customer which language is supported. We also don't think
we have many language experts on our web sites who know more than two
languages and need to search in them at the same time.






-- 
All the best

Liu Bo


eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Hi all,

Thanks for the help and advice I've got here so far!

Another question - I want to support stopwords at search time, so that e.g.
the query "oscar and wilde" is equivalent to "oscar wilde" (this is with
lowercaseOperators=false). Fair enough, I have the stopword "and" in the query
analyser chain.

However, I also need to support French as well as English, so I've got _en
and _fr versions of the text fields, with appropriate stemming and
stopwords. I index French content into the _fr fields and English into the
_en fields. I'm searching with eDisMax over both versions, e.g.:

<str name="qf">headline_en headline_fr</str>

However, this means I get no results for "oscar and wilde". The parsed
query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:and))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~3))/no_coord

If I add "and" to the French stopwords list, I *do* get results, and the
parsed query is:

(+((DisjunctionMaxQuery((headline_fr:osca | headline_en:oscar))
DisjunctionMaxQuery((headline_fr:wild | headline_en:wild)))~2))/no_coord
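The clause counts in the two parsed queries above (~3 versus ~2) can be
reproduced with a toy model. This is only a sketch: the stopword sets below are
hypothetical and heavily shortened, stemming is ignored, and build_clauses is an
illustrative helper, not anything Solr provides.

```python
# Hypothetical, heavily shortened stopword lists; the real lists are longer.
STOPWORDS = {
    "headline_en": {"and", "the", "of"},
    "headline_fr": {"et", "le", "la", "de"},
}


def build_clauses(query, stopwords):
    """Return one (term, fields) clause per query term that survives analysis.

    A term yields a clause over the fields whose stop filter did not drop it.
    A term dropped by every field yields no clause at all.
    """
    clauses = []
    for term in query.lower().split():
        fields = sorted(f for f, stops in stopwords.items() if term not in stops)
        if fields:
            clauses.append((term, fields))
    return clauses


for term, fields in build_clauses("oscar and wilde", STOPWORDS):
    print(term, fields)

# Adding "and" to the French list removes the one-field clause entirely.
shared = {field: stops | {"and"} for field, stops in STOPWORDS.items()}
print(len(build_clauses("oscar and wilde", STOPWORDS)))  # 3
print(len(build_clauses("oscar and wilde", shared)))     # 2
```

In the first case the clause for "and" exists only for headline_fr, which is
the clause an English document cannot satisfy.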

This implies that the only solution is to have a minimal, shared stopwords
list for all languages I want to support. Is this correct, or is there a
way of supporting this kind of searching with per-language stopword lists?

Thanks for any ideas!

Tom


RE: eDisMax, multiple language support and stopwords

2013-11-07 Thread Markus Jelsma
This is an ancient problem. The issue here is your mm parameter: it gets
confused because a different number of tokens is filtered/emitted for each
field, so it is never going to work just like this. The easiest option is not
to use the stop filter.

http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
https://issues.apache.org/jira/browse/SOLR-3085
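To illustrate why the clause count matters, here is a rough sketch of the
minimum-match check. The clause contents are copied from the first parsed query
in the thread; the matches helper and the sample document are hypothetical and
much simpler than what Lucene really does (no scoring, no real analysis).

```python
def matches(doc, clauses, min_should_match):
    """Check whether a document satisfies at least min_should_match clauses.

    doc maps a field name to its set of indexed tokens; each clause maps a
    field name to the term searched in that field (one DisjunctionMaxQuery).
    """
    satisfied = sum(
        any(term in doc.get(field, set()) for field, term in clause.items())
        for clause in clauses
    )
    return satisfied >= min_should_match


# Clauses from the first parsed query (~3).
three_clauses = [
    {"headline_fr": "osca", "headline_en": "oscar"},
    {"headline_fr": "and"},
    {"headline_fr": "wild", "headline_en": "wild"},
]
# Hypothetical English document: only headline_en is populated, and the
# English stop filter removed "and" at index time.
english_doc = {"headline_en": {"oscar", "wild"}}

print(matches(english_doc, three_clauses, 3))  # False

# Without the one-field "and" clause, the same document matches (~2).
two_clauses = [three_clauses[0], three_clauses[2]]
print(matches(english_doc, two_clauses, 2))  # True
```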
 


Re: eDisMax, multiple language support and stopwords

2013-11-07 Thread Tom Mortimer
Ah, thanks Markus. I think I'll just add the Boolean operators to the
stopwords list in that case.
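A small sketch of that workaround, assuming the per-language stopword lists are
held as sets keyed by language code. The operator words, the sample lists and
the with_operators helper are all illustrative assumptions, not part of Solr.

```python
# Lowercase operator words that should be dropped in every language (assumed).
BOOLEAN_OPERATORS = {"and", "or", "not"}


def with_operators(stopword_lists, operators=BOOLEAN_OPERATORS):
    """Return copies of the per-language lists with the operator words added."""
    return {
        lang: set(words) | set(operators)
        for lang, words in stopword_lists.items()
    }


# Hypothetical, shortened per-language lists.
lists = {"en": {"and", "the"}, "fr": {"et", "le"}}
merged = with_operators(lists)

# Every language now filters the operator words, so no query term built from
# them is kept by one field and dropped by another.
print(all(BOOLEAN_OPERATORS <= words for words in merged.values()))  # True
print(sorted(merged["fr"]))  # ['and', 'et', 'le', 'not', 'or']
```

Other language-specific stopwords (for example "the" or "le") would still be
filtered by only one field, so the mismatch can recur for those terms.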

Tom


