Re: Application of different stemmers / stopword lists within a single field

Manuel Le Normand Mon, 28 Apr 2014 02:29:26 -0700

Why not take advantage of a property of your use case: the characters of
the two languages belong to different character classes.


You can index this content into a single Solr field (no copyField) and apply
an analysis chain that includes the analysis for both languages: stopword
filters, stemmers, etc.
As every filter should apply only to its own language (e.g. an Arabic
stemmer should not stem a Latin-script word), you can run cross-language
searches on this single field.
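
To illustrate the idea outside Solr, here is a minimal Python sketch. It is not Solr code: the stopword lists, the toy stemmer and the names script_of / analyze are all made up for the example, and Arabic strings are written as escapes to avoid mixed-direction text. It only shows why one chain can serve both languages when each step touches tokens of its own script and passes everything else through. In Solr the equivalent would be stacking the English and Arabic filter factories in one fieldType, and you would want to verify on your own data that each filter really leaves the other script alone.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword lists (real ones would be files).
EN_STOPWORDS = {"the", "of", "and", "in"}
AR_STOPWORDS = {
    "\u0641\u064a",        # fi  ("in")
    "\u0645\u0646",        # min ("from")
    "\u0639\u0644\u0649",  # ala ("on")
}


def script_of(token):
    """Classify a token as 'arabic', 'latin' or 'other' by its first letter."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "arabic"
            if name.startswith("LATIN"):
                return "latin"
            return "other"
    return "other"


def toy_english_stem(token):
    """Toy stand-in for a stemmer: strips one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run one chain over mixed text; each step acts on its own script only."""
    out = []
    for tok in re.findall(r"\w+", text.lower()):
        script = script_of(tok)
        if script == "latin":
            if tok in EN_STOPWORDS:
                continue
            out.append(toy_english_stem(tok))
        elif script == "arabic":
            if tok in AR_STOPWORDS:
                continue
            out.append(tok)  # an Arabic stemmer would be applied here
        else:
            out.append(tok)  # digits etc. pass through untouched
    return out
```

Calling analyze() on a sentence that mixes both scripts returns one token stream in which English stopwords and Arabic stopwords are both gone, and only the Latin-script tokens have been through the English stemmer.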


On Mon, Apr 28, 2014 at 5:59 AM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:

> If you can throw money at the problem:
> http://www.basistech.com/text-analytics/rosette/language-identifier/ .
> Language Boundary Locator at the bottom of the page seems to be
> part/all of your solution.
>
> Otherwise, specifically for English and Arabic, you could play with
> Unicode ranges to try detecting text blocks:
> 1) Create an UpdateRequestProcessor chain that
> a) clones text into field_EN and field_AR.
> b) applies regular expression transformations that strip the English or
> Arabic Unicode text range respectively, so field_EN only has
> English characters left, etc. Of course, you need to decide what you
> want to do with occasional EN or neutral characters occurring in the
> middle of Arabic text (numbers: Arabic or Indic? brackets, dashes,
> etc). But if you just index text, it might be ok even if it is not
> perfect.
> c) deletes empty fields, in case not all documents contain mixed-language
> text
> 2) Use eDismax to search over both fields, each with its own processor.
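
The clone / strip / delete-empty steps quoted above can be sketched in plain Python as follows. This is only an illustration of the logic, not an UpdateRequestProcessor: the function name split_by_script and the choice of ranges are assumptions (the basic Arabic block U+0600-U+06FF for Arabic, ASCII letters for English), and neutral characters such as digits and punctuation are simply kept in both fields.

```python
import re

# Assumption: U+0600-U+06FF stands in for "Arabic text" and ASCII letters
# for "English text". Digits and punctuation are left in both fields.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+")
LATIN_RUN = re.compile(r"[A-Za-z]+")
ANY_LETTER = re.compile(r"[^\W\d_]")  # matches any Unicode letter


def split_by_script(doc, source="text"):
    """Return a copy of doc with field_EN / field_AR derived from source."""
    text = doc.get(source, "")
    # a) clone the text, b) strip the other language's range from each copy
    candidates = {
        "field_EN": ARABIC_RUN.sub(" ", text),
        "field_AR": LATIN_RUN.sub(" ", text),
    }
    out = dict(doc)
    for name, value in candidates.items():
        value = " ".join(value.split())  # collapse leftover whitespace
        # c) drop fields with no letters left (empty or only digits/symbols)
        if ANY_LETTER.search(value):
            out[name] = value
    return out
```

A document containing only English text comes back with field_EN and no field_AR, so step 2 can query both fields without caring which ones exist on a given document.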
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <timothy.d.h...@gmail.com>
> wrote:
> > This may not be a practically solvable problem, but the company I work for
> > has a large number of lengthy mixed-language documents - for example,
> > scholarly articles about Islam written in English but containing lengthy
> > passages of Arabic. Ideally, we would like users to be able to search both
> > the English and Arabic portions of the text, using the full complement of
> > language-processing tools such as stemming and stopword removal.
> >
> > The problem, of course, is that these two languages co-occur in the same
> > field. Is there any way to apply different processing to different words or
> > paragraphs within a single field through language detection? Is this to all
> > intents and purposes impossible within Solr? Or is another approach (using
> > language detection to split the single large field into
> > language-differentiated smaller fields, for example) possible/recommended?
> >
> > Thanks,
> >
> > Tim Hill
>
