Yes, each token could have a LanguageAttribute on it, just like
ScriptAttributes. I didn't *think* a span would be necessary.
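
To make the per-token idea concrete, here is a minimal sketch in plain Python
rather than Lucene code. The Token class, its "lang" field, and toy_stem are
all hypothetical stand-ins for a real LanguageAttribute and real stemmers; the
only point is that each token carries its own tag, so a downstream stemmer can
dispatch on it without any span bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lang: str  # hypothetical per-token language tag, e.g. "eng" or "deu"


def toy_stem(token):
    """Toy stand-in for a real stemmer: dispatch on the token's own tag."""
    if token.lang == "eng" and token.text.endswith("s"):
        return token.text[:-1]
    if token.lang == "deu" and token.text.endswith("en"):
        return token.text[:-2]
    return token.text


# A mixed-language stream: no span is needed, each token knows its language.
tokens = [Token("dogs", "eng"), Token("die", "deu"), Token("katzen", "deu")]
stems = [toy_stem(t) for t in tokens]
```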

I would also add a multivalued "lang" field to the document. Searching
English documents for "die" might look like: "q=die&lang=eng". The "lang"
param could tell the RequestHandler to add a filter query "fq=lang:eng" to
constrain the search to the English corpus, as well as recruit an English
analyzer when tokenizing the "die" query term.
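
Roughly, the parameter handling I have in mind looks like the sketch below.
It is plain Python, not actual RequestHandler code: apply_lang_param is a
hypothetical helper, and the caller is assumed to use the returned language
list to pick an analyzer for the query terms.

```python
from urllib.parse import parse_qs


def apply_lang_param(query_string):
    """Pull 'lang' out of the request and turn it into a filter query."""
    params = parse_qs(query_string)
    langs = params.pop("lang", [])
    if len(langs) == 1:
        params.setdefault("fq", []).append("lang:" + langs[0])
    elif langs:
        # Several languages requested: OR them inside one filter query.
        params.setdefault("fq", []).append("lang:(" + " OR ".join(langs) + ")")
    # The caller would use 'langs' to choose the query-time analyzer.
    return params, langs


params, langs = apply_lang_param("q=die&lang=eng")
```

Here params ends up holding the original q plus an fq of "lang:eng", and
langs is ["eng"].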

Since I can't control text length, I would just let the language detection
tool do its best and not sweat it.


On Wed, Aug 6, 2014 at 12:11 AM, TK <kuros...@sonic.net> wrote:

>
> On 8/5/14, 8:36 AM, Rich Cariens wrote:
>
>> Of course this is extremely primitive and basic, but I think it would be
>> possible to write a CharFilter or TokenFilter that inspects the entire
>> TokenStream to guess the language(s), perhaps even noting where languages
>> change. Language and position information could be tracked, the TokenStream
>> rewound and then Tokens emitted with "LanguageAttributes" for downstream
>> Token stemmers to deal with.
>>
> I'm curious how you are planning to handle the LanguageAttribute.
> Would each token have this attribute, denoting a span of Tokens
> with a language? But then how would you search
> English documents that include the term "die" while skipping
> all the German documents, which are most likely to contain "die"?
>
> Automatic language detection works OK for long text with
> ordinary content, but it doesn't work well with short
> text. What strategy would you use to deal with short text?
>
> --
> TK
>
>
