Why shouldn't lang-id component work at query-time?

2013-07-07 Thread adfel70
Hi,
I'm trying to integrate Solr's lang-id component into my Solr environment.
In my scenario, I have documents in many different languages. I want to
index them in the same Solr collection, in different fields, and apply a
language-specific analyzer to each field according to its language.

So far lang-id component does exactly what I need.

The problem is that in all the recipes I've read, at query time I eventually
have to indicate which language I'm querying,
either by specifying the field I want to search:
/solr/collection/select?q=text_it:abc abc
or by creating a language-specific request handler, which I would have to use
like this:
/solr/collection/selectIT?q=text:abc abc

Either way, I must tell Solr the language, which in my case (a web client
plus many different languages) is quite problematic.

I was wondering why the lang-id component shouldn't provide full support for
indexing and querying in multiple languages, with the language transparent to
the client both at index time and at query time.
This could be achieved by applying the same language-detection tool at query
time.

Any insights?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-shouldn-t-lang-id-component-work-at-query-time-tp4076057.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why shouldn't lang-id component work at query-time?

2013-07-07 Thread Jack Krupansky
The problem at query time is simple: a typical query has too few terms to 
reliably identify the language using statistical techniques, especially for 
a language like English which is famous for borrowing words from other 
languages. I mean, is raison d'être REALLY French anymore? Or, are 
sombrero or poncho or mañana really strictly Spanish anymore?
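
To make the "too few terms" point concrete, here is a rough sketch. It assumes a detector that scores character trigrams, which is a common statistical approach; the function name and the sample sentence are made up for illustration, and this is not the lang-id component's actual code.

```python
def char_trigrams(text):
    """Return the overlapping character trigrams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# A typical one-word query versus a sentence-length piece of document text.
query = "poncho"
sentence = "I bought a poncho and a sombrero at the market yesterday"

query_evidence = char_trigrams(query)
sentence_evidence = char_trigrams(sentence)

# The query gives the detector only a handful of trigrams to score.
print(len(query_evidence), "trigrams from the query")
print(len(sentence_evidence), "trigrams from the sentence")
```

With so few trigrams, a single borrowed word can dominate whatever score the detector computes, whereas a full document field averages that noise out.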


Multi-lingual support is an art/craft; don't expect cookbook answers that 
will apply to all apps in all environments.


That said, edismax searching across multiple fields, one for each language, is
probably the best you're going to do without doing something
super-sophisticated.
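
A minimal sketch of what such a request could look like from the client side. It assumes one field per language named text_en, text_it, text_fr and text_es (only text_it appears in this thread; the others, the host and the port are hypothetical), and uses the standard defType, q and qf request parameters.

```python
from urllib.parse import urlencode

# Hypothetical per-language fields; only text_it is taken from the thread.
LANGUAGE_FIELDS = ["text_en", "text_it", "text_fr", "text_es"]


def build_edismax_url(base_url, user_query, fields=LANGUAGE_FIELDS):
    """Build a select URL that searches every language field with edismax."""
    params = {
        "defType": "edismax",     # use the extended dismax query parser
        "q": user_query,          # raw user input, no field prefix needed
        "qf": " ".join(fields),   # query fields: one per language
    }
    return f"{base_url}?{urlencode(params)}"


# Assumed local Solr URL, for illustration only.
url = build_edismax_url("http://localhost:8983/solr/collection/select", "abc abc")
print(url)
```

The client never names a language: each field's own analyzer processes the query terms, and documents match through whichever language field fits.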


-- Jack Krupansky




Re: Why shouldn't lang-id component work at query-time?

2013-07-07 Thread Walter Underwood
Proper nouns are the worst for language ID. What language is Laserjet or 
Obama?  --wunder


--
Walter Underwood
wun...@wunderwood.org





Re: Why shouldn't lang-id component work at query-time?

2013-07-07 Thread adfel70
Well, yes, the problem is indeed simple.

Regarding the approach you're suggesting: if I query on multiple fields, one
field per language, why should it matter whether I use edismax searching
or the default Lucene searching?








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-shouldn-t-lang-id-component-work-at-query-time-tp4076057p4076062.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Why shouldn't lang-id component work at query-time?

2013-07-07 Thread Jack Krupansky
The default Lucene/Solr query parser doesn't support qf or a list of fields to
search, so you can't use that technique there.
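
As a rough illustration of the difference: without qf, the client would have to name every field explicitly in the query string itself. The helper below is a hypothetical sketch of that expansion (the field names other than text_it are assumed, and it does no escaping of query-syntax characters).

```python
def expand_for_lucene_parser(user_terms, fields):
    """Build a query string that repeats the user's terms under every field."""
    joined_terms = " ".join(user_terms)
    # One parenthesized clause per language field, e.g. text_it:(abc abc)
    clauses = [f"{field}:({joined_terms})" for field in fields]
    return " OR ".join(clauses)


# Hypothetical field list; only text_it appears in this thread.
fields = ["text_en", "text_it"]
q = expand_for_lucene_parser(["abc", "abc"], fields)
print(q)
```

With edismax, the same field list goes into qf once and the user's text is passed through in q unchanged.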


-- Jack Krupansky
