Re: Step by step tutorial for multi-language indexing and search

2010-10-27 Thread Lance Norskog
Yes, you can declare each field with the Spanish, French, etc. types.
The *_t and similar suffixes are dynamic fields and don't have to be
declared. This feature is generally used when you have hundreds or
thousands of fields. It is clearer to declare your fields.
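
A minimal sketch of what declaring the fields explicitly in schema.xml could
look like. The type names text_es and text_fr and the choice of filters are
assumptions made for illustration, not something taken from this thread; the
field names follow the name_xx/answer_xx pattern used in the thread.

```xml
<!-- Sketch only: these fragments go inside <schema>; the type names
     text_es / text_fr are made up for this example. -->
<types>
  <!-- Spanish: tokenize, lowercase, then stem with the Snowball Spanish stemmer -->
  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
  </fieldType>
  <!-- French: same chain with the French stemmer -->
  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- One explicitly declared field per language -->
  <field name="name_es"   type="text_es" indexed="true" stored="true"/>
  <field name="answer_es" type="text_es" indexed="true" stored="true"/>
  <field name="name_fr"   type="text_fr" indexed="true" stored="true"/>
  <field name="answer_fr" type="text_fr" indexed="true" stored="true"/>
</fields>
```

Each declared field then gets the analysis chain of its own language, which a
single generic *_t dynamic field cannot provide.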


You're right: that error should not be thrown. You are not asking for a
sort.
I don't know that one. You could try starting over with the Solr 1.4.1
release binaries.



Re: Step by step tutorial for multi-language indexing and search

2010-10-24 Thread Jakub Godawa
Hi Erick, thanks for your help!

I need some technical help though... let me put it this way:

1. I deleted everything in the index with:
curl http://localhost:8983/solr/update -F stream.body='<delete><query>*:*</query></delete>'
curl http://localhost:8983/solr/update -F stream.body='<commit />'

2. I created 2 documents with fields: name_en, answer_en, name_es, answer_es
3. I made a query through the admin page, with this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Jakub
</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="answer_en_t"><str>My name is Jakub</str></arr>
      <arr name="answer_es_t"><str>Me llamo Jakub.</str></arr>
      <arr name="id"><str>Question:1</str></arr>
      <arr name="name_en_t"><str>What is your name?</str></arr>
      <arr name="name_es_t"><str>Como te llamas?</str></arr>
      <arr name="pk_s"><str>1</str></arr>
      <arr name="spell">
        <str>What is your name?</str>
        <str>My name is Jakub</str>
        <str>Como te llamas?</str>
        <str>Me llamo Jakub.</str>
      </arr>
    </doc>
    <doc>
      <arr name="answer_en_t"><str>I am in the kitchen Jakub!</str></arr>
      <arr name="answer_es_t"><str>Estoy en la cocina.</str></arr>
      <arr name="id"><str>Question:2</str></arr>
      <arr name="name_en_t"><str>Where are you?</str></arr>
      <arr name="name_es_t"><str>Donde estas?</str></arr>
      <arr name="pk_s"><str>2</str></arr>
      <arr name="spell">
        <str>Where are you?</str>
        <str>I am in the kitchen Jakub!</str>
        <str>Donde estas?</str>
        <str>Estoy en la cocina.</str>
      </arr>
    </doc>
  </result>
</response>

4. Now I needed two dismaxes to make it work in two separate languages. Let's
say I just want to look up in *_en fields, so I created a dismax:

<requestHandler name="/English" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name_en_t^0.5 answer_en_t^1.0
    </str>
  </lst>
</requestHandler>


5. Hitting the url: http://localhost:8982/solr/English/?q=Jakub gave me an
error:

there are more terms than documents in field name_en_t, but it's
impossible to sort on tokenized fields

6. I know that I should create a separate dismax for Spanish.

My questions:
1. Why are those fields named with *_t? I saw in schema.xml that they are
created dynamically. Can/should I create my own predefined fields in
schema.xml? Is this the place where you specify HOW the field should be
interpreted by the indexer?
2. Why is the error in no. 5 being thrown? I know that you cannot sort
on tokenized fields, but I don't see myself trying to index or tokenize
anything.
3. How should it be changed to work properly?

Thank you, and I ask for patience, as this can help many rookies like me to
get started.
Jakub.


Re: Step by step tutorial for multi-language indexing and search

2010-10-20 Thread Dennis Gearon
There's approximately a 100% chance that you are going to go through a
server-side language (PHP, Ruby, Perl, Java, VB/ASP/.NET [cough, cough])
before you get to Solr/Lucene. I'd recommend it anyway.

This code should look at the user's browser locale (en_US, pl_PL, es_CO,
etc.). The server-side language would then choose which language to search
by and display.

NOW, that being said, are you going to have the exact same content for all
languages, just translated? The temptation would be to translate to a common
language like English, then do the search, then get the translation. I
wouldn't recommend it, but I'm no expert. Translation of single words can be
OK, but multiword ideas and especially sentences don't work so well that way.

You will probably have separate content for that reason, AND another.
Different cultures are interested in different things and only have common
ground on certain things like international news (but with different
opinions) and medical news. So different content for different cultures
speaking different languages.

Are you trying to address different languages in some place like the US or
Great Britain, with LOTS of different languages spoken in minority cultures?
Only then would you want a geographically centered server and information
gathering organization. If you were going to have search for other
countries, then I'd recommend those resources be geographically close to
their source culture.
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Wed, 10/20/10, Jakub Godawa jakub.god...@gmail.com wrote:

 From: Jakub Godawa jakub.god...@gmail.com
 Subject: Step by step tutorial for multi-language indexing and search
 To: solr-user@lucene.apache.org
 Date: Wednesday, October 20, 2010, 6:03 AM
 Hi everyone! (my first post)

 I am new, but really curious about the usefulness of Lucene/Solr for
 document search from web applications. I use Ruby on Rails to create one,
 with the acts_as_solr_reloaded plugin, which makes the connection between
 the web app and Solr easy.

 So I am at the point where I know that a good solution is to prepare
 multi-language documents with fields like:
 question_en, answer_en,
 question_fr, answer_fr,
 question_pl, answer_pl... etc.

 I need to create an index that would work with 6 languages: English,
 French, German, Russian, Ukrainian and Polish.

 My questions are:
 1. Is it doable to have just one search field that behaves like Google's
 for all those documents? Indicating a language to search could be an
 option.
 2. How should I begin changing the solr/conf/schema.xml (or other) file to
 tailor it to my needs? As I am a real rookie here, I am still a bit
 confused about fields, fieldTypes and their connection with a particular
 field (e.g. answer_fr) and the tokenizers and analyzers. If someone can
 provide a basic step-by-step tutorial on how to make it work in two
 languages I would be more than happy.
 3. Are all those languages supported (officially/unofficially) by
 Lucene/Solr?

 Thank you for help,
 Jakub Godawa.



Re: Step by step tutorial for multi-language indexing and search

2010-10-20 Thread Jakub Godawa
2010/10/20 Dennis Gearon gear...@sbcglobal.net

 There's approximately a 100% chance that you are going to go through a
 server-side language (PHP, Ruby, Perl, Java, VB/ASP/.NET [cough, cough])
 before you get to Solr/Lucene. I'd recommend it anyway.


I use a server side language (Ruby) as I build the web application.


 This code should look at the user's browser locale (en_US, pl_PL,
 es_CO, etc.). The server-side language would then choose which language to
 search by and display.


As I said, I may provide locale as an addition to the search query.


 NOW, that being said, are you going to have the exact same content for all
 languages, just translated? The temptation would be to translate to a common
 language like English, then do the search, then get the translation. I
 wouldn't recommend it, but I'm no expert. Translation of single words can be
 OK, but multiword ideas and especially sentences don't work so well that
 way.


I would like not to yield to that temptation. I know that Solr/Lucene can
work with many languages, and I think it has a purpose, such as the
languages' semantic diversity. What's more, you often don't translate things
literally even if they are just translations.


 You will probably have separate content for that reason, AND another.
 Different cultures are interested in different things and only have common
 ground on certain things like international news (but with different
 opinions) and medical news. So different content for different cultures
 speaking different languages.


I need to treat each culture separately regarding the subject of the query.


 Are you trying to address different languages in some place like the US or
 Great Britain, with LOTS of different languages spoken in minority cultures?
 Only then would you want a geographically centered server and information
 gathering organization. If you were going to have search for other
 countries, then I'd recommend those resources be geographically close to
 their source culture.


No, I am not trying to address minority cultures.

Thanks for answer,
Jakub Godawa.




Re: Step by step tutorial for multi-language indexing and search

2010-10-20 Thread Pradeep Singh
Here's what I would do -

Search all the fields every time, regardless of language. Use one handler and
specify all of these in qf and pf.
question_en, answer_en,
question_fr, answer_fr,
question_pl,  answer_pl

Individual field based analyzers will take care of appropriate tokenization
and you will get a match across all languages.

Even with this setup if you wanted you could also have a separate field
called language and use a fq to limit searches to that language only.
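
A rough sketch of such a single handler in solrconfig.xml. The handler name
/multilang is an assumption made for this example, and no boosts are set;
only the three language pairs listed above are shown.

```xml
<!-- Sketch only: the handler name /multilang is made up for this example. -->
<requestHandler name="/multilang" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- qf: the fields each query term is searched in -->
    <str name="qf">
      question_en answer_en
      question_fr answer_fr
      question_pl answer_pl
    </str>
    <!-- pf: fields where the whole query matching as a phrase boosts the score -->
    <str name="pf">
      question_en answer_en
      question_fr answer_fr
      question_pl answer_pl
    </str>
  </lst>
</requestHandler>
<!-- To limit a search to one language, a request could add a filter query
     such as fq=language:fr, assuming every document carries a "language"
     field holding a code like "fr". -->
```

Because the filter query is passed per request, the same handler can serve
both "search everything" and "search one language" without further
configuration.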

-Pradeep




Re: Step by step tutorial for multi-language indexing and search

2010-10-20 Thread Erick Erickson
See below:

But also search the archives for multilanguage; this topic has been
discussed many times before. Lucid Imagination maintains a Solr-powered
(of course) searchable list at: http://www.lucidimagination.com/search/

On Wed, Oct 20, 2010 at 9:03 AM, Jakub Godawa jakub.god...@gmail.com wrote:

 My questions are:
 1. Is it doable to have just one search field that behaves like Google's
 for all those documents? It can be an option to indicate a language to
 search.


This depends on what you mean by do-able. Are you going to allow a French
user to search an English document ( etc)? But the real answer is yes, you
can if you .. There'll be tradeoffs.

Take a look at the dismax handler. It's kind of hard to grok all at once,
but you can cause it to search across multiple fields. That is, the user
types "language", and you can turn it into a complex query under the covers
like lang_en:language lang_fr:language lang_ru:language, etc. You can also
apply boosts. Note that this has obvious problems with, say, Russian. Half
your job will be figuring out what will satisfy the user.

You could also have a *different* dismax handler defined for various
languages. Say the user was coming from Spanish. Consider a browseES
handler. See solrconfig.xml for the default dismax handler. The Solr book
mentioned above describes this.
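
As a rough illustration, a per-language handler along these lines could be
added to solrconfig.xml. The field names question_es and answer_es and the
boost values are assumptions made for this sketch; only the browseES name
comes from the suggestion above.

```xml
<!-- Sketch only: field names and boosts are assumptions for illustration. -->
<requestHandler name="/browseES" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <!-- Search only the Spanish fields; weight answers above questions -->
    <str name="qf">question_es^0.5 answer_es^1.0</str>
  </lst>
</requestHandler>
```

The application would then pick the handler that matches the user's language
and send the query there, with one such handler defined per language.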


 2. How should I begin changing the solr/conf/schema.xml (or other) file to
 tailor it to my needs? As I am a real rookie here, I am still a bit
 confused about fields, fieldTypes and their connection with a particular
 field (e.g. answer_fr) and the tokenizers and analyzers. If someone can
 provide a basic step-by-step tutorial on how to make it work in two
 languages I would be more than happy.


You have several choices here:
 - The books Lucene in Action and Solr 1.4 Enterprise Search Server both
   have discussions here.
 - Spend some time on the solr/admin/analysis page. That page allows you to
   see pretty much exactly what each of the steps in an analyzer chain
   accomplishes.


 3. Are all those languages supported (officially/unofficially) by
 Lucene/Solr?


See:
http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/Analyzer.html
Remember that Solr is built on Lucene, so these analyzers are available.
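
A minimal sketch of how a Lucene analyzer can be referenced directly from
schema.xml. The type names text_de and text_ru are made up for this example,
and it assumes the named analyzer classes are on Solr's classpath and can be
constructed without arguments in the Lucene version in use.

```xml
<!-- Sketch only: type names are hypothetical; check that the analyzer
     classes exist in your Lucene version before relying on this. -->
<fieldType name="text_de" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer"/>
</fieldType>
<fieldType name="text_ru" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
</fieldType>
```

The solr/admin/analysis page can then be used to check what each type
actually does to sample text in that language.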



 Thank you for help,
 Jakub Godawa.


Best
Erick