Re: Step by step tutorial for multi-language indexing and search
Yes, you can declare each field with the Spanish, French, etc. field types.

The *_t and other suffixed fields are dynamic fields and don't have to be declared individually. This feature is generally used when you have hundreds or thousands of fields. It is clearer to declare your fields explicitly.

You're right: that error should not be thrown, because you are not asking for a sort. I don't know what causes that one. You could try starting over with the Solr 1.4.1 release binaries.

Jakub Godawa wrote:

> 5. Hitting the URL http://localhost:8982/solr/English/?q=Jakub gave me an error:
> there are more terms than documents in field name_en_t, but it's impossible to sort on tokenized fields
>
> My questions:
> 1. Why are those fields named with *_t? I saw in schema.xml that they are created dynamically. Can/should I create my own predefined fields in schema.xml? Is this the place where you define HOW the field should be interpreted by the indexer?
> 2. Why is the error in no. 5 being thrown? I know that you cannot sort on tokenized fields, but I don't see myself trying to index or tokenize anything.
> 3. How should it be changed to work properly?
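As an illustration of declaring fields explicitly instead of relying on the dynamic *_t fields, here is a minimal schema.xml sketch. The type names text_en and text_es are made up for this example, and the analyzer chains are an assumption built from factories that ship with Solr 1.4; they are not taken from anyone's actual schema in this thread. Whether a client library such as acts_as_solr_reloaded will send these field names rather than the *_t ones is something to check separately.

```xml
<!-- Goes inside <types> in schema.xml. Type names are illustrative only. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- English stemming -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Spanish stemming -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>

<!-- Goes inside <fields>. Each field is tied to the type for its language. -->
<field name="name_en"   type="text_en" indexed="true" stored="true"/>
<field name="answer_en" type="text_en" indexed="true" stored="true"/>
<field name="name_es"   type="text_es" indexed="true" stored="true"/>
<field name="answer_es" type="text_es" indexed="true" stored="true"/>
```

The field type is where the analysis (tokenizing, lowercasing, stemming) is defined; the field declaration only says which type a given field uses. Reindex after changing the schema, since analysis is applied at index time as well as at query time.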
Re: Step by step tutorial for multi-language indexing and search
Hi Erick, thanks for your help! I need some technical help though. Let me put it this way:

1. I deleted everything in the index with:

    curl http://localhost:8983/solr/update -F stream.body=' <delete><query>*:*</query></delete>'
    curl http://localhost:8983/solr/update -F stream.body=' <commit />'

2. I created 2 documents with fields: name_en, answer_en, name_es, answer_es

3. I made a query through the admin page, with this response:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">9</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">Jakub</str>
          <str name="version">2.2</str>
          <str name="rows">10</str>
        </lst>
      </lst>
      <result name="response" numFound="2" start="0">
        <doc>
          <arr name="answer_en_t"><str>My name is Jakub</str></arr>
          <arr name="answer_es_t"><str>Me llamo Jakub.</str></arr>
          <arr name="id"><str>Question:1</str></arr>
          <arr name="name_en_t"><str>What is your name?</str></arr>
          <arr name="name_es_t"><str>Como te llamas?</str></arr>
          <arr name="pk_s"><str>1</str></arr>
          <arr name="spell">
            <str>What is your name?</str>
            <str>My name is Jakub</str>
            <str>Como te llamas?</str>
            <str>Me llamo Jakub.</str>
          </arr>
        </doc>
        <doc>
          <arr name="answer_en_t"><str>I am in the kitchen Jakub!</str></arr>
          <arr name="answer_es_t"><str>Estoy en la cocina.</str></arr>
          <arr name="id"><str>Question:2</str></arr>
          <arr name="name_en_t"><str>Where are you?</str></arr>
          <arr name="name_es_t"><str>Donde estas?</str></arr>
          <arr name="pk_s"><str>2</str></arr>
          <arr name="spell">
            <str>Where are you?</str>
            <str>I am in the kitchen Jakub!</str>
            <str>Donde estas?</str>
            <str>Estoy en la cocina.</str>
          </arr>
        </doc>
      </result>
    </response>

4. Now I needed two dismaxes to make it work in two separate languages. Let's say I just want to look up in the *_en fields, so I created a dismax handler:

    <requestHandler name="/English" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">name_en_t^0.5 answer_en_t^1.0</str>
      </lst>
    </requestHandler>

5. Hitting the URL http://localhost:8982/solr/English/?q=Jakub gave me an error:

    there are more terms than documents in field name_en_t, but it's impossible to sort on tokenized fields

6. I know that I should create a separate dismax for Spanish.

My questions:

1. Why are those fields named with *_t? I saw in schema.xml that they are created dynamically. Can/should I create my own predefined fields in schema.xml? Is this the place where you define HOW the field should be interpreted by the indexer?
2. Why is the error in no. 5 being thrown? I know that you cannot sort on tokenized fields, but I don't see myself trying to index or tokenize anything.
3. How should it be changed to work properly?

Thank you, and I ask for patience, as this can help many rookies like me get started.

Jakub.
Re: Step by step tutorial for multi-language indexing and search
There's approximately a 100% chance that you are going to go through a server-side language (PHP, Ruby, Perl, Java, VB/ASP/.NET [cough, cough]) before you get to Solr/Lucene. I'd recommend it anyway. That code should look at the user's browser locale (en_US, pl_PL, es_CO, etc.). The server-side language would then choose which language to search by and display.

NOW, that being said, are you going to have the exact same content for all languages, just translated? The temptation would be to translate to a common language like English, then do the search, then get the translation. I wouldn't recommend it, but I'm no expert. Translation of single words can be OK, but multiword ideas and especially sentences don't work so well that way.

You will probably have separate content for that reason, AND another: different cultures are interested in different things and only have common ground on certain things like international news (but with different opinions) and medical news. So, different content for different cultures speaking different languages.

Are you trying to address different languages in some place like the US or Great Britain, with LOTS of different languages spoken in minority cultures? Only then would you want a geographically centered server and information gathering organization. If you were going to have search for other countries, then I'd recommend those resources be geographically close to their source culture.

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'

EARTH has a Right To Life, otherwise we all die.

--- On Wed, 10/20/10, Jakub Godawa jakub.god...@gmail.com wrote:

From: Jakub Godawa jakub.god...@gmail.com
Subject: Step by step tutorial for multi-language indexing and search
To: solr-user@lucene.apache.org
Date: Wednesday, October 20, 2010, 6:03 AM

Hi everyone! (my first post)

I am new, but really curious about the usefulness of Lucene/Solr for document search in web applications. I use Ruby on Rails to create one, with the plugin acts_as_solr_reloaded, which makes the connection between the web app and Solr easy.

So I am at the point where I know that a good solution is to prepare multi-language documents with fields like: question_en, answer_en, question_fr, answer_fr, question_pl, answer_pl... etc.

I need to create an index that would work with 6 languages: English, French, German, Russian, Ukrainian and Polish.

My questions are:

1. Is it doable to have just one search field that behaves like Google's for all those documents? Indicating a language to search can be an option.

2. How should I begin changing the solr/conf/schema.xml (or other) file to tailor it to my needs? As I am a real rookie here, I am still a bit confused about fields, fieldTypes and their connection with a particular field (e.g. answer_fr), and about the tokenizers and analyzers. If someone can provide a basic step-by-step tutorial on how to make it work in two languages, I would be more than happy.

3. Are all those languages supported (officially/unofficially) by Lucene/Solr?

Thank you for your help,
Jakub Godawa.
Re: Step by step tutorial for multi-language indexing and search
2010/10/20 Dennis Gearon gear...@sbcglobal.net

> There's approximately a 100% chance that you are going to go through a server-side language (PHP, Ruby, Perl, Java, VB/ASP/.NET [cough, cough]) before you get to Solr/Lucene. I'd recommend it anyway.

I use a server-side language (Ruby) as I build the web application.

> That code should look at the user's browser locale (en_US, pl_PL, es_CO, etc.). The server-side language would then choose which language to search by and display.

As I said, I may provide the locale as an addition to the search query.

> NOW, that being said, are you going to have the exact same content for all languages, just translated? The temptation would be to translate to a common language like English, then do the search, then get the translation. I wouldn't recommend it, but I'm no expert. Translation of single words can be OK, but multiword ideas and especially sentences don't work so well that way.

I would like not to yield to that temptation. I know that Solr/Lucene can work with many languages, and I think it has a purpose, like the languages' semantic diversity. What's more, you often don't translate things literally even if they are just translations.

> You will probably have separate content for that reason, AND another: different cultures are interested in different things and only have common ground on certain things like international news (but with different opinions) and medical news. So, different content for different cultures speaking different languages.

I need to treat each culture separately regarding the subject of the query.

> Are you trying to address different languages in some place like the US or Great Britain, with LOTS of different languages spoken in minority cultures? Only then would you want a geographically centered server and information gathering organization. If you were going to have search for other countries, then I'd recommend those resources be geographically close to their source culture.

No, I am not trying to address minority cultures.

Thanks for the answer,
Jakub Godawa.
Re: Step by step tutorial for multi-language indexing and search
Here's what I would do: search all the fields every time, regardless of language. Use one handler and specify all of these in qf and pf:

question_en, answer_en, question_fr, answer_fr, question_pl, answer_pl

The individual field-based analyzers will take care of the appropriate tokenization, and you will get a match across all languages.

Even with this setup, if you wanted, you could also have a separate field called language and use an fq to limit searches to that language only.

-Pradeep
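A single handler along the lines Pradeep describes might look like the sketch below. The handler name /AllLanguages is made up for this example, the field list is the one from the message above, and no boosts are applied; treat it as a starting point under those assumptions rather than a tested configuration.

```xml
<!-- Goes in solrconfig.xml. The handler name is hypothetical. -->
<requestHandler name="/AllLanguages" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <!-- qf: fields searched for each query term -->
    <str name="qf">question_en answer_en question_fr answer_fr question_pl answer_pl</str>
    <!-- pf: fields used to boost documents where the terms appear as a phrase -->
    <str name="pf">question_en answer_en question_fr answer_fr question_pl answer_pl</str>
  </lst>
</requestHandler>
```

Assuming each document also carries a language field holding a code such as pl, a request like http://localhost:8983/solr/AllLanguages?q=Jakub&fq=language:pl would restrict results to Polish documents, while the same request without fq would match across all languages.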
Re: Step by step tutorial for multi-language indexing and search
See below. But also search the archives for "multilanguage"; this topic has been discussed many times before. Lucid Imagination maintains a Solr-powered (of course) searchable list at: http://www.lucidimagination.com/search/

On Wed, Oct 20, 2010 at 9:03 AM, Jakub Godawa jakub.god...@gmail.com wrote:

> Hi everyone! (my first post)
>
> I am new, but really curious about the usefulness of Lucene/Solr for document search in web applications. I use Ruby on Rails to create one, with the plugin acts_as_solr_reloaded, which makes the connection between the web app and Solr easy.
>
> So I am at the point where I know that a good solution is to prepare multi-language documents with fields like: question_en, answer_en, question_fr, answer_fr, question_pl, answer_pl... etc.
>
> I need to create an index that would work with 6 languages: English, French, German, Russian, Ukrainian and Polish.
>
> My questions are:
>
> 1. Is it doable to have just one search field that behaves like Google's for all those documents? Indicating a language to search can be an option.

This depends on what you mean by doable. Are you going to allow a French user to search an English document (etc.)? But the real answer is yes, you can if you .. There'll be tradeoffs.

Take a look at the dismax handler. It's kind of hard to grok all at once, but you can cause it to search across multiple fields. That is, the user types "language", and you can turn it into a complex query under the covers like lang_en:language lang_fr:language lang_ru:language, etc. You can also apply boosts. Note that this has obvious problems with, say, Russian. Half your job will be figuring out what will satisfy the user.

You could also have a #different# dismax handler defined for various languages. Say the user was coming from Spanish. Consider a browseES handler. See solrconfig.xml for the default dismax handler. The Solr book mentioned below describes this.

> 2. How should I begin changing the solr/conf/schema.xml (or other) file to tailor it to my needs? As I am a real rookie here, I am still a bit confused about fields, fieldTypes and their connection with a particular field (e.g. answer_fr), and about the tokenizers and analyzers. If someone can provide a basic step-by-step tutorial on how to make it work in two languages, I would be more than happy.

You have several choices here: the books "Lucene in Action" and "Solr 1.4 Enterprise Search Server" both have discussions of this. Spend some time on the solr/admin/analysis page. That page allows you to see pretty much exactly what each of the steps in an analyzer chain accomplishes.

> 3. Are all those languages supported (officially/unofficially) by Lucene/Solr?

See: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/Analyzer.html

Remember that Solr is built on Lucene, so these analyzers are available.

> Thank you for your help,
> Jakub Godawa.

Best
Erick
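For the per-language option, a separate handler for Spanish users could be sketched as follows. It reuses the structure of the /English handler posted elsewhere in this thread; the name /Spanish, the *_es_t field names and the boosts are assumptions carried over from that example, not a recommended configuration.

```xml
<!-- Goes in solrconfig.xml next to any other per-language handlers. -->
<requestHandler name="/Spanish" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <!-- Only the Spanish fields are searched; boosts are illustrative -->
    <str name="qf">name_es_t^0.5 answer_es_t^1.0</str>
  </lst>
</requestHandler>
```

The application would then pick the handler path from the user's language, for example /solr/Spanish/?q=... for a Spanish user, instead of building the multi-field query itself.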