IndexMerge not found
I tried http://wiki.apache.org/solr/MergingSolrIndexes system: win2003, jdk 1.6 Error information: Caused by: java.lang.ClassNotFoundException: org.apache.lucene.misc.IndexMergeTool at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClassInternal(Unknown Source) Could not find the main class: org/apache/lucene/misc/IndexMergeTool. Program will exit. -- regards j.L ( I live in Shanghai, China)
Re: IndexMerge not found
I use lucene-core-2.9-dev.jar and lucene-misc-2.9-dev.jar. On Thu, Jul 2, 2009 at 2:02 PM, James liu liuping.ja...@gmail.com wrote: i try http://wiki.apache.org/solr/MergingSolrIndexes [...] -- regards j.L ( I live in Shanghai, China)
Re: Question on Facet Count
On Wed, Jul 1, 2009 at 10:28 PM, Sumit Aggarwal sumit.kaggar...@gmail.com wrote: Hi Shalin, Sorry for the confusion, but I don't have separate index fields. I have all the information in only one index field, descp. Is what you explained still possible? No, you should separate out the data into multiple fields for this to work. One big field containing everything is OK for full-text search, but you can't build faceted search on that. -- Regards, Shalin Shekhar Mangar.
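A minimal illustration of Shalin's point (the field and type names below are my own placeholders, not from the thread): keep one analyzed field for full-text search, and give each facetable attribute its own untokenized field in schema.xml.

```xml
<!-- sketch for schema.xml: one analyzed field for full-text search,
     plus untokenized "string" fields that faceting can count on -->
<field name="descp"    type="text"   indexed="true" stored="true"/>
<field name="brand"    type="string" indexed="true" stored="true"/>
<field name="category" type="string" indexed="true" stored="true"/>
```

Facet counts would then come from facet.field=brand and facet.field=category, while full-text queries still hit descp.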
1.4 stable release date
Hi, Just wondering if there is a release date for 1.4 stable? Regards, Andrew
Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
I use Solr to search over an index made directly by Lucene (not via EmbeddedSolrServer; the wiki is old). Is that a problem when I use Solr to search? What is the difference between an index made by Lucene and one made by Solr? Thanks -- regards j.L ( I live in Shanghai, China)
Implementing PhraseQuery and MoreLikeThis Query in one app
Hi, Recently I've posted a question regarding using stop words in a PhraseQuery and in a MoreLikeThis query in the same app. I posted it twice. Unfortunately I didn't get any responses. I realize that the question might not have been formulated clearly, so let me reformulate it. Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Or do these two queries need to use two different indexes, and thus have to be implemented in different applications or in different cores of Solr (with different schema.xml files: one with the StopWord Filter and another without it)? Any opinion will be highly appreciated. Thank you. Regards, Sergey Goldberg P.S. Just for the reference, here is my original message. 1. There are 3 kinds of searches in my application: a) PhraseQuery search; b) search for separate words; c) MLT search. The problem I encountered is in the use of a stop words list. If I don't take it into account, the MLT query picks up common words as the most important words, which is not right. And when I use it, the PhraseQuery stops working. I tried it with the ps and qs parameters (ps=100, qs=100) but that didn't change anything. (Both indexed fields are of type text, the StandardAnalyzer is applied, and all docs are in English.) 2. Do I understand it right that the query q=id:1&mlt=true&mlt.fl=content... should bring back documents where the most important words are in the set of those for the doc with id=1? -- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24303817.html Sent from the Solr - User mailing list archive at Nabble.com.
Creating spellchecker dictionary from multiple sources
Hello everybody, dealing with the spell checker component, I'm wondering if it's possible to generate my dictionary index based on multiple indexes/fields, and I also want to know how anyone has solved this problem. Thx -- Lici
Re: Is there any other way to load the index beside using http connection?
On Wed, 1 Jul 2009 15:07:12 -0700 Francis Yakin fya...@liquid.com wrote: We have several thousands of xml files in the database that we load to the solr master. The Database uses an http connection and transfers those files to the solr master. Solr then translates the xml files into its Lucene index. We are experiencing issues with close/open connections in the firewall, and it is very very slow. Is there any other way to load the data/index from the Database to the solr master besides using an http connection - i.e. we just scp/ftp the xml files from the Database system to the solr master and let solr convert those to lucene indexes? Francis, after reading the whole thread, it seems you have: - Data source: Oracle DB, in a separate location from your SOLR. - Data format: XML output. DIH is definitely a great option, but since you are on 1.2 it is not available to you (you should look into upgrading if you can!). Have you tried connecting to SOLR over HTTP from localhost, thereby avoiding any firewall issues and network latency? It should work a LOT faster than from a remote site. Also make sure not to commit until you really need to. Other alternatives are to transform the XML into csv and import it that way, or write a simple app that will parse the xml and post it directly using the embedded solr method. Plenty of options, all of them documented @ solr's site. good luck, b _ {Beto|Norberto|Numard} Meijome People demand freedom of speech to make up for the freedom of thought which they avoid. Soren Aabye Kierkegaard I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
On Thu, 2 Jul 2009 16:12:58 +0800 James liu liuping.ja...@gmail.com wrote: I use solr to search and index is made by lucene. [...] Hi James, make sure the version of Lucene used to create your index is the same as the libraries included in your version of SOLR; then it should work. It may be that an older Lucene index works with the newer Lucene libs provided in Solr, but after using it that way you may not be able to go back - I am not sure of the details. Probably an FAQ by now - check the archives :) good luck, B _ {Beto|Norberto|Numard} Meijome He has no enemies, but is intensely disliked by his friends. Oscar Wilde I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
SergeyG schrieb: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app taking into account the fact that for the former to work the stop words list needs to be included and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig
Adding shards entries in solrconfig.xml
Hi, I read the following article: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr It mentions that it's much easier to set the shards parameter for your SearchHandler in solrconfig.xml. I also went through: http://www.nabble.com/newbie-question-on-SOLR-distributed-searches-with-many-%22shards%22-td20687487.html but it gives only a vague idea about setting the shards, particularly the syntax. Can anyone give an example of setting the shards parameter in solrconfig.xml? Regards, Raakhi
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael, I just saw some of them (words from the stop words list) in the MLT query's response. Sergey
-- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24304705.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old))
Hi, You need to ensure that the index format is compatible (that the same Lucene jars are used in both cases) and that the analysis performed on fields is the same. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James liu liuping.ja...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 4:12:58 AM Subject: Is it problem? I use solr to search and index is made by lucene. (not EmbeddedSolrServer(wiki is old)) [...]
Re: 1.4 stable release date
Hi Andrew, I don't think we have a specific date set. The best way to monitor progress is probably by watching the number of JIRA issues set for 1.4 (Fix For). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Andrew McCombe eupe...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 4:01:21 AM Subject: 1.4 stable release date Hi Just wondering if there is a release date for 1.4 stable? Regards Andrew
Re: IndexMerge not found
Hi, My feeling is those jars are actually not in your CLASSPATH (or in -cp). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: James liu liuping.ja...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:03:19 AM Subject: Re: IndexMerge not found i use lucene-core-2.9-dev.jar, lucene-misc-2.9-dev.jar [...]
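Otis's diagnosis can be checked directly. The sketch below is my own illustration, not from the thread: it reports whether a class is visible on the current classpath. Run it with the same -cp you pass when launching IndexMergeTool, and remember that on Windows the classpath separator is ';' rather than ':'.

```java
// Sketch: report whether a class is visible on the current classpath.
// The default class name is the one from the error in this thread.
public class CheckClass {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean visible(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String cls = args.length > 0 ? args[0]
                : "org.apache.lucene.misc.IndexMergeTool";
        System.out.println(cls + (visible(cls)
                ? " is on the classpath"
                : " is MISSING - check your -cp entries"));
    }
}
```

For example, running java -cp lucene-core-2.9-dev.jar;lucene-misc-2.9-dev.jar;. CheckClass should report the class as present once the misc jar is really on the classpath.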
Re: Is there any other way to load the index beside using http connection?
Francis, I think both of these are on the Solr Wiki. You'll have to figure out how to export from the DB yourself, and you'll probably write a script/tool to read the export and rewrite it in the csv format. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Francis Yakin fya...@liquid.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 12:26:14 AM Subject: RE: Is there any other way to load the index beside using http connection? How do you import the documents as csv data/file from the Oracle Database to the Solr master (they are two different machines)? And do you have the doc for using EmbeddedSolrServer? Thanks Otis! Francis -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Wednesday, July 01, 2009 8:01 PM To: solr-user@lucene.apache.org Subject: Re: Is there any other way to load the index beside using http connection? Francis, There are a number of things you can do to make indexing over HTTP faster. You can also import documents as csv data/file. Finally, you can use EmbeddedSolrServer. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Francis Yakin To: solr-user@lucene.apache.org Sent: Wednesday, July 1, 2009 6:07:12 PM Subject: Is there any other way to load the index beside using http connection? [...]
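As a sketch of the script/tool Otis mentions, the snippet below converts a simple exported XML structure into Solr-style CSV (header row first, which Solr's CSV handler reads as the field names). The element and field names here are hypothetical; adjust them to whatever your DB export actually produces, and note that real data would also need CSV escaping of commas and quotes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class XmlToCsv {

    // Convert exported XML (one <doc> per record) into Solr-style CSV.
    // No CSV escaping is done here - a real tool must quote commas/quotes.
    static String convert(String xml, String[] fields) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        StringBuilder sb = new StringBuilder(String.join(",", fields)).append("\n");
        NodeList docs = d.getElementsByTagName("doc");
        for (int i = 0; i < docs.getLength(); i++) {
            Element doc = (Element) docs.item(i);
            for (int j = 0; j < fields.length; j++) {
                if (j > 0) sb.append(",");
                NodeList n = doc.getElementsByTagName(fields[j]);
                sb.append(n.getLength() > 0 ? n.item(0).getTextContent() : "");
            }
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(convert(
                "<docs><doc><id>1</id><name>World Atlas</name></doc></docs>",
                new String[]{"id", "name"}));
    }
}
```

The resulting text could then be posted to Solr's CSV update handler, avoiding per-document XML posts over the slow connection.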
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Hi, Rushing quickly through this one: one way you can use the same index for both is by copying fields. One field copy would leave stopwords in (for the PhraseQuery), and the other copy would remove stopwords (for MLT). There may be more elegant ways to accomplish this - this is the first thing that comes to mind. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: SergeyG sgoldb...@mail.ru To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 5:31:21 AM Subject: Implementing PhraseQuery and MoreLikeThis Query in one app [...]
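Otis's copy-field idea might look roughly like this in schema.xml (the field and type names are my own placeholders, not from the thread): one field type keeps stopwords for phrase queries, the other filters them out for MLT.

```xml
<!-- sketch: two copies of the same text, analyzed differently -->
<field name="content"        type="text_keep_stops" indexed="true" stored="true"/>
<field name="content_nostop" type="text_no_stops"   indexed="true" stored="false"/>
<copyField source="content" dest="content_nostop"/>
```

Phrase searches would then target content, while mlt.fl=content_nostop keeps stopwords out of the "interesting terms"; text_no_stops would include solr.StopFilterFactory in its analyzer and text_keep_stops would omit it.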
Re: Creating spellchecker dictionary from multiple sources
Hi Lici, I don't think the current spellchecker can look at more than one field, let alone multiple indices, but you could certainly modify the code and make it do that. Looking at multiple fields of the same index may make more sense than looking at multiple indices. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Licinio Fernández Maurelo licinio.fernan...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 5:36:34 AM Subject: Creating spellchecker dictionary from multiple sources [...]
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig m...@as-guides.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:20:05 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app [...]
Re: Adding shards entries in solrconfig.xml
Rakhi, Have you looked at the Solr example directories (in Solr svn)? There may be an example of it there. From memory, the syntax is: <shards>URL1,URL2</shards> e.g. <shards>http://shard1:8080/solr,http://shard2:8080/solr</shards> This goes into one of the sections of the request handler configuration. Shards can also be specified in the shards param in the URL itself. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Rakhi Khatwani rkhatw...@gmail.com To: solr-user@lucene.apache.org Cc: ninad.r...@germinait.com Sent: Thursday, July 2, 2009 6:36:43 AM Subject: Adding shards entries in solrconfig.xml [...]
Re: Adding shards entries in solrconfig.xml
Rakhi Khatwani wrote: [...] Can anyone give an example of setting the shards parameter in solrconfig.xml? Regards, Raakhi

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="shards">localhost:8983/solr,localhost:7574/solr</str>
  </lst>
</requestHandler>

That's an example of the syntax, though. Don't do it on the standard request handler or you will create an infinite loop. Define a different handler. -- - Mark http://www.lucidimagination.com
Re: Creating spellchecker dictionary from multiple sources
You could configure multiple spellcheckers on different fields, or if you want to aggregate several fields into the suggestions, use copyField to pool all the text to be suggested into a single field. Erik On Jul 2, 2009, at 7:46 AM, Otis Gospodnetic wrote: Hi Lici, I don't think the current spellchecker can look at more than one field, let alone multiple indices [...]
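Erik's copyField suggestion could look something like this in schema.xml (the field and type names here are placeholders, not from the thread): pool the source fields into one field that the spellchecker builds its dictionary from.

```xml
<!-- sketch: aggregate several fields into a dedicated spell field -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"  dest="spell"/>
<copyField source="author" dest="spell"/>
```

The spellcheck component's configuration would then point at spell as the field to build its dictionary from.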
Metadata document for faceted search
Hi. I'm trying to implement custom faceted search like CNET's approach: http://www.mail-archive.com/java-u...@lucene.apache.org/msg02646.html But I couldn't figure out how to structure and index the category metadata document. Thanks. -- Osman İZBAT
Re: Creating spellchecker dictionary from multiple sources
Thanks for your responses, guys. My problem is that we currently have 11 cores/indexes, and some of them contain fields I want to use for spell checking. I'm thinking of building an extra core containing the dictionary index and importing the information I need from the multiple indexes via DIH. It should work, I hope. 2009/7/2 Erik Hatcher e...@ehatchersolutions.com You could configure multiple spellcheckers on different fields, or if you want to aggregate several fields into the suggestions, use copyField to pool all text to be suggested together into a single field. [...] -- Lici
Re: Installing a patch in a solr nightly on Windows
Thanks for the suggestions: Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch... command from? It doesn't work when I do it at the root of the install. Michael: I'll take a look at that standalone utility. Paul: I assume that in order to do it with svn, you need to checkout the trunk? What do you do after that? Do you have the link to the distributions? I get OPTIONS of 'http://svn.apache.org/repos/asf/lucene/solr/trunk': could not connect to server (http://svn.apache.org) when I try. Something tells me that my proxy is blocking the connection. If that is the case, then I don't think that I can do a checkout. Do you have any other alternatives? Thanks again for the input. ahammad wrote: Hello, I am trying to install a patch for Solr (https://issues.apache.org/jira/browse/SOLR-284) but I'm not sure how to do it in Windows. I have a copy of the nightly build, but I don't know how to proceed. I looked at the HowToContribute wiki for patch installation instructions, but there are no Windows specific instructions in there. Any help would be greatly appreciated. Thanks -- View this message in context: http://www.nabble.com/Installing-a-patch-in-a-solr-nightly-on-Windows-tp24273921p24306501.html Sent from the Solr - User mailing list archive at Nabble.com.
EnglishPorterFilterFactory and PatternReplaceFilterFactory
In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. Example: 'ä' and 'ae' are expected to give the same search results. To achieve this I added this filter to the text fieldtype definition: <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/> to both index and query analyzers (and more for the other umlauts). This works well when I search for a name (a word not stemmed) but not e.g. with the word Wärme: a search for 'wärme' works, a search for 'waerme' does not work, and a search for 'waerm' works if I move the EnglishPorterFilterFactory after the PatternReplaceFilterFactory. debugQuery for waerme gives a parsedquery of FS:waerm. What I don't understand is why the (existing) records are not found. If I understand it right, there should be 'waerm' in the index as well. By the way, the reason why I keep the EnglishPorterFilterFactory is that the records are in many languages and the English stemming gives good results in many cases, and I don't want (yet) to multiply my fields to have language-specific versions. But even if the stemming is not right because the language is not English, I think records should be found as long as the analyzers are the same for index and query. This is with Solr 1.3. Can someone shed some light on what is going on and how I can achieve my goal? -Michael
Making Analyzer Phrase aware?
I was looking at the SOLR-908 port of nutch CommonGramsFilter as an approach for having phrase searches be sensitive to stop words within a query. So a search on car on street wouldn't match the text car in street. From what I can tell the query version of the filter will *always* create stop-word-grams, not just in a phrase context. I want non-phrase searches to ignore stop words as usual. Can someone tell me how to make an analyzer (or token filter) phrase aware so I only create grams when I know I'm inside of a phrase? Thanks. Mike -- View this message in context: http://www.nabble.com/Making-Analyzer-Phrase-aware--tp24306862p24306862.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Installing a patch in a solr nightly on Windows
ahammad wrote: Thanks for the suggestions: Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch... command from? It doesn't work when I do it at the root of the install. It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji
Re: Installing a patch in a solr nightly on Windows
When I go to the source and I input the command, I get: bash: patch: command not found Thanks Koji Sekiguchi-2 wrote: It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji [...] -- View this message in context: http://www.nabble.com/Installing-a-patch-in-a-solr-nightly-on-Windows-tp24273921p24307414.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
I think it works better to use the highest tf.idf terms, not the highest tf. That is what I implemented for Ultraseek ten years ago. With tf, you get lots of terms with low discrimination power. wunder On 7/2/09 4:48 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. [...]
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
First, don't use an English stemmer on German text. It will give some odd results. Are you using the same conversions on the index and query side? The German stemmer might already handle typewriter umlauts. If it doesn't, use the pattern replace factory. You will also need to convert ß to ss. You really do need separate fields for each language. Handling these characters is language-specific. The typewriter umlaut conversion is wrong for English. It is correct, but rare, to see a diaresis in English when vowels are pronounced separately, like coöperate. In Swedish, it is not OK to convert ö to another letter or combination of letters. wunder On 7/2/09 6:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habbit of seeing some sort of equivalence between Umlaut letters and a two letter representation. Example 'ä' and 'ae' are expected to give the same search results. To achieve this I added this filter to the text fieldtype definition: filter class=solr.PatternReplaceFilterFactory pattern=ä replacement=ae replace=all / to both index and query analyzers (and more for the other umlauts). This works well when I search for a name (a word not stemmed) but not e.g. with the word Wärme. search for 'wärme' works search for 'waerme' does not work search for 'waerm' works if I move the EnglishPorterFilterFactory after the PatternReplaceFilterFactory. DebugQuery for waerme gives a parsedquery FS:waerm. What I don't understand is why the (existing) records are not found. If I understand it right, there should be 'waerm' in the index as well. By the way, the reason why I keep the EnglishPorterFilterFactory is that the records are in many languages and the English stemming gives good results in many cases and I don't want (yet) to multiply my fields to have language specific versions. But even if the stemming is not right because the language is not English I think records should be found as long as the analyzers are the same for index and query. 
This is with Solr 1.3. Can someone shed some light on what is going on and how I can achieve my goal? -Michael
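A field type along these lines would apply the umlaut replacements before stemming on both the index and the query side, so the stemmer always sees the two-letter form. This is only a sketch: the field type name is made up, and the set of replacement filters shown is the minimal one for German.

```xml
<!-- Hypothetical schema.xml fragment: the replacements run before the stemmer -->
<fieldType name="text_mixed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ö" replacement="oe" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ü" replacement="ue" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="ß" replacement="ss" replace="all"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the identical chain runs at index and query time, 'wärme' and 'waerme' both reach the stemmer as 'waerme' and produce the same indexed term.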
Re: Installing a patch in a solr nightly on Windows
You will need the patch binary as well to apply the diff to the original file.

On Thu, 2009-07-02 at 07:10 -0700, ahammad wrote: When I go to the source and I input the command, I get: bash: patch: command not found Thanks

Koji Sekiguchi-2 wrote: ahammad wrote: Thanks for the suggestions. Koji: I am aware of Cygwin. The problem is I am not sure how to do the whole thing. I downloaded a nightly zip file and extracted it to a directory. Where do I put the .patch file? Where do I execute the patch command from? It doesn't work when I do it at the root of the install. It should work at the root of the install: $ patch -p0 < SOLR-284.patch Do you see an error message? What's the error? Koji
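For reference, here is a minimal, self-contained demonstration of how `patch -p0` applies a unified diff from the directory the paths in the diff are relative to. The file names are made up; on Windows this would run inside a Cygwin shell.

```shell
# Create a file and a modified copy, record the difference, then apply it.
printf 'hello\n' > greeting.txt
printf 'hello world\n' > greeting.new
diff -u greeting.txt greeting.new > demo.patch || true   # diff exits 1 when files differ
rm greeting.new                                          # keep only the original

# -p0 keeps the paths in the diff exactly as written, which is why the
# command must be run from the directory the diff was made against
# (for a Solr patch, the root of the extracted nightly).
patch -p0 < demo.patch
cat greeting.txt   # now contains "hello world"
```

The same mechanics apply to SOLR-284.patch: put it at the root of the extracted nightly and run `patch -p0 < SOLR-284.patch` from there.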
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level, that would be what I need. The double slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay

2009/7/1 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com: Complete XPath is not supported; /book/body/chapter/p should work. If you wish all the text under chapter irrespective of nesting or tag names, use this: <field column="body" xpath="/book/body/chapter" flatten="true"/>

On Thu, Jul 2, 2009 at 5:31 AM, Jay Hill jayallenh...@gmail.com wrote: I'm using the XPathEntityProcessor to parse an xml structure that looks like this:

<book>
  <author>Joe Smith</author>
  <title>World Atlas</title>
  <body>
    <chapter>
      <p>Content I want is here</p>
      <p>More content I want is here.</p>
      <p>Still more content here.</p>
    </chapter>
  </body>
</book>

The author and title parse out fine: <field column="title" xpath="/book/title"/> <field column="author" xpath="/book/author"/> But I can't get at the data inside the p tags. I want to get all non-markup text inside the body tag with something like this: <field column="body" xpath="/book/body/chapter//p"/> but that is not supported. Does anyone know of a way that I can get the content within the p tags without the markup? Thanks, -Jay

-- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Master Slave data distribution | rsync fail issue
Yes. Permissions are the same across cores. ~Vikrant

Bill Au wrote: Are the user/group/permissions on the snapshot files the same for both cases (manual vs postCommit/postOptimize events)? Bill

On Tue, May 5, 2009 at 12:54 PM, tushar kapoor tushar_kapoor...@rediffmail.com wrote: Hi, I am facing an issue while performing snapshot pulling through the snappuller script from the slave server. We have a multicore setup on the master Solr and slave Solr servers. Scenario: 2 cores are set up: i) CORE_WWW.ABCD.COM ii) CORE_WWW.XYZ.COM. The rsync-enable and rsync-start scripts were run from CORE_WWW.ABCD.COM on the master server, so the rsyncd.conf file got generated on CORE_WWW.ABCD.COM only, but not on CORE_WWW.XYZ.COM.

rsyncd.conf of CORE_WWW.ABCD.COM:
uid = webuser
gid = webuser
use chroot = no
list = no
pid file = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/logs/rsyncd.pid
log file = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/logs/rsyncd.log
[solr]
path = /opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/data
comment = Solr

An rsync error used to get generated while pulling the master server snapshot of the core CORE_WWW.XYZ.COM from the slave end; for the core CORE_WWW.ABCD.COM the snappuller ran without any error. Also, this issue only comes up when snapshots are generated at the master end in the way given below:

A) Snapshots are generated automatically by editing ${SOLR_HOME}/solr/conf/solrconfig.xml to let either a commit or an optimize trigger the snapshooter (search "postCommit" and "postOptimize" to find the configuration section).

Sample solrconfig.xml entry on the master server end: I)
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/bin/snapshooter</str>
  <str name="dir">/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.ABCD.COM/bin</str>
  <bool name="wait">true</bool>
  <arr name="args"><str>arg1</str><str>arg2</str></arr>
  <arr name="env"><str>MYVAR=val1</str></arr>
</listener>
The same was done in the CORE_WWW.XYZ.COM solrconfig.xml. II) The dataDir tag remains commented out in both cores' XML on the master server.

Log sample for more clarity. rsyncd.log of the core CORE_WWW.XYZ.COM:
2009/05/01 15:48:40 command: ./rsyncd-start
2009/05/01 15:48:40 [15064] rsyncd version 2.6.3 starting, listening on port 18983
2009/05/01 15:48:40 rsyncd started with data_dir=/opt/apache-tomcat-6.0.18/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.XYZ.COm/data and accepting requests
2009/05/01 15:50:36 [15195] rsync on solr/snapshot.20090501153311/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 15:50:36 [15195] rsync: link_stat snapshot.20090501153311/. (in solr) failed: No such file or directory (2)
2009/05/01 15:50:36 [15195] rsync error: some files could not be transferred (code 23) at main.c(442)
2009/05/01 15:52:23 [15301] rsync on solr/snapshot.20090501155030/ from delpearsondm.sapient.com (10.210.7.191)
2009/05/01 15:52:23 [15301] wrote 3438 bytes read 290 bytes total size 2779
2009/05/01 16:03:31 [15553] rsync on solr/snapshot.20090501160112/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 16:03:31 [15553] rsync: link_stat snapshot.20090501160112/. (in solr) failed: No such file or directory (2)
2009/05/01 16:03:31 [15553] rsync error: some files could not be transferred (code 23) at main.c(442)
2009/05/01 16:04:27 [15674] rsync on solr/snapshot.20090501160054/ from deltrialmac.mac1.com (10.210.7.191)
2009/05/01 16:04:27 [15674] wrote 4173214 bytes read 290 bytes total size 4174633

I'm unable to figure out where the /. gets appended at the end of snapshot.20090501153311/.

snappuller.log:
2009/05/04 16:55:43 started by solrUser
2009/05/04 16:55:43 command: /opt/apache-solr-1.3.0/example/solr/multicore/CORE_WWW.PUFFINBOOKS.CA/bin/snappuller -u webuser
2009/05/04 16:55:52 pulling snapshot snapshot.20090504164935
2009/05/04 16:56:09 rsync failed
2009/05/04 16:56:24 failed (elapsed time: 41 sec)

Error shown on console:
rsync: link_stat snapshot.20090504164935/. (in solr) failed: No such file or directory (2)
client: nothing to do: perhaps you need to specify some filenames or the --recursive option?
rsync error: some files could not be transferred (code 23) at main.c(723)

B) The same issue does not come up when manually running the snapshooter script at regular intervals on the master server and then running the snappuller script at the slave end for multiple cores. The postCommit/postOptimize part of solrconfig.xml has been commented out. Here also the rsync script was run through the core CORE_WWW.ABCD.COM. Snappuller and snapinstaller occurred
Re: Is there any other way to load the index beside using http connection?
LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql User manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html LuSql can communicate directly with Oracle and create a Lucene index for you. Of course, as mentioned by other posters, you need to make sure the versions of Lucene and Solr are compatible (use the same jars), that you use the same Analyzers, and that you create the appropriate 'schema' that Solr understands. -glen

2009/7/2 Francis Yakin fya...@liquid.com: Glen, the database we use is Oracle. I am not the database administrator, so I am not familiar with their scripts. So, basically we have an Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of an HTTP connection, to load the XML files to our Solr master? You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat). Thanks, Francis

-----Original Message----- From: Glen Newton [mailto:glen.new...@gmail.com] Sent: Wednesday, July 01, 2009 8:06 PM To: solr-user@lucene.apache.org Subject: Re: Is there any other way to load the index beside using http connection? You can load directly into the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql. -Glen http://zzzoot.blogspot.com/ [1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com: We have several thousands of XML files in the database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master; Solr then translates the XML files into its Lucene index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.
Is there any other way to load the data/index from the database to the Solr master besides an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert those to Lucene indexes? Any input or help will be much appreciated. Thanks, Francis
Re: Is there any other way to load the index beside using http connection?
Are you saying that we have to use LuSql replacing our Solr? To load your data: yes, it is an option. To search your data: no, LuSql is only a loading tool. -glen

2009/7/2 Francis Yakin fya...@liquid.com: Glen, are you saying that we have to use LuSql replacing our Solr? Francis
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
I'm shooting a bit in the dark here, but I'd guess that these are actually understandable results. If you replace then stem, the stemming algorithm works on the exact same word, and you get the results you expect. If you stem then replace, the inputs to the stemmer are different, so the fact that your outputs are different isn't a surprise. That is, your implicit assumption, it seems to me, is that 'wärme' and 'waerme' should go through the stemmer and become 'wärm' and 'waerm', which you can then do the substitution on and produce the same output. I don't think that's a valid assumption. You could probably check the actual contents of your index with Luke and verify whether your assumptions are correct or not. Best, Erick

On Thu, Jul 2, 2009 at 9:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. [...]
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 16:34 Walter Underwood wrote: First, don't use an English stemmer on German text. It will give some odd results. I know, but at the moment I only have the choice between no stemmer at all and one stemmer, and since more than half of the records are English (about 60% English, 30% German, some Italian, French and others) the results are not too bad. Are you using the same conversions on the index and query side? Yes, index and query look exactly the same. That is what I don't understand. I am not complaining about a misbehaving stemmer, unless it already does something odd with the umlauts. The German stemmer might already handle typewriter umlauts. If it doesn't, use the pattern replace factory. You will also need to convert ß to ss. That is what I tried. And yes, I also have a filter for ß to ss. It just doesn't work as expected. You really do need separate fields for each language. Eventually. But now I have to get ready really soon with a small application, and people don't find what they expect. Handling these characters is language-specific. The typewriter umlaut conversion is wrong for English. It is correct, but rare, to see a diaeresis in English when vowels are pronounced separately, like coöperate. In Swedish, it is not OK to convert ö to another letter or combination of letters. It is just for German users, and at the moment it would be totally OK to have coöperate indexed as cooeperate. I know it is wrong and it will be fixed, but given the tight schedule, all I want at the moment is the combination of some stemming (perhaps 70% right or more) and typewriter umlauts (perhaps 90% correct; you gave examples for the missing 10%). Do I have any chance? -Michael
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements.

Hm, I am sure I have done this. In your schema.xml, is the field body multiValued or not?

If I use <field column="body" xpath="/book/body/chapter" flatten="true"/> or <field column="body" xpath="/book/body/chapter/" flatten="true"/> I don't get back anything for the body column. So the first example is close, but it only gets the text for the last p element. If I could get all p elements at the same level, that would be what I need. The double slash (/book/body/chapter//p) doesn't seem to be supported. Thanks, -Jay

2009/7/1 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com: Complete XPath is not supported; /book/body/chapter/p should work. [...]

-- Fergus McMenemie Email: fer...@twig.me.uk Techmore Ltd Phone: (UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
wunder, thank you. (Sorry, I'm not sure if this is your first name.) I thought the MoreLikeThis query normally uses tf.idf of the terms when deciding which terms are the most important (not the most frequent). And if this is not the case, how can I change its behavior?

SergeyG wrote: Hi, Recently I've posted a question regarding using stop words in a PhraseQuery and in a MoreLikeThis query in the same app. I posted it twice. Unfortunately I didn't get any responses. I realize that the question might not have been formulated clearly, so let me reformulate it. Can both queries - PhraseQuery and MoreLikeThis - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Or do these two queries need to use two different indexes and thus have to be implemented in different applications, or in different cores of Solr (with different schema.xml files: one with the stop word filter and another without it)? Any opinion will be highly appreciated. Thank you. Regards, Sergey Goldberg

P.S. Just for reference, here is my original message. 1. There are 3 kinds of searches in my application: a) PhraseQuery search; b) search for separate words; c) MLT search. The problem I encountered is in the use of a stop words list. If I don't take it into account, the MLT query picks up common words as the most important words, which is not right. And when I use it, the PhraseQuery stops working. I tried it with the ps and qs parameters (ps=100, qs=100) but that didn't change anything. (Both indexed fields are of type text, the StandardAnalyzer is applied, and all docs are in English.) 2. Do I understand it right that the query q=id:1&mlt=true&mlt.fl=content... should bring back documents where the most important words are in the set of those for the doc with id=1?
-- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24309831.html Sent from the Solr - User mailing list archive at Nabble.com.
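One way out of the either/or, sketched here as an assumption rather than a tested recipe: keep both views of the content in a single schema via copyField, and point MLT at a stop-filtered copy while phrase queries use the unfiltered field. All field and type names below are made up.

```xml
<!-- schema.xml sketch: one core, two views of the same content -->
<!-- "text_with_stops" keeps stop words, so phrase queries still work -->
<field name="content" type="text_with_stops" indexed="true" stored="true"/>
<!-- "text_stopped" applies a StopFilterFactory, so MLT's term selection
     never sees stop words -->
<field name="content_mlt" type="text_stopped" indexed="true" stored="false"/>
<copyField source="content" dest="content_mlt"/>
```

An MLT request would then target the filtered copy, e.g. q=id:1&mlt=true&mlt.fl=content_mlt, while normal and phrase searches keep using content, avoiding the need for two indexes or two cores.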
Re: DIH: Limited xpath syntax unable to parse all xml elements
It is not multiValued. The intention is to get all text under the body element into one body field in the index that is not multiValued - essentially everything within the body element minus the markup. Thanks, -Jay

On Thu, Jul 2, 2009 at 8:55 AM, Fergus McMenemie fer...@twig.me.uk wrote: Thanks Noble, I gave those examples a try. If I use <field column="body" xpath="/book/body/chapter/p" /> I only get the text from the last p element, not from all elements. Hm, I am sure I have done this. In your schema.xml, is the field body multiValued or not?
-- Fergus McMenemie Email: fer...@twig.me.uk Techmore Ltd Phone: (UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
Also, check out MappingCharFilterFactory in Solr 1.4 and mapping-ISOLatin1Accent.txt in example/solr/conf. -Yonik http://www.lucidimagination.com

On Thu, Jul 2, 2009 at 9:27 AM, Michael Lackhoff mich...@lackhoff.de wrote: In Germany we have a strange habit of seeing some sort of equivalence between umlaut letters and a two-letter representation. Example: 'ä' and 'ae' are expected to give the same search results. [...]
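As a sketch of that suggestion (Solr 1.4 only; the field type and mapping file names are made up, and note that the stock mapping-ISOLatin1Accent.txt folds ä to a, so the German two-letter convention needs its own mapping file):

```xml
<!-- schema.xml: the charFilter rewrites characters before tokenization,
     so every downstream filter and the stemmer see the two-letter form -->
<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-german-umlauts.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

where mapping-german-umlauts.txt would contain lines like:

```
"ä" => "ae"
"ö" => "oe"
"ü" => "ue"
"ß" => "ss"
```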
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
On 02.07.2009 17:28 Erick Erickson wrote: I'm shooting a bit in the dark here, but I'd guess that these are actually understandable results. Perhaps not too much in the dark. That is, your implicit assumption, it seems to me, is that 'wärme' and 'waerme' should go through the stemmer and become 'wärm' and 'waerm', which you can then do the substitution on and produce the same output. I don't think that's a valid assumption. Sounds very reasonable. Will see what I can make out of all this to keep our librarians happy... Yonik Seeley wrote: Also, check out MappingCharFilterFactory in Solr 1.4 and mapping-ISOLatin1Accent.txt in example/solr/conf. Thanks for the hint, looking forward to the 1.4 release ;-) At the moment we are on 1.3 though; I hope to upgrade soon, but probably not soon enough for this app. -Michael
Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory
You might try a German stemmer. English gets a small benefit from stemming, maybe 5%. German is more heavily inflected than English, so it may get a bigger improvement. German search usually needs word breaking, so that Orgelmusik can be split into Orgel and Musik. To get that, you will probably need a commercial stemmer. wunder

On 7/2/09 8:42 AM, Michael Lackhoff mich...@lackhoff.de wrote: [...]
Re: DIH: Limited xpath syntax unable to parse all xml elements
On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote: It looks like DIH implements its own subset of the Xpath spec. Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples. I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node? It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here. -- Regards, Shalin Shekhar Mangar.
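Putting Shalin's explanation together with Jay's goal, one plausible fix (an untested sketch; the field type is an assumption) is to declare the column multiValued so that every matching <p> is kept rather than only the last one:

```xml
<!-- schema.xml: keep all three <p> values instead of only the last -->
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- data-config.xml: the same XPath, now yielding one value per <p> -->
<field column="body" xpath="/book/body/chapter/p"/>
```

If a single concatenated value is strictly required, the joining would have to happen outside DIH's XPath subset, e.g. in a transformer, since // is not supported.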
Re: multi-word synonyms with multiple matches
: vp,vice president : svp,senior vice president : : However, a search for vp does not return results where the title is : senior vice president. It appears that the term vp is not indexed : when there is a longer string that matches a different synonym. Is this : by design, and is there any way to make solr index all synonyms that : match a term, even if it is contained in a longer synonym? Thanks!

You haven't given us the full details on how you are using the SynonymFilterFactory (expand true or false?), but in general: yes, the SynonymFilter finds the longest match it can. If every svp is also a vp, then being explicit in your synonyms (when doing index-time expansion) should work:

vp,vice president
svp,senior vice president => vp,svp,senior vice president

-Hoss
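For context, a sketch of how that mapping would sit in the configuration (assuming index-time expansion; the surrounding field type is omitted):

```xml
<!-- analyzer fragment: expand="true" applies the synonyms at index time -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```

with synonyms.txt containing:

```
vp,vice president
svp,senior vice president => vp,svp,senior vice president
```

The explicit => rule keeps a document titled "senior vice president" findable under vp even though the longest-match rule would otherwise only fire the svp mapping.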
Re: Master Slave data distribution | rsync fail issue
You can add the -V option to both your automatic and manual invocations of snappuller and snapinstaller for both cores and compare the debug info. Bill

On Thu, Jul 2, 2009 at 11:02 AM, Vicky_Dev vikrantv_shirbh...@yahoo.co.in wrote: Yes. Permissions are the same across cores. ~Vikrant

Bill Au wrote: Are the user/group/permissions on the snapshot files the same for both cases (manual vs postCommit/postOptimize events)? Bill [...]
RE: Is there any other way to load the index beside using http connection?
Norberto,

Thanks for your input. What do you mean by "Have you tried connecting to SOLR over HTTP from localhost, therefore avoiding any firewall issues and network latency? it should work a LOT faster than from a remote site."?

Here is how our servers are laid out:
1) Database (Oracle) is running on a separate machine
2) Solr master is running on a separate machine by itself
3) 6 Solr slaves (these 6 pull the index from the master using rsync)

We have a SQL (Oracle) script to post the data/index from the Oracle database machine to the Solr master over HTTP. Someone on the Oracle database administration team wrote it. In the Solr master configuration we have a scripts.conf like this:

user=
solr_hostname=localhost
solr_port=7001
rsyncd_port=18983
data_dir=
webapp_name=solr
master_host=localhost
master_data_dir=solr/snapshot
master_status_dir=solr/status

So, basically, from the Oracle system we launch the Oracle/SQL script, posting the data to the Solr master using http://solrmaster/solr/update (inside the SQL script we put this). We cannot use localhost since Solr is not running on the Oracle machine.

Another alternative we are thinking of is to transform the XML into CSV and import/export it. How about LuSql, which some mentioned? Is this a free (open source) application? Do you have any experience with it?

Thanks all for your valuable suggestions!

Francis

-----Original Message-----
From: Norberto Meijome [mailto:numard...@gmail.com]
Sent: Thursday, July 02, 2009 3:01 AM
To: solr-user@lucene.apache.org
Cc: Francis Yakin
Subject: Re: Is there any other way to load the index beside using http connection?

On Wed, 1 Jul 2009 15:07:12 -0700 Francis Yakin fya...@liquid.com wrote:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.
Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Francis, after reading the whole thread, it seems you have:
- Data source: Oracle DB, in a separate location from your Solr.
- Data format: XML output.

DIH is definitely a great option, but since you are on 1.2 it is not available to you (you should look into upgrading if you can!).

Have you tried connecting to Solr over HTTP from localhost, therefore avoiding any firewall issues and network latency? It should work a LOT faster than from a remote site. Also make sure not to commit until you really need to.

Other alternatives are to transform the XML into CSV and import it that way, or write a simple app that will parse the XML and post it directly using the embedded Solr method. Plenty of options, all of them documented @ solr's site.

good luck,
b

_
{Beto|Norberto|Numard} Meijome

"People demand freedom of speech to make up for the freedom of thought which they avoid." Soren Aabye Kierkegaard

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
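In practical terms, the localhost approach suggested above amounts to copying the exported XML onto the master (scp/ftp, which already passes the firewall) and posting it from there. A rough sketch, with made-up paths and the default example port, so adjust for your servlet container:

```shell
# Copy the exported XML files onto the Solr master first, e.g.:
#   scp /exports/docs-*.xml solrmaster:/tmp/solr-load/

# Then, on the master itself, post each file to the local update handler.
for f in /tmp/solr-load/docs-*.xml; do
  curl "http://localhost:8983/solr/update" \
       -H 'Content-Type: text/xml' \
       --data-binary @"$f"
done

# Commit once at the end, rather than per file, as suggested above.
curl "http://localhost:8983/solr/update" \
     -H 'Content-Type: text/xml' \
     --data-binary '<commit/>'
```

The hostname, port, and webapp path here are illustrative; they depend on how Solr is deployed.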
RE: Is there any other way to load the index beside using http connection?
Glen,

Is LuSql free? Is it open source? Does it require a separate machine from the Solr master?

I forgot to tell you that we have a master/slaves Solr environment. The database is Oracle, on a separate machine running in a different network than the Solr master and slaves (there is a firewall between the Oracle machine and the Solr machines). If we have a LuSql machine, do you think it is better to put it on the same network as the database machine or the Solr machines?

Do I need to create a SQL script to get the data from Oracle, load it using LuSql, and convert it to a Lucene index? And how will the Solr master get that data?

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Thursday, July 02, 2009 8:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User Manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

LuSql can communicate directly with Oracle and create a Lucene index for you. Of course - as mentioned by other posters - you need to make sure the versions of Lucene and Solr are compatible (use the same jars), you use the same Analyzers, and you create the appropriate 'schema' that Solr understands.

-glen

2009/7/2 Francis Yakin fya...@liquid.com:

Glen,

The database we use is Oracle. I am not the database administrator, so I am not familiar with their script. So, basically, we have the Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of using an HTTP connection, to load the XML files to our Solr master?

You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat).
Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, July 01, 2009 8:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

You can directly load to the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql.

-Glen
http://zzzoot.blogspot.com/

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.

Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Any input or help will be much appreciated.

Thanks

Francis
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

--
- Mark

http://www.lucidimagination.com
Re: Changing the score of a document based on the value of a field
: The SolrRelevancyFAQ has a heading that's the same as my message's subject:
:
: http://wiki.apache.org/solr/SolrRelevancyFAQ#head-f013f5f2811e3ed28b200f326dd686afa491be5e
:
: There's a TODO on the wiki to provide an actual example. Does anybody happen
: to have an example handy that I could model my query after? Thank you

The types of things that are possible were already pretty clear if you read up on function queries, but I went ahead and added some simple examples.

-Hoss
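For anyone reading this thread later, the kinds of examples in question look roughly like the following, using the standard request handler's _val_ hook (the popularity and date field names are purely illustrative):

```text
# boost matching documents by a numeric field
q=ipod _val_:"popularity"

# favor recent documents: reciprocal of the reverse ordinal of a date field
q=ipod _val_:"recip(rord(date),1,1000,1000)"
```

With the dismax handler, the equivalent is usually expressed through the bf (boost function) parameter instead of embedding _val_ in the query string.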
Re: Is there any other way to load the index beside using http connection?
2009/7/2 Francis Yakin fya...@liquid.com:

Glen, Is LuSql free? Is it open source?

LuSql is an Open Source project.

Does it require a separate machine from the Solr master?

LuSql is a Java application that runs on the command line. It connects to the database using JDBC and creates a local Lucene index, based on the configuration you supply to it.

I forgot to tell you that we have a master/slaves Solr environment. The database is Oracle, on a separate machine running in a different network than the Solr master and slaves (there is a firewall between the Oracle machine and the Solr machines). If we have a LuSql machine, do you think it is better to put it on the same network as the database machine or the Solr machines?

LuSql is heavily multi-threaded, and can suck up the resources of all cores (this is why it runs so fast), so you need to decide whether this is appropriate for your database machine (i.e. if it is a production machine). You can isolate LuSql to specific cores using something like numactl: http://www.linuxmanpages.com/man8/numactl.8.php

Do I need to create a SQL script to get the data from Oracle, load it using LuSql, and convert it to a Lucene index? And how will the Solr master get that data?

LuSql reads from Oracle and writes to a Lucene index. You just need to give LuSql a configuration that has it generate the appropriate index for Solr.

thanks,
Glen
http://zzzoot.blogspot.com/search?q=lucene

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Thursday, July 02, 2009 8:22 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

LuSql can be found here: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User Manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

LuSql can communicate directly with Oracle and create a Lucene index for you.
Of course - as mentioned by other posters - you need to make sure the versions of Lucene and Solr are compatible (use the same jars), you use the same Analyzers, and you create the appropriate 'schema' that Solr understands.

-glen

2009/7/2 Francis Yakin fya...@liquid.com:

Glen,

The database we use is Oracle. I am not the database administrator, so I am not familiar with their script. So, basically, we have the Oracle SQL script to load the XML files over an HTTP connection to our Solr master. My question is: is there any other way, instead of using an HTTP connection, to load the XML files to our Solr master? You mentioned LuSql; I am not familiar with that. Can you provide us the docs or something? Again, I am not the database guy, I am only the Solr guy. The database is on a different box than the Solr master, and both are running Linux (RedHat).

Thanks

Francis

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, July 01, 2009 8:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Is there any other way to load the index beside using http connection?

You can directly load to the backend Lucene index using LuSql[1]. It is faster than Solr, sometimes as much as an order of magnitude faster. Disclosure: I am the author of LuSql.

-Glen
http://zzzoot.blogspot.com/

[1] http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

2009/7/1 Francis Yakin fya...@liquid.com:

We have several thousands of XML files in a database that we load to the Solr master. The database uses an HTTP connection to transfer those files to the Solr master. Solr then translates the XML files into its index. We are experiencing issues with close/open connections in the firewall, and it is very, very slow.

Is there any other way to load the data/index from the database to the Solr master besides using an HTTP connection? That is, could we just scp/ftp the XML files from the database system to the Solr master and let Solr convert them to Lucene indexes?

Any input or help will be much appreciated.

Thanks

Francis
Re: DIH: Limited xpath syntax unable to parse all xml elements
On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes, just like you'd create any multi-valued field. The problem is that his field is not declared to be multi-valued. The same would happen if you posted an XML document to /update with multiple values for a single-valued field.

XPathEntityProcessor provides the flatten=true option if you want to add it as concatenated text. Jay mentioned that flatten did not work for him, which is something we should investigate. Jay, which version of Solr are you running? The flatten option is a 1.4 feature (added with SOLR-1003).

--
Regards,
Shalin Shekhar Mangar.
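To make the two options being discussed concrete, here is a rough sketch of both configurations (field names and XPath values follow the thread's example; treat them as illustrative):

```xml
<!-- Option 1: declare the field multi-valued in schema.xml, so each
     matching node becomes a separate value. -->
<field name="body" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- Option 2 (Solr 1.4, SOLR-1003): keep the field single-valued and let
     XPathEntityProcessor concatenate all matching text into one value.
     This goes inside the entity definition in the DIH data-config: -->
<field column="body" xpath="/book/body/chapter" flatten="true"/>
```

The first snippet belongs in schema.xml; the second inside the DIH data-config.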
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

Only when the field in schema.xml is not multiValued. If the field is multiValued it should still behave as at present? Also... what went wrong with the suggested:

<field column="body" xpath="/book/body/chapter" flatten="true"/>

Regards
Fergus.
Re: DIH: Limited xpath syntax unable to parse all xml elements
Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes just like you'd create any multi-valued field.

Then shouldn't it throw an error? If your field is not multivalued, but the XML is multivalued, it does seem arbitrary to pick the last node when XPath says to select them all. It seems it should throw an error (saying to use flatten or a multiValued field?) or concatenate all the text.

--
- Mark

http://www.lucidimagination.com
Re: DIH: Limited xpath syntax unable to parse all xml elements
I'm on the trunk, built on July 2: 1.4-dev 789506

Thanks,
-Jay

On Thu, Jul 2, 2009 at 11:33 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Jul 2, 2009 at 11:38 PM, Mark Miller markrmil...@gmail.com wrote:

Shalin Shekhar Mangar wrote:

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

I won't call it arbitrary because it creates a SolrInputDocument with values from all the matching nodes, just like you'd create any multi-valued field. The problem is that his field is not declared to be multi-valued. The same would happen if you posted an XML document to /update with multiple values for a single-valued field.

XPathEntityProcessor provides the flatten=true option if you want to add it as concatenated text. Jay mentioned that flatten did not work for him, which is something we should investigate. Jay, which version of Solr are you running? The flatten option is a 1.4 feature (added with SOLR-1003).

--
Regards,
Shalin Shekhar Mangar.
Preparing the ground for a real multilang index
As pointed out in the recent thread about stemmers and other language specifics, I should handle each language in its own right. But how?

The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases?

Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible.

-Michael
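The per-language-field setup described above would look roughly like this in schema.xml (type names, stemmer choices, and the catch-all field are all illustrative):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<field name="TITLE_ENG" type="text_en" indexed="true" stored="true"/>
<field name="TITLE_GER" type="text_de" indexed="true" stored="true"/>
<!-- unstemmed catch-all field -->
<field name="TITLE" type="text" indexed="true" stored="false"/>

<copyField source="TITLE_*" dest="TITLE"/>
```

One caveat on the copyField idea in the question: copyField copies the raw input before analysis, not the analyzed tokens, so the catch-all TITLE field gets its own (unstemmed) analysis rather than inheriting the per-language stemming.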
Re: DIH: Limited xpath syntax unable to parse all xml elements
Thanks Fergus, setting the field to multivalued did work:

<field column="body" xpath="/book/body/chapter/p" flatten="true"/>

gets all the p elements as multiple values in the body field. The only thing is, the body field is used by some other content sources, so I have to look at the implications that setting it to multi-valued will have on the other data sources. Still, this might do the trick. Thanks to all that helped on this!

-Jay

On Thu, Jul 2, 2009 at 11:40 AM, Fergus McMenemie fer...@twig.me.uk wrote:

Shalin Shekhar Mangar wrote:

On Thu, Jul 2, 2009 at 11:08 PM, Mark Miller markrmil...@gmail.com wrote:

It looks like DIH implements its own subset of the Xpath spec.

Right, DIH has a streaming implementation supporting a subset of XPath only. The supported things are in the wiki examples.

I don't see any tests with multiple matching sub nodes, so perhaps DIH Xpath does not properly support that and just selects the last matching node?

It selects all matching nodes. But if the field is not multi-valued, it will store only the last value. I guess this is what is happening here.

So do you think it should match them all and add the concatenated text as one field? That would be more Xpath like I think, and less arbitrary than just choosing the last one.

Only when the field in schema.xml is not multiValued. If the field is multiValued it should still behave as at present? Also... what went wrong with the suggested:

<field column="body" xpath="/book/body/chapter" flatten="true"/>

Regards
Fergus.
Re: Plugin Performance Issues
: I'm not entirely convinced that it's related to our code, but it could be.
: Just trying to get a sense if other plugins have had similar problems, just
: by the nature of using Solr's resource loading from the /lib directory.

Plugins aren't something that every Solr user uses -- but enough people use them that if there was a fundamental memory leak just from loading plugin jars, I'm guessing more people would be complaining. I use plugins in several Solr instances, and I've never noticed any problems like you describe -- but I don't personally use Tomcat.

Otis is right on the money: you need to use profiling tools to really look at the heap and see what's taking up all that RAM.

Alternately: a quick way to rule out the special plugin class loader would be to embed your custom handler directly into the solr.war ("The Old Way" on the SolrPlugins wiki) ... if you still have problems, then the cause isn't the plugin classloader.

-Hoss
Retrieve docs with 1 multivalue field hits
Greetings! I thought I remembered seeing a thread related to retrieving only documents that had more than one hit in a particular multivalue field, but I cannot find it now. Regardless, is this possible in Solr 1.3? Solr 1.4? -- A. Steven Anderson Independent Consultant
Confirming doc change for Wiki for schema / plugins config
There's a particular confusion I've had with the Solr schema and plugins. Though this stuff is obvious to the gurus, looking around I guess I wasn't alone in my confusion. I believe I understand it now and wanted to capture that on the Wiki, but I'm just double checking, and maybe the gurus have some additional comments?

Two Syntaxes AND Two Plugin Sets

There is an abbreviated syntax for specifying plugins in the schema, but there is a more powerful syntax that is preferred. Also, Solr supports both Solr-specific plugins and is compatible with Lucene plugins. These two differences tend to coincide: Solr plugins use the longer, more powerful syntax, whereas Lucene plugins generally must use the abbreviated syntax OR use a custom adapter class.

Two Syntaxes for Defining Field Type Plugins:

Abbreviated Syntax:

<fieldType name="..." class="...">
  <analyzer class="SomeAnalyzer"/> <!-- Do not put additional plugins here -->
</fieldType>

Modern Syntax:

<fieldType name="..." class="...">
  <analyzer>
    <tokenizer class="SomeTokenizer"/>
    <filter class="SomeFilter"/>
    <!-- other filters ... -->
  </analyzer>
</fieldType>

Of course you can have multiple analyzer blocks in the newer syntax, one for index time and one for search. And the filters can have options, etc.

This is confusing because the analyzer tag can EITHER have a class= attribute OR nested subelements, usually of type tokenizer and filter. You should not do both! Further, the main fieldType element also takes a class attribute, which is required, but this is a separate class (...could use some narrative as to why).

Two Common Sources of Plugins:

When looking at schema configurations you find online, it's very important to notice the prefixes in the class names. Classes starting with org.apache.solr.analysis. or the shorthand solr. are Solr specific, and will use the longhand syntax.

Classes starting with org.apache.lucene.analysis. are NOT native Solr plugins and must EITHER use the shorthand syntax (which limits your functionality), or you need to add a custom adapter class.

This is generally a good thing. There are quite a few Lucene plugins out there, and Solr can use any of them out of the box without the need for breaking out a Java compiler. However, when used in this compatibility mode, you give up some functionality. And you can't just use the longer syntax with the Lucene plugins; the advanced syntax isn't directly compatible (at this time). If you want the advantages of the long-form syntax you need to use a Lucene-to-Solr adapter class, often called a factory class.

Examples of Right and Wrong Configurations

Asian-language Solr users will often want to use the CJK processor (CJK = Chinese, Japanese and Korean). They will typically use the base Lucene plugin, but in various configurations.

Examples using CJK Plugins:

<!-- Correct: short form using Lucene-compatible syntax -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<!-- Incorrect: attempt to use long form with Lucene plugins -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  <!-- Wrong: won't be used! -->
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- ... other filters ... -->
</fieldType>

<!-- Correct: long-form syntax for Lucene plugins THAT HAVE AN ADAPTER -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <!-- This ONLY works if you have an adapter class -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- ... other filters ... -->
  </analyzer>
</fieldType>

There is a nice thread about the adapter class you need.
Later on in the thread the discussion evolves into whether or not to make an uber Lucene class loader, and the performance impact that might have here: http://www.mail-archive.com/solr-user@lucene.apache.org/msg04487.html -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
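For reference, the adapter ("factory") class discussed above is usually only a few lines of Java. A sketch for the CJK tokenizer case, assuming the Solr 1.3-era plugin API (the package name is made up, and compiling this needs the Solr core and Lucene contrib-analyzers jars on the classpath):

```java
package com.example.solr.analysis; // hypothetical package

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * Adapts Lucene's CJKTokenizer so it can be referenced with the long-form
 * <tokenizer class="..."/> syntax in schema.xml.
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    return new CJKTokenizer(input);
  }
}
```

Compile it, drop the jar into Solr's lib directory, and reference it as <tokenizer class="com.example.solr.analysis.CJKTokenizerFactory"/>.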
Re: Confirming doc change for Wiki for schema / plugins config
On Thu, Jul 2, 2009 at 3:53 PM, Mark Bennettmbenn...@ideaeng.com wrote: There is an abbreviated syntax for specifying plugins in the schema, but there is a more powerful syntax that is preferred. I think of it as specifying the Analyzer for a field: one can either specify a Java Analyzer class (opaque, but good for legacy Analyzer implementations or implementations that don't even use Tokenizer/TokenFilter chains), or specify an Analyzer as a Tokenizer followed by a list of Filters. I'm still planning on cleaning up the schema for 1.4 - I'll see if the comments can be made a little clearer. This is confusing because the analyzer tag can EITHER have a class= attribute OR nested subelements, usually of type tokenizer and filter. You should not do both! Futher, the main fieldType element also takes a class attribute, which is required, but this is a separate class (...could use some narrative as to why) For polymorphic behavior for everything that falls outside Analyzer. Classes starting with org.apache.lucene.analysis. are NOT native Solr plugins and must EITHER use the short hand syntax (which limits your functionality), or you need to add a custom adapter class. Yeah, for years I've meant to look into getting this to just work w/o having to create a factory. FYI - the long-form/short-form is just a classloading thing, and doesn't relate to factories. It's only correlated in that something in the solr namespace should have a factory. -Yonik http://www.lucidimagination.com
Re: Preparing the ground for a real multilang index
Michael,

I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend, unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess the language of very short texts. :)

You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word "die" (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language-specific word, say "wunderbar", in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Michael Lackhoff mich...@lackhoff.de
To: solr-user@lucene.apache.org
Sent: Thursday, July 2, 2009 2:58:41 PM
Subject: Preparing the ground for a real multilang index

As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? The first problem is how to know the language.
Then I could just copy all the TITLE_* fields to TITLE and don't bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
I could be wrong about MLT - maybe it really does use TF IDF and not raw frequency. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Walter Underwood wunderw...@netflix.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 10:26:33 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app I think it works better to use the highest tf.idf terms, not the highest tf. That is what I implemented for Ultraseek ten years ago. With tf, you get lots of terms with low discrimination power. wunder On 7/2/09 4:48 AM, Otis Gospodnetic wrote: Michael - because they are the most frequent, which is how MLT selects terms to use for querying, IIRC. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:20:05 AM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app SergeyG schrieb: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app taking into account the fact that for the former to work the stop words list needs to be included and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig
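For context, the tf·idf weighting Walter refers to scores a term t in document d, over a collection of N documents, roughly as:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

where tf is the term's frequency in the document and df is the number of documents containing the term. A term that is frequent in the document but rare in the collection gets a high weight, which is exactly the discrimination power that raw tf alone lacks.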
DocSlice andNotSize
Hi,

I have a simple question regarding the DocSlice class. I'm trying to use the (very handy) set operations on DocSlices and I'm rather confused by the way it behaves. I have 2 DocSlices: atlDocs, which, looking at the debugger, holds a docs array of ints of size 1; the second DocSlice is btlDocs, with a docs array of ints of size 67. I know that atlDocs is a subset of btlDocs, so doing btlDocs.andNotSize(atlDocs) should really return 66. But it's returning 10. Any idea what I'm misunderstanding here?

Thanks in advance.

Candide
Re: Preparing the ground for a real multilang index
Not to mention Americans who call themselves wunder. Or brand names, like LaserJet, which are the same in all languages. Queries are far too short for effective language id. You can get language preferences from an HTTP request headers, then allow people to override them. I think the header is Accept-language, but it has been a long time since I did that. I recommend using ISO language codes, en, de, es, fr, and so on, instead of making up your own, like eng and ger. Don't confuse them with ISO country codes: uk, us, etc. Korean and Japanese are easy to mix up with the country codes. wunder On 7/2/09 1:15 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Michael, I think you really aught to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess a language of very short texts. :) You really want to avoid that huge OR. Often it makes no sense to OR in multilingual context. Think about the word die (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language specific word, say wunderbar, in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar Otis-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Lackhoff mich...@lackhoff.de To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:58:41 PM Subject: Preparing the ground for a real multilang index As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? The first problem is how to know the language. 
Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases? Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer? Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
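For illustration only (not from the thread), a schema.xml fragment along the lines being described might look like this; the type names and analyzer chains below are assumptions, not Michael's actual schema. Note that copyField copies the raw input text, not the analyzed tokens, so the catch-all TITLE field gets its own language-neutral analysis:

```xml
<!-- sketch: per-language stemmed fields plus an unstemmed catch-all -->
<types>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_de" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_plain" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="TITLE_ENG" type="text_en" indexed="true" stored="true"/>
  <field name="TITLE_GER" type="text_de" indexed="true" stored="true"/>
  <field name="TITLE" type="text_plain" indexed="true" stored="true"/>
</fields>
<!-- copyField feeds TITLE the raw source text; TITLE then applies
     its own (unstemmed) analysis chain -->
<copyField source="TITLE_*" dest="TITLE"/>
```

With a setup like this, a query against TITLE matches unstemmed terms across all languages, while per-language stemmed matching still requires querying the TITLE_* fields.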
Re: Deleting from SolrQueryResponse
: Hi, I was wondering if anyone has had luck deleting added documents from : SolrQueryResponse? I am subclassing StandardRequestHandler and after I run : the handle request body method (super.handleRequestBody(req, rsp);) I want : to filter out some of the hits. DocLists are immutable (if i remember correctly) but your handler can always remove the DocList from the SolrQueryResponse and then replace it with a new one after you've made your changes. one thing to keep in mind however is that post-processing a DocList to filter stuff out is almost never a good idea -- things get really convoluted when you think about dealing with pagination, and except for some really trivial use cases you can never know what your upper bound should be when deciding how many hits to request from the underlying IndexSearcher. You're usually better off restructuring your problem so that you can construct a Query/Filter/DocSet that you want to filter by first and *then* executing the search to generate the DocList in a single pass. PS: replying to your own message (or reposting it) to bump it up generally doesn't encourage replies any faster -- it just increases the volume of traffic on the list, and if anything antagonizes people and makes them less interested in responding. -Hoss
Re: complex OR query not working
: I want to execute the following query: : (spacegroupID:g*) OR (!userID:g*). First: ! is not a negation operator in the lucene/solr query parser : In above syntax (!userID:g*) gives results correctly. ...i don't think it's doing what you think it's doing. second: boolean queries can't be purely negative. they need to select something. the second clause of your main query is a boolean query with a single negative clause. try this instead... spacegroupID:g* (*:* -userID:g*) ...that will match any doc with a spacegroupId starting with the g character OR: any doc, except those with userID starting with the g character -Hoss
Re: Building Solr index with Lucene
On Wed, Jul 1, 2009 at 6:49 PM, Ben Bangert b...@groovie.org wrote: For performance reasons, we're attempting to build the index used with Solr Solr 1.4 has a binary communications format, and a StreamingUpdateSolrServer that massively improves indexing performance. You may want to revisit the decision to bypass Solr, especially as more indexing functionality emerges (update processors, etc). There is also EmbeddedSolrServer if you want something in-process. -Yonik http://www.lucidimagination.com
Re: Excluding characters from a wildcard query
: I'm not sure if you can do prefix queries with the fq parameter. You will : need to use the 'q' parameter for that. fq supports anything q supports ... with the QParser and local params options it can be any syntax you want (as long as there is a QParser for it) -Hoss
Re: Solr spring application context error
: I did try that. The problem is that you can't tell : FileSystemXmlApplicationContext to load with a different ClassLoader. why not? it subclasses DefaultResourceLoader which has the setClassLoader method Mark pointed out. -Hoss
Re: DocSlice andNotSize
On Thu, Jul 2, 2009 at 4:24 PM, Candide Kemmler cand...@palacehotel.org wrote: I have a simple question about the DocSlice class. I'm trying to use the (very handy) set operations on DocSlices and I'm rather confused by the way it behaves. I have 2 DocSlices: atlDocs, which, by looking at the debugger, holds a docs array of ints of size 1; the second DocSlice is btlDocs, with a docs array of ints of size 67. I know that atlDocs is a subset of btlDocs, so btlDocs.andNotSize(atlDocs) should really return 66. But it's returning 10. The short answer is that all of the set operations were only designed for DocSets (as opposed to DocLists). Yes, perhaps DocList should not have extended DocSet... -Yonik http://www.lucidimagination.com
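To make the DocSet-vs-DocList distinction concrete, here is a self-contained sketch (plain java.util, no Solr classes) of the set semantics andNotSize is meant to have when both operands are true sets:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AndNotSizeDemo {
    // Set-difference cardinality |b \ a| -- what andNotSize computes
    // when both operands are genuine sets of doc ids.
    public static int andNotSize(Set<Integer> b, Set<Integer> a) {
        Set<Integer> diff = new HashSet<>(b);
        diff.removeAll(a);
        return diff.size();
    }

    public static void main(String[] args) {
        Set<Integer> btl = new HashSet<>();
        for (int i = 0; i < 67; i++) btl.add(i);            // 67 docs
        Set<Integer> atl = new HashSet<>(Arrays.asList(5)); // 1 doc, a subset of btl
        System.out.println(andNotSize(btl, atl));           // prints 66
    }
}
```

A DocSlice, by contrast, is an ordered window over ranked results, so running the DocSet operations on it gives the surprising numbers seen above.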
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Otis, Your recipe does work: after copying an indexing field and excluding stop words, the MoreLikeThis query started fetching meaningful results. :) Just one issue remained. When I execute the query in this way:

String query = "q=id:1&mlt.fl=content...&fl=title+author+score";
HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://localhost:8080/solr/mlt");
get.setQueryString(query);
client.executeMethod(get);

... it works fine, bringing the results as an XML string. But when I use the SolrJ approach:

String query = "id:1";
solrQuery.setQuery(query);
solrQuery.setParam("mlt", true);
solrQuery.setParam("mlt.fl", "content");
solrQuery.setParam("fl", "title author score");
QueryResponse queryResponse = server.query(solrQuery);

the result contains only one doc with id=1 and no other more-like docs. In my solrconfig.xml, I have these settings:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"> ...
<requestHandler name="standard" class="solr.SearchHandler" default="true"> ...

I guess it all is a matter of syntax but I can't figure out what's wrong. Thank you very much (and again, thanks to Michael and Walter). Cheers, Sergey Michael Ludwig-4 wrote: SergeyG wrote: Can both queries - PhraseQuery and MoreLikeThis Query - be implemented in the same app, taking into account the fact that for the former to work the stop words list needs to be included, and this results in the latter putting stop words among the most important words? Why would the inclusion of a stopword list result in stopwords being of top importance in the MoreLikeThis query? Michael Ludwig -- View this message in context: http://www.nabble.com/Implementing-PhraseQuery-and-MoreLikeThis-Query-in-one-app-tp24303817p24314840.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Preparing the ground for a real multilang index
I believe the proper way is for the server to compute a list of accepted languages in order of preference: the web-platform language (e.g. the user setting), and the values in the Accept-Language HTTP header (which are from the browser or platform). Then you expand your query for surfing waves (say) to:
- phrase query: surfing waves exactly (^2.0)
- two terms, no stemming: surfing waves (^1.5)
- iterate through the languages and query for stemmed variants:
  - english: surf wav ^1.0
  - german: surfing wave ^0.9
- then maybe even try the phonetic analyzer (matched in a separate field probably)
I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual. paul On 2 Jul 2009, at 22:15, Otis Gospodnetic wrote: Michael, I think you really ought to know the language of the query (from a pulldown, from the browser, from user settings, somewhere) and pass that to the backend unless your queries are sufficiently long that their language can be identified. Here is a handy tool for playing with language identification: http://www.sematext.com/demo/lid/ You'll see how hard it is to guess the language of very short texts. :) You really want to avoid that huge OR. Often it makes no sense to OR in a multilingual context. Think about the word die (English and German, as you know) and what happens when you include that in an OR. And does it make sense to include a very language-specific word, say wunderbar, in an OR that goes across multiple/all languages? Funny, they have it listed at http://www.merriam-webster.com/dictionary/wunderbar Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Lackhoff mich...@lackhoff.de To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 2:58:41 PM Subject: Preparing the ground for a real multilang index As pointed out in the recent thread about stemmers and other language specifics I should handle them all in their own right. But how? 
The first problem is how to know the language. Sometimes I have a language identifier within the record, sometimes I have more than one, sometimes I have none. How should I handle the non-obvious cases? Given I somehow know record1 is English and record2 is German, I then need all my (relevant) fields for every language, e.g. I will have TITLE_ENG and TITLE_GER and both will have their respective stemmer. But what about exotic languages? Use a catch-all language without a stemmer? Now a user searches for TITLE:term and I don't know beforehand the language of term. Do I have to expand the query to something like TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there some sort of copyField for analyzed fields? Then I could just copy all the TITLE_* fields to TITLE and not bother with the language of the query. Are there any solutions that prevent an index with thousands of fields and dozens of ORed query terms? I know I will have to implement some better multilanguage support but would also like to keep it as simple as possible. -Michael
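Paul's first step, computing an ordered language list from the Accept-Language header, can be sketched in plain Java. This is a self-contained illustration (q-value parsing only; real header parsing has more edge cases such as wildcards and malformed input):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AcceptLanguage {
    // Parse e.g. "de-DE,de;q=0.9,en;q=0.7" into ["de", "en"]:
    // primary subtags, ordered by descending q-value, deduplicated.
    public static List<String> preferredLanguages(String header) {
        List<String[]> entries = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] bits = part.trim().split(";");
            // keep only the primary subtag ("de-DE" -> "de")
            String lang = bits[0].trim().split("-")[0].toLowerCase();
            double q = 1.0; // default quality per HTTP content negotiation
            for (int i = 1; i < bits.length; i++) {
                String b = bits[i].trim();
                if (b.startsWith("q=")) q = Double.parseDouble(b.substring(2));
            }
            entries.add(new String[] { lang, Double.toString(q) });
        }
        // stable sort by descending q-value
        entries.sort(Comparator.comparingDouble(
                (String[] e) -> Double.parseDouble(e[1])).reversed());
        List<String> langs = new ArrayList<>();
        for (String[] e : entries)
            if (!langs.contains(e[0])) langs.add(e[0]);
        return langs;
    }

    public static void main(String[] args) {
        System.out.println(preferredLanguages("de-DE,de;q=0.9,en;q=0.7")); // [de, en]
    }
}
```

The resulting list could then drive which stemmed TITLE_* fields to query first, per Paul's boosting scheme.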
Re: Deleting from SolrQueryResponse
hossman wrote: one thing to keep in mind however is that post-processing a DocList to filter stuff out is almost never a good idea -- things get really convoluted when you think about dealing with pagination and except for some really trivial use cases you can never know what your upper bound should be when deciding how many hits to request from underlying IndexSearcher. You're usually better off restructuring your problem so that you can construct a Query/Filter/DocSet that you want to filter by first and *then* executing the search to generate the DocList in a single pass. I want to edit the DocList in a custom SearchComponent to be executed after the QueryComponent. I do not require faceting, etc. If I do not want faceted results, will I still need to take any special steps not to break the DocList? hossman wrote: PS: replying to your own message (or reposting it) to bump it up generally doesn't encourage replies any faster -- it just increases the volume of traffic on the list, and if anything antagonizes people and makes them less interested in responding. Okay, sorry, I wasn't certain of the protocol on that. Thanks, Brett. -- View this message in context: http://www.nabble.com/Deleting-from-SolrQueryResponse-tp24266686p24315607.html Sent from the Solr - User mailing list archive at Nabble.com.
reindexed data on master not replicated to slave
Hi, When the index data was corrupted on the master instance, I wanted to wipe out all the index data and re-index everything. I was hoping the newly created index data would be replicated to the slaves, but it wasn't. Here are the steps I performed: 1. stop master 2. delete the directory 'index' 3. start master 4. disable replication on master 5. index all data from scratch 6. enable replication on master It seemed from the log file that the slave instances discovered that a new index was available, claimed that the new index was installed, and then tried to update the index properties; but looking into the index directory on the slaves, you will find that no index data files were updated or added, plus the slaves keep trying to fetch the new index. Here are some lines from the slave's log file: Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Starting replication process Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Number of files in latest snapshot in master: 69 Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Total time taken for download : 0 secs Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Conf files are not downloaded or are in sync Jul 1, 2009 3:59:33 PM org.apache.solr.handler.SnapPuller modifyIndexProps INFO: New index installed. Updating index properties... 
Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Master's version: 1246488421310, generation: 9 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Slave's version: 1246385166228, generation: 56 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Starting replication process Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Number of files in latest snapshot in master: 69 Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Total time taken for download : 0 secs Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller fetchLatestIndex INFO: Conf files are not downloaded or are in sync Jul 1, 2009 4:00:33 PM org.apache.solr.handler.SnapPuller modifyIndexProps INFO: New index installed. Updating index properties... Is this process incorrect, or is it a bug? If the process is incorrect, what is the right process? Thanks, J
Re: reindexed data on master not replicated to slave
Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: solr jay solr...@gmail.com To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 9:14:48 PM Subject: reindexed data on master not replicated to slave
Re: Implementing PhraseQuery and MoreLikeThis Query in one app
Sergey, Glad to hear the suggestion worked! I can't spot the problem (though I think you want to use a comma to separate the list of fields in the fl parameter value). I suggest you look at the servlet container logs and Solr logs and compare the requests that these two calls make. Once you see how the second one differs from the first, you will probably be able to figure out how to adjust the second one to produce the same results as the first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: SergeyG sgoldb...@mail.ru To: solr-user@lucene.apache.org Sent: Thursday, July 2, 2009 6:17:59 PM Subject: Re: Implementing PhraseQuery and MoreLikeThis Query in one app
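One difference worth checking (an assumption based on the two snippets, not confirmed in the thread): the raw HttpClient call targets the /mlt handler, while the SolrJ call goes to the default standard handler, where any more-like-this documents would land in the response's moreLikeThis section rather than in the main result list. Pointing SolrJ at the same handler (e.g. solrQuery.setQueryType("/mlt")), together with registrations along these lines in solrconfig.xml, would make the two requests comparable; the defaults shown here are illustrative:

```xml
<!-- sketch of the handler setup referenced in the post -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">content</str>
    <str name="mlt.mintf">1</str>
  </lst>
</requestHandler>
<requestHandler name="standard" class="solr.SearchHandler" default="true"/>
```

With the /mlt handler, the similar documents come back as the main result list, matching what the direct HTTP call returns.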
Re: Solr slave Heap space error and index size issue
I don't think the index should *suddenly* increase in size if you are just adding/updating/deleting documents. It is normal that it temporarily increases during optimization. 35GB for 1.5M docs sounds like a lot. You either have large fields, or you store them, or both? Maybe share your schema, show relevant solrconfig settings, list your index directory, share some of the stats from the Solr admin stats page, tell us about your JVM parameters, your RAM, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Vikash Kontia vikash.kon...@gmail.com To: solr-user@lucene.apache.org Sent: Sunday, June 28, 2009 10:46:57 PM Subject: Solr slave Heap space error and index size issue Hi All, I have 1 master and 1 slave machine at deployment. I am using a Solr 1.4 nightly build. My fresh index size is 35GB for 1.5 million documents, with approx 50 fields per document. I have taken care of omitNorms and stored fields in the schema. I have approx 1 update daily and I run a commit every hour. 5-6 days after a fresh index, the index size suddenly increased (no optimization in between) by 150GB, and then queries take a long time and a Java heap error comes. I ran optimize on this index. It takes a long time, increases the index size to more than 200GB, and never reports that the optimize completed. The merge factor is the default as given in the Solr build. To work around this issue I have to re-index almost every week. I think the issue is with frequent updates on the index. Please help me debug the issue, or am I missing something in the configuration? Thanks Vikash Kontia -- View this message in context: http://www.nabble.com/Solr-slave-Heap-space-error-and-index-size-issue-tp24247690p24247690.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: reindexed data on master not replicated to slave
it's the nightly build of May 10. I'll try the latest. Thanks, J On Thu, Jul 2, 2009 at 8:09 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis
Re: reindexed data on master not replicated to slave
Jay, I see "Updating index properties..." twice. This should happen rarely; in your case it should have happened only once, because you cleaned up the master only once. On Fri, Jul 3, 2009 at 6:09 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Jay, You didn't mention which version of Solr you are using. It looks like some trunk or nightly version. Maybe you can try the latest nightly? Otis -- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Preparing the ground for a real multilang index
On 03.07.2009 00:49 Paul Libbrecht wrote: [I'll try to address the other responses as well] I believe the proper way is for the server to compute a list of accepted languages in order of preferences. The web-platform language (e.g. the user-setting), and the values in the Accept-Language http header (which are from the browser or platform). All this is not going to help much because the main application is a scientific search portal for books and articles with many users searching cross-language. The most typical use case is a German user searching multilingually. So we might even get a multilingual search, e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for Accept headers or a language select field (it would be left on "any" in most cases). Other popular use cases are citations (in whatever language) cut and pasted into the search field. Then you expand your query for surfing waves (say) to: - phrase query: surfing waves exactly (^2.0) - two terms, no stemming: surfing waves (^1.5) - iterate through the languages and query for stemmed variants: - english: surf wav ^1.0 - german: surfing wave ^0.9 - then maybe even try the phonetic analyzer (matched in a separate field probably) This is an even more sophisticated variant of the multiple OR I came up with. Oh well... I think this is a common pattern on the web where the users, browsers, and servers are all somewhat multilingual. Indeed, and often users are not even aware of it; especially in a scientific context they use their native tongue and English almost interchangeably -- and they expect the search engine to cope with it. I think the best would be to process the data according to its language but not make any assumptions about the query language, and I am totally lost how to get a clever schema.xml out of all this. Thanks everyone for listening and I am still open for good suggestions to deal with this problem! -Michael