a bug in commit script?
Hi, guys,

It seems there's a small bug in the bin/commit script for Solr 1.2. I was able to run snapinstaller successfully to install the index and open a new searcher. (This is verified by querying the new docs through the web admin UI.) However, the snapinstaller script failed due to the commit script's failure. The commit.log shows:

    2007/09/19 23:54:43 started by solruser
    2007/09/19 23:54:43 command: /var/SolrHome/solr/bin/commit
    2007/09/19 23:54:43 commit request to Solr at http://localhost:6080/solr/update failed:
    2007/09/19 23:54:43 <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"><int name="status">0</int><int name="QTime">47</int></lst> </response>
    2007/09/19 23:54:43 failed (elapsed time: 0 sec)

I then checked the commit script, which has the following line:

    echo $rs | grep '<result.*status="0"' > /dev/null 2>&1

However, this is the older pattern of the response. The XML schema changed in 1.2. Should someone fix this?

--
Regards,
-Hui
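To make the mismatch concrete, here is a minimal Java sketch of a success check that accepts both response shapes. The class name, the method name and the second pattern are assumptions made for illustration; this is not the actual SOLR-282 change, only a way to see why the old pattern no longer matches the response logged above.

```java
import java.util.regex.Pattern;

class CommitResponseCheck {
    // Mirrors the grep pattern quoted from the commit script (older response shape).
    static final Pattern OLD_STYLE = Pattern.compile("<result.*status=\"0\"");
    // Assumed pattern for the 1.2-style response: status 0 in the responseHeader.
    static final Pattern NEW_STYLE =
            Pattern.compile("<int\\s+name=\"status\">\\s*0\\s*</int>");

    // True when either response shape reports status 0.
    static boolean isSuccess(String response) {
        return OLD_STYLE.matcher(response).find() || NEW_STYLE.matcher(response).find();
    }

    public static void main(String[] args) {
        String logged = "<?xml version=\"1.0\" encoding=\"UTF-8\"?> <response> "
                + "<lst name=\"responseHeader\"><int name=\"status\">0</int>"
                + "<int name=\"QTime\">47</int></lst> </response>";
        System.out.println("old pattern matches: " + OLD_STYLE.matcher(logged).find());
        System.out.println("treated as success:  " + isSuccess(logged));
    }
}
```

The same idea carries over to the shell script: the grep has to look for the status value inside the responseHeader element instead of a status attribute on a result element.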
Re: a bug in commit script?
: It seems there's a small bug in the bin/commit script for solr 1.2.

A fix was already committed to the trunk for this as part of SOLR-282 (but there doesn't seem to be a note about it in the changelog)

-Hoss
Re: Solr Index - no segments* file found in org.apache.lucene.store.FSDirectory
: Does this case arise when i do a search when there is no index?? - If yes,
: then i guess the Exception can be made more meaningful.

in normal operation, i believe this shouldn't happen -- Solr will create the index for you on startup if there isn't one. You're attempting a fairly advanced / non-trivial approach where you aren't letting Solr manage the index for you.

you haven't given us any idea what the code you are using to build the index looks like -- but if i had to guess, i would bet that somewhere in there you are directly manipulating the file system directory -- not just the Lucene FSDirectory. that's the only situation i can think of where that index directory would ever exist but be completely empty -- the low level Lucene APIs should create a segments file as soon as you start adding docs, even if you haven't closed the writer yet.

-Hoss
Re: How can i make a distribute search on Solr?
> Maybe I got this wrong...but isn't this what mapreduce is meant to deal with? eg,
>
> 1) get the job (a query)
> 2) map it to workers (servers that provide search results from their own indexing)
> 3) wait for the results from all workers that reply within acceptable timeframe.
> 4) comb through the lot of results from all workers, reduce them according to your own biz rules (eg, remove dupes, sort them by quality / priority... here possibly relying on the original parameters of the query in 1)
> 5) return the reduced results to the frontend.

That seems to be how Sphinx works: http://www.sphinxsearch.com/doc.html#distributed

Of course, the details of this are far over my head for either system, so I don't really know if that's a sensible way of doing things or not.

Ciao,
--
David N. Welton
http://www.welton.it/davidw/
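The five steps quoted above can be sketched as a plain scatter-gather in Java. This is only an illustration under stated assumptions: the workers are hypothetical callables that return a map of document id to score, the "biz rule" is simply "keep the best score per id", and none of it reflects how Solr or Sphinx actually implement distributed search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ScatterGather {
    /** Steps 2-5: fan the query out, wait up to timeoutMs, merge, return ids by score. */
    static List<String> search(List<Callable<Map<String, Double>>> workers, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(workers.size());
        Map<String, Double> merged = new HashMap<>();
        try {
            // invokeAll cancels any worker that has not answered before the deadline.
            for (Future<Map<String, Double>> f
                    : pool.invokeAll(workers, timeoutMs, TimeUnit.MILLISECONDS)) {
                if (f.isCancelled()) continue; // too slow: leave this worker out
                try {
                    // Reduce: drop duplicate ids, keeping the best score for each.
                    f.get().forEach((id, score) -> merged.merge(id, score, Double::max));
                } catch (ExecutionException e) {
                    // A failed worker simply contributes no hits.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        List<String> ids = new ArrayList<>(merged.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> merged.get(id)).reversed());
        return ids;
    }

    /** Two hypothetical workers, each answering from its own index. */
    static List<String> demo() {
        List<Callable<Map<String, Double>>> workers = new ArrayList<>();
        workers.add(() -> Map.of("a", 1.0, "b", 0.5));
        workers.add(() -> Map.of("b", 0.9, "c", 0.2));
        return search(workers, 2000);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that scores from independent indexes are not necessarily comparable, so a real merge step would need more care than taking the maximum.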
Re: Strange behavior when searching with accents
On Thu, 2007-09-20 at 10:11 +0200, Thierry Collogne wrote:
> Hello,
>
> We are experiencing some strange behavior while searching with words containing accents. We are using two examples: rené and matthé.
>
> When we search for rené or for rene, we get the same results, so that is ok. But when we search for matthé or for matthe, we get two totally different results.
>
> Can someone tell me why this happens? We would like the results to be the same.

That highly depends on your schema. Do you use <filter class="solr.ISOLatin1AccentFilterFactory"/>?

I am using the following and it works like a charm:

<fieldType name="stringSimilar" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.LowerCaseTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>

HTH

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: Strange behavior when searching with accents
On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote: ..when we search for matthé or for matthe, we get two totally different results The analyzer admin tool should help you find out what's happening, see http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9 -Bertrand
Re: Strange behavior when searching with accents
We are using this schema definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I will take a look at the analyzer tool. Thank you both for the quick response.
Re: Strange behavior when searching with accents
I have entered the matthé term in the analyzer, but as far as I understand, it should be ok. I have made some screenshots of the results:

http://farm2.static.flickr.com/1407/1412619772_0b697789cd_o.jpg
http://farm2.static.flickr.com/1245/1412619774_3351b287bc_o.jpg

I find it strange that the second screenshot doesn't give any matches. Can someone take a look at them and perhaps clarify why it does not work?

Thank you.
Re: Strange behavior when searching with accents
On Thu, 2007-09-20 at 13:33 +0200, Thierry Collogne wrote:
> We are using this schema definition

Thierry, try moving solr.ISOLatin1AccentFilterFactory up the filter chain, like:

...
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
...

for both indexing and query. This way you make sure that all accents are gone before you do any further filtering.

You may need to reindex all documents to make sure you are not still using the old index.

HTH

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: Strange behavior when searching with accents
On Thu, 2007-09-20 at 14:01 +0200, Thierry Collogne wrote:
> I have entered the matthé term in the analyzer, but as far as I understand, it should be ok. I have made some screenshots of the results.
>
> I find it strange that the second screenshot doesn't give any matches. Can someone take a look at them and perhaps clarify why it does not work?

See my other response, but in the 2nd screenshot you have changed the query field to the non-accented spelling. Also, you will want to use the verbose output option to analyze this better.

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: Strange behavior when searching with accents
On 9/20/07, Thierry Collogne [EMAIL PROTECTED] wrote:
> ...Thank you very much. Moving the <filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it

Yes, the problem was the EnglishPorterFilterFactory running before the accent removal: the stemmer doesn't know about accents, so no stemming occurred on matthé, whereas matthe was stemmed to matth.

BTW, your rené example makes me think you're indexing French. If that's the case, you might want to use a stemmer configured for that language, for example:

<filter class="solr.SnowballPorterFilterFactory" language="French"/>

-Bertrand
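The ordering effect described here can be reproduced with a small, self-contained Java sketch. Both stages are stand-ins chosen for illustration: the "stemmer" is a toy that drops a trailing ASCII e (it is not the Porter algorithm), and the accent folding uses Unicode decomposition instead of Solr's ISOLatin1AccentFilter. Only the order of the two stages matters for the point being made.

```java
import java.text.Normalizer;

class FilterOrderDemo {
    // Accent folding via Unicode decomposition; a stand-in for the accent filter.
    static String foldAccents(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Toy stemmer (NOT Porter): drops one trailing ASCII 'e' and ignores accented letters.
    static String toyStem(String token) {
        return token.endsWith("e") ? token.substring(0, token.length() - 1) : token;
    }

    // Original chain order: stemmer first, accent removal last.
    static String stemThenFold(String token) {
        return foldAccents(toyStem(token));
    }

    // Fixed chain order: accent removal first, then the stemmer.
    static String foldThenStem(String token) {
        return toyStem(foldAccents(token));
    }

    public static void main(String[] args) {
        String accented = "matth\u00e9"; // "matthé"
        System.out.println("stem, then fold: "
                + stemThenFold(accented) + " vs " + stemThenFold("matthe"));
        System.out.println("fold, then stem: "
                + foldThenStem(accented) + " vs " + foldThenStem("matthe"));
    }
}
```

With the stemmer first, the accented and unaccented spellings end up as different terms; with folding first, both go through the stemmer in the same form and produce the same term.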
Re: Strange behavior when searching with accents
Thorsten,

Thank you very much. Moving the <filter class="solr.ISOLatin1AccentFilterFactory"/> up in the chain fixed it.
Re: Term extraction
Not sure if this is in the same league or not, but Yahoo offers a term extraction web service:
http://developer.yahoo.com/search/content/V1/termExtraction.html

On 9/20/07, Grant Ingersoll [EMAIL PROTECTED] wrote:
> You might investigate some tools like Alias-i's LingPipe or do some searches for phrase recognition software, etc.
>
> -Grant
>
> On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:
>> I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and values returned by the mlt.interestingTerms parameter, and so far this approach has worked well. However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries so matching is not an issue).
>>
>> I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.
>>
>> Has anybody else dealt with this problem before, or is anyone able to offer any insights into achieving the desired results?
>>
>> Thanks in advance,
>> Pieter
>
> --
> Grant Ingersoll http://lucene.grantingersoll.com
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
Michael Kimsal http://webdevradio.com
Re: Filter by Group
Thanks, Pieter. I'll go for that then.

Mark

On Sep 19, 2007, at 10:15 PM, Pieter Berkel wrote:
> Sounds like you're on the right track. If your groups overlap (i.e. a document can be in group A and B), then you should ensure your groups field is multivalued.
>
> If you are searching for foo in documents contained in group A, then it might be more efficient to use a filter query (fq) like:
>
> q=foo&fq=groups:A
>
> See the wiki page on common query parameters for more info:
> http://wiki.apache.org/solr/CommonQueryParameters#head-6522ef80f22d0e50d2f12ec487758577506d6002
>
> cheers,
> Piete
>
> On 20/09/2007, mark angelillo [EMAIL PROTECTED] wrote:
>> Hey all,
>>
>> Let's say I have an index of one hundred documents, and these documents are grouped into 4 groups A, B, C, and D. The groups do in fact overlap. What would people recommend as the best way to apply a search query and return only the documents that are in group A? Also, how about if we run the same search query but return only those documents in groups A, C and D?
>>
>> I imagine that I could do this by indexing a text field populated with the group names and adding something like groups:A to the query, but I'm wondering if there's a better solution.
>>
>> Thanks in advance,
>> Mark

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.7 million ratings and counting...
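For the second question in the quoted message (groups A, C and D), the filter query can hold a boolean expression on the same field. Below is a minimal Java sketch that assembles the request parameters; the class and method names are made up for illustration, the field name groups is taken from the thread, and URLEncoder.encode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FilterQueryUrl {
    // Builds "q=<query>&fq=groups:(A OR C OR D)" with both values URL-encoded.
    static String params(String userQuery, String... groups) {
        String fq = "groups:(" + String.join(" OR ", groups) + ")";
        return "q=" + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
                + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Only group A, as in the reply above.
        System.out.println(params("foo", "A"));
        // Documents in any of A, C or D.
        System.out.println(params("foo", "A", "C", "D"));
    }
}
```

Keeping the group restriction in fq rather than in q means it does not affect relevance scoring, and Solr can cache the filter independently of the main query.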
Re: Strange behavior when searching with accents
We are indexing both French and Dutch. I will take a look at SnowballPorterFilterFactory later, but thanks for the advice.
Re: How can i make a distribute search on Solr?
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
> Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?

Not really... you could force a *lot* of different problems into map-reduce (that's sort of the point... being able to automatically parallelize a lot of different problems). It really isn't the best fit though, and would end up being much slower than a custom job.

Then there is the issue that the way map-reduce is implemented (like Hadoop) is also tuned for longer running batch jobs on huge data (temporary files are used, external sorts, initial input, final output is via files, etc).

Check out the Google map-reduce paper - they don't use it for their search side either.

Things are already progressing in the distributed search area:
https://issues.apache.org/jira/browse/SOLR-303

Hopefully I'll have time to dig into it more myself in a few weeks.

-Yonik
Re: Term extraction
On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
> However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms...

Microsoft Office => microsoft_office
Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use underscore to join words.

-Yonik
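To show what a multi-token mapping does to a token stream, here is a toy Java sketch of phrase replacement over whitespace tokens. It is a simplified illustration with hypothetical names and only two-word phrases; it does not reproduce SynonymFilter's actual matching logic, token positions or configuration options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhraseSynonyms {
    // Lowercased phrase -> single replacement token, mirroring the rules above.
    static final Map<String, String> RULES = new HashMap<>();
    static {
        RULES.put("microsoft office", "microsoft_office");
        RULES.put("bill gates", "bill_gates");
        RULES.put("william gates", "bill_gates");
    }
    // Longest phrase (in tokens) that appears in RULES.
    static final int MAX_PHRASE_LEN = 2;

    static List<String> apply(String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            boolean matched = false;
            // Try the longest candidate phrase starting at i first, then shorter ones.
            for (int len = Math.min(MAX_PHRASE_LEN, tokens.length - i);
                    len >= 2 && !matched; len--) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                String replacement = RULES.get(phrase);
                if (replacement != null) {
                    out.add(replacement);
                    i += len;
                    matched = true;
                }
            }
            if (!matched) {
                out.add(tokens[i++]); // no phrase starts here: keep the token as it is
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(apply("Bill Gates uses Microsoft Office daily"));
    }
}
```

The underscore-joined output tokens are also why the warning about the word-delimiter filter matters: a filter that splits on underscores would break the joined phrase apart again.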
Re: Strange behavior when searching with accents
On Thu, 2007-09-20 at 15:27 +0200, Bertrand Delacretaz wrote:
> BTW, your rené example makes me think you're indexing French. If that's the case, you might want to use a stemmer configured for that language, for example:
>
> <filter class="solr.SnowballPorterFilterFactory" language="French"/>

Bertrand, does the French Snowball work fine?

A colleague of mine exchanged mails with Porter about the Spanish filter, and he came to the conclusion that it is not really working well for Spanish:

"So -orio on the whole changes meaning too much (acceso = access, accessorio = accessory differ as much in Spanish as English); -atorio similarly (aclarar = to rinse, clear (in a very general sense), brighten up; aclaratorio = explanatory).

Diminutives, augmentatives usually fall under (a) and (c). -illo, -ote, -isimo are in this category.

-al and -iz look like plausible candidates for ending removal, but, unlike their English counterparts, removing them makes little difference or improvement. Similarly with -ion removal after -s.

There is a difficulty with pure vowel endings, and the stemmer can't always get this right. So in English 'academic' is stemmed to 'academ' but 'academy' does not lose the final -y (or -i). This explains the residual vowels with -io, -ia endings etc."

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Re: How can i make a distribute search on Solr?
On Thu, 20 Sep 2007 09:58:17 +0200 David Welton [EMAIL PROTECTED] wrote:
> That seems to be how Sphinx works: http://www.sphinxsearch.com/doc.html#distributed
>
> Of course, the details of this are far over my head for either system, so I don't really know if that's a sensible way of doing things or not.

thanks for the pointer. it does seem that it's pretty much what I had in mind... but it doesn't seem to be based on Lucene (which I particularly like, especially for the community...) ...

cheers,
_
{Beto|Norberto|Numard} Meijome

"The freethinking of one age is the common sense of the next."
   Matthew Arnold

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: How can i make a distribute search on Solr?
On Thu, 20 Sep 2007 09:53:46 -0400 Yonik Seeley [EMAIL PROTECTED] wrote:
> On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
>> Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?
>
> Not really... you could force a *lot* of different problems into map-reduce (that's sort of the point... being able to automatically parallelize a lot of different problems). It really isn't the best fit though, and would end up being much slower than a custom job.

good point.. i wondered whether the whole sorting/whatever wasn't going to make it far slower than something custom. I don't care about mapreduce in particular, but I do care about the effect: n indexers / searchers all fulfilling their part of the overall search results.

> Then there is the issue that the way map-reduce is implemented (like hadoop) is also tuned for longer running batch jobs on huge data (temporary files are used, external sorts, initial input, final output is via files, etc).

I see, didn't know this.

> Check out the google map-reduce paper - they don't use it for their search side either.

yeah, need to :)

> Things are already progressing in the distributed search area:
> https://issues.apache.org/jira/browse/SOLR-303
>
> Hopefully I'll have time to dig into it more myself in a few weeks.

excellent, thanks
_
{Beto|Norberto|Numard} Meijome

"He uses statistics as a drunken man uses lamp-posts ... for support rather than illumination."
   Andrew Lang (1844-1912)

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RE: Index/Update Problems with Solrj/Tomcat and Larger Files
I am running against 1.2. Where would I get the 1.3-dev version?

I will try different versions of Tomcat and/or Jetty. Thanks for all your suggestions, I'll let you know.

-----Original Message-----
From: Ryan McKinley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 19, 2007 8:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Index/Update Problems with Solrj/Tomcat and Larger Files

> However, if I go to the tomcat server and restart it after I have issued the process command, the program returns and the documents are all posted correctly! Very strange behavior... am I somehow not closing the connection properly?

What version is the solr you are connecting to? 1.2 or 1.3-dev? (I have not tested against 1.2)

Does this only happen with tomcat? If you run with jetty do you get the same behavior? (again, just stabs in the dark)

If you can make a small repeatable problem, post it in JIRA and I'll look into it.

ryan
Re: Strange behavior when searching with accents
On 9/20/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
> ...Bertrand, does the French Snowball work fine?...

I've seen some weirdnesses, like tennis and tenir (which means "to hold") both stemmed to ten, but in all of our (simple) tests it was ok.

The application where we're using it does not require high precision though, so it looked good enough and we didn't create very extensive tests for it.

-Bertrand
Solr and FieldCache
I have an index with several fields, but just one stored: ID (string, unique). I need to access that ID field for each of the top "nodes" docs in my results (this is done inside a handler I wrote). The code looks like:

    Hits hits = searcher.search(query);
    for (int i = 0; i < nodes; i++) {
        id[i] = hits.doc(i).get(ID);
        score[i] = hits.score(i);
    }

I noticed that this retrieval is slow. If I use the FieldCache, like:

    id[i] = FieldCache.DEFAULT.getStrings(searcher.getReader(), ID)[hits.id(i)];

then after the first execution (the initialization of the cache takes some time), it seems to run much faster.

But what happens when Solr reloads the index (after a commit, or an optimize, for example)? Will it refresh the cache with the new reader (in the warmup process?), or will it be the first query execution of that code (with the new reader) that forces the refresh? (This could mean that every first query after a reload will be slower.)

Is there any way to tell Solr to cache this ID field and warm it up when needed?

Thanks,
Walter
Re: Solr and FieldCache
At 5:30 PM +0200 9/20/07, Walter Ferrara wrote:
> I have an index with several fields, but just one stored: ID (string, unique). I need to access that ID field for each of the top "nodes" docs in my results (this is done inside a handler I wrote). The code looks like:
>
>     Hits hits = searcher.search(query);
>     for (int i = 0; i < nodes; i++) {
>         id[i] = hits.doc(i).get(ID);
>         score[i] = hits.score(i);
>     }
>
> I noticed that this retrieval is slow. If I use the FieldCache, like:
>
>     id[i] = FieldCache.DEFAULT.getStrings(searcher.getReader(), ID)[hits.id(i)];

I assume you're putting FieldCache.DEFAULT.getStrings(searcher.getReader(), ID) in an array outside the loop, saving 2 redundant method calls per iteration.

> then after the first execution (the initialization of the cache takes some time), it seems to run much faster.

Do note that FieldCache.DEFAULT is caching the indexed values, not the stored values. Since your field is an ID you are probably indexing it in such a way that both are identical, e.g. with KeywordTokenizer, so you're not seeing a difference.

> But what happens when Solr reloads the index (after a commit, or an optimize, for example)? Will it refresh the cache with the new reader (in the warmup process?), or will it be the first query execution of that code (with the new reader) that forces the refresh? (This could mean that every first query after a reload will be slower.)

It is refreshed by Lucene the first time the FieldCache array is requested from the new IndexReader.

> Is there any way to tell Solr to cache this ID field and warm it up when needed?

Absolutely, just put a warmup query in solrconfig.xml which makes a request that invokes FieldCache.DEFAULT.getStrings on that field.

Simplest would probably be to invoke your custom handler, perhaps passing arguments that limit it to processing only one document so as to limit the data which gets cached; since getStrings returns the entire array, one pass through your loop is fine.

If that's not easy with your handler, you could achieve the same effect by setting up a handler which facets on the ID field, sorting by ID (facet.sort=false), and only asks for a single value (facet.limit=1). (The entire id[docid] array will get scanned to count references to that ID, but that ensures it gets paged in.)

- J.J.
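Two points from this reply, fetching the array once outside the loop and the cache being filled on the first request for a new reader, can be illustrated without Lucene on the classpath. The Java sketch below is a stand-in under stated assumptions: PerReaderCache is a hypothetical class, a plain Object plays the role of the IndexReader, and the weakly keyed map is one common way to build such a cache rather than a description of FieldCache internals.

```java
import java.util.Map;
import java.util.WeakHashMap;

class PerReaderCache {
    // One String[] per reader, keyed weakly so an entry can go away with its reader.
    private final Map<Object, String[]> cache = new WeakHashMap<>();
    int loads = 0; // how many times the expensive load actually ran

    synchronized String[] getStrings(Object reader, int maxDoc) {
        return cache.computeIfAbsent(reader, r -> {
            loads++;
            String[] ids = new String[maxDoc];
            for (int doc = 0; doc < maxDoc; doc++) {
                ids[doc] = "id-" + doc; // stand-in for reading the indexed values
            }
            return ids;
        });
    }

    public static void main(String[] args) {
        PerReaderCache fieldCache = new PerReaderCache();
        Object reader = new Object();   // stands in for searcher.getReader()
        int[] hitDocIds = {4, 0, 2};    // stands in for hits.id(i)

        // Fetch the array once, outside the loop, then index into it per hit.
        String[] ids = fieldCache.getStrings(reader, 5);
        for (int docId : hitDocIds) {
            System.out.println(ids[docId]);
        }

        // A new reader (for example after a commit) triggers a fresh load on first use.
        fieldCache.getStrings(new Object(), 5);
        System.out.println("loads = " + fieldCache.loads);
    }
}
```

A warmup query plays the role of that first call for the new reader, so the load cost is paid before user queries arrive.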
Re: Solr and FieldCache
About the stored/indexed difference: ID is a string (solr.StrField), so FieldCache gives me what I need. I'm just wondering: as this cached object could (theoretically) be pretty big, do I need to worry about OOM? I know that FieldCache uses weak maps, so I presume the cached array for the older reader(s) will be gc-ed when the reader is no longer referenced (i.e. when Solr loads the new one, after its warmup and so on); is that right? Thanks
Faceting question
I've been struggling with this a bit so here goes: I'm using faceting to get some results. I also want to get another field - the id field along with it. Is it possible to get that somehow in the facet results? Thanks!
Re: Solr and FieldCache
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
> I'm just wondering, as this cached object could be (theoretically) pretty big, do I need to be aware of some OOM? I know that FieldCache uses weak maps, so I presume the cached array for the older reader(s) will be gc-ed when the reader is no longer referenced (i.e. when solr loads the new one, after its warmup and so on), is that right?
Right. You will need room for two entries (one for the current searcher and one for the warming searcher). -Yonik
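Yonik's point about needing room for two entries can be pictured with a small model. This is a sketch under stated assumptions, not Lucene's FieldCache implementation: PerReaderCache and its methods are hypothetical names, a plain Object stands in for an IndexReader, and the cache is modeled as a WeakHashMap keyed by reader, as Walter describes.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class PerReaderCache {
    // Weak keys: an entry becomes collectable once its reader is unreachable.
    private final Map<Object, String[]> cache = new WeakHashMap<>();
    private int loads = 0;

    /** Returns the per-document array for this reader, building it on first request. */
    public synchronized String[] getStrings(Object reader, int maxDoc) {
        String[] values = cache.get(reader);
        if (values == null) {
            values = new String[maxDoc];
            for (int i = 0; i < maxDoc; i++) {
                values[i] = "id-" + i; // stand-in for reading terms from the index
            }
            cache.put(reader, values);
            loads++;
        }
        return values;
    }

    public synchronized int loadCount() { return loads; }

    public synchronized int entryCount() { return cache.size(); }

    public static void main(String[] args) {
        PerReaderCache fieldCache = new PerReaderCache();
        Object[] readers = { new Object(), new Object() }; // current and warming reader
        fieldCache.getStrings(readers[0], 3); // first request builds the array
        fieldCache.getStrings(readers[0], 3); // second request is served from the cache
        fieldCache.getStrings(readers[1], 3); // warm-up request builds a second array
        System.out.println("loads=" + fieldCache.loadCount()
                + " entries=" + fieldCache.entryCount()
                + " readers=" + readers.length);
    }
}
```

While both readers are still referenced, both arrays are held at once, which is the peak memory case; the older entry only becomes eligible for collection after the old reader is released.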
Re: rsync start and enable for multiple solr instances within one tomcat
Ok, I should correct myself. For #1, I think we need to 1) configure a different port for each solr home dir (since they run on the same host); 2) run the rsyncd-start script under each solr home's bin dir. (Btw, just to be clear, I understand that we should run rsyncd-start after rsyncd-enable.) Can someone confirm my understanding? Does the #3 question suggest a hard-coded solr that shouldn't be there? Thanks, -Hui
On 9/19/07, Yu-Hui Jin [EMAIL PROTECTED] wrote:
> Hi, there, So we are using Tomcat's JNDI method to set up multiple solr instances within a tomcat server. Each instance has a solr home directory. Now we want to set up collection distribution for all these solr home indexes. My understanding is:
> 1. we only need to run rsyncd-start once, using the script under any of the solr home dirs.
> 2. we need to run each of the rsyncd-enable scripts under the solr homes' bin dirs.
> 3. the wiki page at http://wiki.apache.org/solr/SolrCollectionDistributionScripts keeps referring to solr/xxx. Is this solr the example solr home dir? If so, would it be hard-coded in any of the scripts? For example, I saw in snappuller line 226 (solr 1.2):
>   ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip
> Is the above solr a hard-coded solr home name? If so, it's not desirable since we have multiple solr homes with different names. If not, what is this solr?
> thanks, -Hui
Re: rsync start and enable for multiple solr instances within one tomcat
: 1) config different port for each solr home dir (since they run on the same
: host);
you mean a different rsync port, right? ... yes, the scripts as distributed assume that each rsync daemon will be dedicated to a single solr instance ... the idea being that even if you have 12 Solr instances running on one servlet container port, you have 12 separate rsync ports so you can start/stop and enable/disable them independently when doing index rebuilds, etc...
: 2) run rsyncd-start script under each of the solr home's bin dir.
: (btw, just to make clear, we should run rsyncd-start after rsyncd-enable that
: I understand.)
correct, rsyncd-enable just sets the flag file so that rsyncd-start will function ... the idea being that you can install rsyncd-start in such a way that it will run whenever your port is started up, or whenever your box is booted, but disable it from happening without removing the script from those places.
: Can someone confirm my understanding? Does the #3 question suggest a
: hard-coded solr that shouldn't be?
solr/conf, solr/bin, solr/data, solr/logs ... all assume your solr home directory is named solr/, but that's not a requirement. It's a pretty pervasive documentation shortcut that could be changed if someone wanted to be systematic about it, but I don't think it's all that bad since that's a decent common case. -Hoss
Re: Solr and FieldCache
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
> I have an index with several fields, but just one stored: ID (string, unique). I need to access that ID field for each of the top nodes docs in my results (this is done inside a handler I wrote); the code looks like:
>   Hits hits = searcher.search(query);
>   for (int i = 0; i < nodes; i++) { id[i] = hits.doc(i).get(ID); score[i] = hits.score(i); }
What is the higher level use-case you are trying to address that makes it necessary to write a plugin? -Yonik
Re: rsync start and enable for multiple solr instances within one tomcat
Thanks, Hoss. For the last question, yes, I understand now that it's referring to whatever solr home we have named. However, the last part of my question still stands: it looks suspicious that the solr string is coded directly in the script (elsewhere the scripts usually use ${solr_root} to get to specific dirs). I've pasted the line again below. I saw in snappuller line 226 (solr 1.2):
  ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/ ${data_dir}/${name}-wip
Is the above solr a hard-coded solr home name? If so, it's not desirable since we have multiple solr homes with different names. If not, what is this solr? Thanks, -Hui
Re: Faceting question
: I'm using faceting to get some results. I also want to get another field - : the id field along with it. Is it possible to get that somehow in the facet : results? you're going to have to elaborate on what it is you are trying to do ... i genuinely have no idea what you are asking (and i think i'm usually pretty good at reading between the lines and guessing what people mean). -Hoss
RE: Faceting question
You mean, when it says that facet term foo has 10 documents, you want those 10 ids? I think that will require a further query from your application. Peter
Re: rsync start and enable for multiple solr instances within one tomcat
ok. Hoss. I think I'll believe you since nobody has raised any issue running the script. And I'm about to try it out shortly with different solr home names. So just to help my knowledge, where does this virtual setting of this solr string happen? Should it be in some config file or sth? thanks, -Hui
On 9/20/07, Chris Hostetter [EMAIL PROTECTED] wrote:
> : home we have named. However, there's still the last part of my question
> : that feels suspicious why the solr string is directly coded in the script
> : (unlike other cases they usually use ${solr_root} to get to specific dirs.
> : ) I pasted this line again below:
> sorry ... i didn't realize you were talking about the script, i thought you were talking about the docs.
> : I saw in snappuller line 226 (solr 1.2):
> :
> : ${stats} rsync://${master_host}:${rsyncd_port}/solr/${name}/
> : ${data_dir}/${name}-wip
> :
> : Is the above solr a hard-coded solr home name? If so, it's not desirable
> I'm not 100% positive, but I believe that is just an arbitrary virtual path relative to the root of the rsyncd server ... it could be anything, as long as snappuller and the rsyncd agree on what it is, so it's hardcoded to be solr. If we used ${solr_root} then the slaves and the master would have to use the exact same solr home directory. -Hoss
Re: rsync start and enable for multiple solr instances within one tomcat
: So just to help my knowledge, where does this virtual setting of this solr : string happen? Should it be in some config file or sth? rsyncd-start creates an rsync config file on the fly ... much of it is constants, but it fills in the rsync port using a variable from your config. -Hoss
Re: rsync start and enable for multiple solr instances within one tomcat
The solr that you are referring to in your third question is the name of the rsync area which is mapped to the solr data directory. This is defined in the rsyncd configuration file, which is generated on the fly as Chris has pointed out. Take a look at rsyncd-start. snappuller rsyncs the index from this 'solr' area (the command you have quoted) on the master. The name of the rsync area has nothing to do with the name of the index. We set up this area for rsyncd so that one is restricted to this area when trying to access files on the master through rsyncd. The name of the rsyncd area does not have to be 'solr'. It can be anything as long as the value in rsyncd-start matches the value in snappuller. Bill
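The relationship Bill describes, one area name shared by the rsyncd configuration and the pull URL, can be sketched as below. This is only an illustration under assumptions, not the contents of rsyncd-start or snappuller: the class RsyncModuleSketch, its methods, the host name, the port, the paths and the snapshot name are all hypothetical. The [name] header and path setting are standard rsyncd.conf syntax for what Bill calls an area (rsync calls it a module).

```java
public class RsyncModuleSketch {
    /** Area (module) stanza as it might appear in a generated rsyncd.conf. */
    public static String moduleStanza(String module, String dataDir) {
        return "[" + module + "]\n    path = " + dataDir + "\n";
    }

    /**
     * Source URL a slave would pull from. Only the area name appears in it,
     * never the master's real directory layout.
     */
    public static String sourceUrl(String masterHost, int rsyncdPort, String module, String snapshot) {
        return "rsync://" + masterHost + ":" + rsyncdPort + "/" + module + "/" + snapshot + "/";
    }

    public static void main(String[] args) {
        String module = "solr"; // arbitrary label; both sides just have to agree on it
        // Two solr homes with different names on the same master, each with its own rsyncd port.
        System.out.print(moduleStanza(module, "/var/SolrHomeA/data"));
        System.out.println(sourceUrl("master.example.com", 18001, module, "snapshot.A"));
        System.out.print(moduleStanza(module, "/var/SolrHomeB/data"));
        System.out.println(sourceUrl("master.example.com", 18002, module, "snapshot.B"));
    }
}
```

Under these assumptions, the solr home name only ever shows up in the path setting on the master, which is why differently named solr homes can share the same area name as long as each has its own rsyncd port.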
Re: a bug in commit script?
That would be my bad. I noticed the problem while fixing SOLR-282, which is not related. I fixed both problems instead of opening a different bug for the response format issue. I will update the change log. Bill
clarification needed for the Ranking score
Hi, I need a clarification regarding Solr ranking. Consider the scenario of searching for courses based on the following relevance:
a. Courses with the term in the courseTitle, courseTag and courseDescription would appear first.
b. Courses with the term in the courseTitle and courseDescription would appear next.
c. Courses with the term only in the courseTitle appear next.
d. Courses with the term only in the courseDescription appear next.
e. Courses with the term only in the courseTag appear last.
Let me know if my understanding is correct with the following solution:
+ (basequery) courseTitle^1 courseTag^1000 courseDescription^100; courseTitle asc, courseDescription asc,courseTag asc;
How do we set the relevancy while performing a search? Is there any configuration to set it in the solrconfig files? Also, how do we set the term proximity? Could you clarify? Thanks in advance. Regards, Dilip TS