can't overwrite and can't delete by id
Recently I found out that I can't delete a doc by id, or overwrite a doc, in my Solr index, which runs Solr 4.4.0.

Say I have a doc http://pastebin.com/GqPP4Uw4 (I use pastebin here to make it easier to view), and I tried to add a dynamic field rank_ti to it, to make it look like http://pastebin.com/dGnRRwux

The funny thing is that after I inserted the new version of the doc, if I query

  curl 'localhost:8995/solr/select?wt=json&indent=true&q=id:28583776'

the two versions above appear randomly. And after half a minute, version 2 disappears, which means the update never got written to disk. I also tried to delete by id with rsolr, and the doc just can't be removed. Inserting new docs into the index works fine, though.

Has anyone run into this strange behavior before?

Thanks,
Ming
Re: can't overwrite and can't delete by id
BTW: it's a 4-shard SolrCloud cluster using ZooKeeper 3.3.5.

On Fri, Nov 22, 2013 at 11:07 AM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
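For reference, a minimal SolrJ sketch of the delete-by-id path being attempted here; the endpoint and id are taken from the post, and using SolrJ in place of rsolr is an assumption. Note that a deletion only becomes visible after a commit:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class DeleteById {
    public static void main(String[] args) throws Exception {
      // Endpoint from the post; adjust host/port/collection as needed.
      HttpSolrServer server = new HttpSolrServer("http://localhost:8995/solr");
      server.deleteById("28583776"); // delete the document by its unique key
      server.commit();               // the delete is only visible after a commit
      server.shutdown();
    }
  }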
Re: Problem of facet on 170M documents
Erick,

It could have more than 4M distinct values. The purpose of this facet is to display the most frequent urls, say the top 500, to users.

Sascha, thanks for the info. I will look into the facet.threads option.

Mingfeng

On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson erickerick...@gmail.com wrote:

How many unique URLs do you have in your 9M docs? If your 9M hits have 4M distinct URLs, then this is not very valuable to the user.

Sascha: Was that speedup on a single field, or were you faceting over multiple fields? Because as I remember, that code spins off threads on a per-field basis, and if I'm mis-remembering I need to look again!

Best,
Erick

On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT sz...@gmx.de wrote:

Hi Ming,

which Solr version are you using? In case you use one of the latest versions (4.5 or above), try the new parameter facet.threads with a reasonable value (4 to 8 gave me a massive performance speedup when working with large facets, i.e. nTerms > 10^7).

-Sascha

Mingfeng Yang wrote: [...]
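A minimal SolrJ sketch of the facet query under discussion, with facet.threads set; the endpoint is hypothetical, and facet.threads requires Solr 4.5+ as noted above:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.FacetField;

  public class TopUrls {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8995/solr");
      SolrQuery q = new SolrQuery("*:*");
      q.addFilterQuery("source:Video");
      q.setFacet(true);
      q.addFacetField("url");
      q.setFacetLimit(500);      // top 500 most frequent urls
      q.setFacetMinCount(1);
      q.set("facet.threads", 8); // parallelize facet computation (Solr 4.5+)
      q.setRows(0);              // only facet counts are needed, skip the docs
      FacetField urls = server.query(q).getFacetField("url");
      for (FacetField.Count c : urls.getValues()) {
        System.out.println(c.getName() + " " + c.getCount());
      }
    }
  }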
Problem of facet on 170M documents
I have an index with 170M documents, and two of the fields for each doc are source and url. I want to know the top 500 most frequent urls from the Video source, so I did a facet with

  fq=source:Video&facet=true&facet.field=url&facet.limit=500

and the matching documents number about 9 million.

The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and 32G memory; 16G is allocated for the Java heap. 4 master shards run on one machine and 4 replicas on another, connected together via ZooKeeper.

Whenever I issue the query above, the response just takes too long and the client gets timed out. Sometimes an impatient end user waits a few seconds for the results, kills the connection, and then issues the same query again and again. The server then has to deal with multiple such heavy queries simultaneously and gets so busy that we see a "no server hosting shard" error, probably due to lost communication between the Solr node and ZooKeeper.

Is there any way to deal with such a problem?

Thanks,
Ming
Re: spatial search, geofilt does not work
Oh, man. I had been trying to figure out the problem for half a day. Solr could probably use a better error message when the query format is invalid. But, THANKS! David, you probably saved me another half day.

Ming-

On Mon, Aug 19, 2013 at 10:20 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

Thank goodness for Solr's feature of echoing params back in the response, as it helps diagnose problems like this. In your case, the filter query that Solr is seeing isn't what you (seemed to) have given on the command line:

  "fq":"!geofilt sfield=author_geo"

Clearly wrong. Try escaping the braces with URL percent escapes, etc.

~ David

Mingfeng Yang wrote: [...]
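As an alternative to hand-escaping the braces, a client library can build the URL for you. A minimal SolrJ sketch under that assumption (the endpoint is hypothetical):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class GeofiltQuery {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("*:*");
      // SolrJ URL-encodes the local-params braces, so the shell can't eat them.
      q.addFilterQuery("{!geofilt sfield=author_geo}");
      q.set("pt", "35.0,35.0"); // center point
      q.set("d", "10");         // distance in km
      q.setFields("author_geo");
      System.out.println(server.query(q).getResults().getNumFound());
    }
  }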
spatial search, geofilt does not work
My Solr index has a field called author_geo which contains the author's location, and I am trying to get all docs whose author is within 10 km of 35.0,35.0 using the following query:

  curl 'http://localhost/solr/select?q=*:*&fq={!geofilt%20sfield=author_geo}&pt=35.0,35.0&d=10&wt=json&indent=true&fl=author_geo'

I got one matching document, which actually has no value for author_geo:

  {
    "responseHeader":{
      "status":0,
      "QTime":7,
      "params":{
        "d":"10",
        "fl":"author_geo",
        "indent":"true",
        "q":"*:*",
        "pt":"35.0,35.0",
        "wt":"json",
        "fq":"!geofilt sfield=author_geo"}},
    "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
        {}]
    }}

But if I run the following query to do a sort, it shows clearly that there are at least 6 docs within 10 km of 35.0,35.0:

  curl 'http://localhost/solr/select?q=*:*&sort=geodist(author_geo,35,35)+asc&wt=json&indent=true&fl=author_geo,geodist(author_geo,35,35)&fq=author_geo:\[0,0%20TO%20360,360\]'

  {
    "responseHeader":{
      "status":0,
      "QTime":10,
      "params":{
        "fl":"author_geo,geodist(author_geo,35,35)",
        "sort":"geodist(author_geo,35,35) asc",
        "indent":"true",
        "q":"*:*",
        "wt":"json",
        "fq":"author_geo:[0,0 TO 360,360]"}},
    "response":{"numFound":78133,"start":0,"docs":[
        { "author_geo":"34.991199,34.991199", "geodist(author_geo,35,35)":1.2650756688780775},
        { "author_geo":"34.991199,34.991199", "geodist(author_geo,35,35)":1.2650756688780775},
        { "author_geo":"34.991199,34.991199", "geodist(author_geo,35,35)":1.2650756688780775},
        { "author_geo":"35.032242,35.032242", "geodist(author_geo,35,35)":4.634071252404282},
        { "author_geo":"35.04644,35.04644", "geodist(author_geo,35,35)":6.674485609316976},
        { "author_geo":"35.060379,35.060379", "geodist(author_geo,35,35)":8.67754019129343},
        { "author_geo":"34.924019,34.924019", "geodist(author_geo,35,35)":10.923479728441448},
        { "author_geo":"34.89296,34.89296", "geodist(author_geo,35,35)":15.389876355902395},
        { "author_geo":"34.89296,34.89296", "geodist(author_geo,35,35)":15.389876355902395},
        { "author_geo":"35.109669,35.109669", "geodist(author_geo,35,35)":15.759483283896515}]
    }}

Can anyone tell me if anything is wrong here? I am using Solr 4.4.

Thanks,
Ming-
Re: spatial search, geofilt does not work
BTW: my schema.xml contains the following related lines:

  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <field name="author_geo" type="location" indexed="true" stored="true"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

On Mon, Aug 19, 2013 at 2:02 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
list docs with geo location info
I have a schema with a geolocation field named author_geo, defined as

  <field name="author_geo" type="location" indexed="true" stored="true"/>

How can I list docs whose author_geo field is not empty? The filter query fq=author_geo:* does not seem to work the way it does for string, text, or float fields:

  curl 'localhost/solr/select?q=*:*&rows=10&wt=json&indent=true&fq=author_geo:*&fl=author_geo'

What's the right way of doing it?

Thanks,
Mingfeng
Re: list docs with geo location info
Figured it out: author_geo:[* TO *] does the trick.

On Thu, Aug 15, 2013 at 1:26 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
plugin init failure for ShingleFilterFactory
I am trying to upgrade Solr to version 4.4, and it looks like Solr can't load the ShingleFilterFactory class:

  417 [coreLoadExecutor-4-thread-1] ERROR org.apache.solr.core.CoreContainer – Unable to create core: collection1
  org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "textshingle": Plugin init failure for [schema.xml] analyzer/filter: Error instantiating class: 'org.apache.lucene.analysis.shingle.ShingleFilterFactory'
          at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
          at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:467)
          at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:164)
          at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
          at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
          at org.apache.solr.core.ZkContainer.createFromZk(ZkContainer.java:268)
          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:655)
          at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
          at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
          at java.util.concurrent.FutureTask.run(FutureTask.java:166)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
          at java.util.concurrent.FutureTask.run(FutureTask.java:166)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)

The field definition in schema.xml is:

  <fieldType name="textshingle" class="solr.TextField" positionIncrementGap="100" stored="false">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramIfNoNgram="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>
preserve special characters
We need to index and search lots of tweets, which can look like "@solr: solr is great." or "@solr_lucene, good combination." And we want to be able to search for @solr or @solr_lucene. How can we preserve @ and _ in the index?

If we use WhitespaceTokenizer followed by WordDelimiterFilter, @solr_lucene is broken down into solr and lucene, which makes the search results contain lots of non-relevant docs. If we use StandardTokenizer, the @ symbol is stripped.

Thanks,
Ming-
Re: preserve special characters
Hi Jack,

That seems like the solution I am looking for. Thanks so much!

Ming-

On Tue, Jun 18, 2013 at 4:52 PM, Jack Krupansky j...@basetechnology.com wrote:

The WDF has a "types" attribute which can specify one or more character type mapping files. You could create a file like:

  @ => ALPHA
  _ => ALPHA

For example (from the book!): treat at-sign and underscores as text:

  <fieldType name="text_at_under" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
    </analyzer>
  </fieldType>

The file at-under-alpha.txt would contain:

  @ => ALPHA
  _ => ALPHA

The analysis results:

  Source: Hello @World_bar, r@end.
  Tokens: 1: Hello  2: @World_bar  3: r@end

-- Jack Krupansky

-----Original Message----- From: Mingfeng Yang, Subject: preserve special characters [...]
dynamic field
How are dynamic fields in Solr implemented? Do they get saved into the same Document as the other regular fields in the Lucene index?

Ming-
retrieve datefield value from document
I have an index first built with Solr 1.4 and later upgraded to Solr 3.6. It has 150 million documents, and all docs have a date field which is not blank (verified by a Solr query). I am using the following code snippet to retrieve it:

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.*;
  import org.apache.lucene.document.*;

  IndexReader input = IndexReader.open(indexDir);
  int maxDoc = input.maxDoc();
  for (int i = 0; i < maxDoc; i++) {
      Document d = input.document(i);
      System.out.println(d.get("date"));
  }

However, about 100 million docs give null for d.get("date") and the other 50 million give the right values. What could be wrong?

Ming-
Re: retrieve datefield value from document
Michael,

That's what I thought as well. I would assume an optimization of the index would then rewrite all documents in the newer format?

Ming-

On Fri, Jun 14, 2013 at 1:25 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Shot in the dark: You're using Lucene to read the index. That's sort of circumventing all the typing stuff that Solr does.

Solr can deal with an index where some of the segments are in one format (say 1.4) and others are in another (3.6). Maybe the values are being stored in a format in the newer (or older) segments that doesn't work with raw retrieval through Lucene in the same way. Maybe Solr is able to retrieve the stored value from the indexed representation in one case rather than needing to store it.

I'd query your index using EmbeddedSolrServer instead and see if that changes what you see.

Michael Della Bitta

On Fri, Jun 14, 2013 at 4:15 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
Re: retrieve datefield value from document
Hi Dmitry,

No, the docs are not deleted.

Ming-

On Fri, Jun 14, 2013 at 1:31 PM, Dmitry Kan solrexp...@gmail.com wrote:

Maybe a document was marked as deleted? See isDeleted:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexReader.html#isDeleted(int)

On Fri, Jun 14, 2013 at 11:25 PM, Michael Della Bitta wrote: [...]
Re: retrieve datefield value from document
How did you solve the problem then?

Ming

On Fri, Jun 14, 2013 at 3:24 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Yes, that should be what happens. But then I'd guess you'd be able to retrieve no dates. I've encountered this myself.

On Jun 14, 2013 6:05 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
Re: retrieve datefield value from document
Figured out the solution. The date field in those documents was stored as binary, so what I should do is:

  Fieldable df = doc.getFieldable(fname);
  byte[] ary = df.getBinaryValue();
  ByteBuffer bb = ByteBuffer.wrap(ary);
  long num = bb.getLong();
  Date dt = DateTools.stringToDate(DateTools.timeToString(num, DateTools.Resolution.SECOND));

Then dt holds the date value in the right format.

Ming-

On Fri, Jun 14, 2013 at 4:20 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Use EmbeddedSolrServer rather than Lucene directly.

On Jun 14, 2013 6:47 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
SolrEntityProcessor gets slower and slower
I am trying to migrate 100M documents from a Solr index (v3.6) to a SolrCloud index (v4.1, 4 shards) by using SolrEntityProcessor. My data-config.xml looks like:

  <dataConfig>
    <document>
      <entity name="sep" processor="SolrEntityProcessor"
              url="http://10.64.35.117:8995/solr/"
              query="*:*"
              rows="2000"
              fl="author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"/>
    </document>
  </dataConfig>

Initially the data import rate is about 1K docs/second, but it eventually decreases to 20 docs/second after running for tens of hours. The last time I tried a data import with SolrEntityProcessor, the transfer rate was as high as 3K docs/second.

Does anyone have any clues about what could cause the slowdown?

Thanks,
Ming-
shard splitting
From the Solr wiki, I saw this command, which splits one index into 2 shards:

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection_name&shard=shardId

However, is there some way to split into more shards?

Thanks,
Ming-
Re: shard splitting
Hi Shalin,

Do you mean that we can do 1->2, 2->4, 4->8 to get 8 shards eventually? After splitting, if we want to set up a SolrCloud with all 8 shards, how shall we allocate the shards then?

Thanks,
Ming-

On Mon, Jun 10, 2013 at 9:55 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

No, it is hard-coded to split into two shards only. You can call it recursively on a sub-shard to split into more pieces. Please note that some serious bugs were found in that command which will be fixed in the next (4.3.1) release of Solr.

On Tue, Jun 11, 2013 at 9:43 AM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]

--
Regards,
Shalin Shekhar Mangar.
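For illustration, the recursive splitting described above would be a sequence of collection API calls like the following (the collection name is hypothetical; SPLITSHARD names the sub-shards of shard1 as shard1_0 and shard1_1):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1_0
  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1_1

Repeating the same pattern on each sub-shard yields 8 shards.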
Solr 3.6 uses only one CPU
We have a Solr instance running on a 4-CPU box. Sometimes we send a query to our Solr server and it takes up 100% of one CPU and 60% of memory. I assumed that if we send another query request, Solr should be able to use another idling CPU. However, that is not the case: using top, I see only one CPU busy, and the client side just gets stuck. Is Solr 3.6 able to use multithreading to process requests?

Ming-
Re: iterate through each document in Solr
Hi Dmitry,

My index is not sharded, and since its size is so big, sharding won't help much with the paging issue.

Do you know of any API which can help read from the Lucene binary index directly? It would be nice if we could just scan through the docs directly.

Thanks!
Ming-

On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan solrexp...@gmail.com wrote:

Are you doing it once? Is your index sharded? If so, can you ask each shard individually? Another way would be to do it on the Lucene level, i.e. read from the binary indices (an API exists).

Dmitry

On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
Re: iterate through each document in Solr
Andre,

Thanks for the info! Unfortunately, my Solr is on version 3.6, and it looks like those options are not available. :(

Ming-

On Mon, May 6, 2013 at 5:32 AM, Andre Bois-Crettez andre.b...@kelkoo.com wrote:

On 05/06/2013 06:03 AM, Michael Sokolov wrote:

You need to use a unique and stable sort key and get documents > sortkey. For example, if you have a unique key, retrieve documents ordered by the unique key, and for each batch get documents > max(key) from the previous batch.

-Mike

There is more detail on the wiki:
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore

--
André Bois-Crettez
Search technology, Kelkoo
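A minimal SolrJ sketch of the unique-key paging Michael describes; the endpoint and the id field name are assumptions, and on Solr 3.6 the client class would be CommonsHttpSolrServer rather than HttpSolrServer:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrDocumentList;

  public class ScanAllDocs {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
      String lastId = null; // max unique key seen in the previous batch
      while (true) {
        SolrQuery q = new SolrQuery("*:*");
        if (lastId != null) {
          // only docs with id strictly greater than the previous batch's max;
          // ids containing reserved query characters would need escaping
          q.addFilterQuery("id:{" + lastId + " TO *}");
        }
        q.setSortField("id", SolrQuery.ORDER.asc);
        q.setRows(500);
        SolrDocumentList batch = server.query(q).getResults();
        if (batch.isEmpty()) break;
        for (SolrDocument d : batch) {
          lastId = (String) d.getFieldValue("id");
          // ... process d ...
        }
      }
      server.shutdown();
    }
  }

Unlike start/rows paging, the cost per batch stays flat because Solr never has to skip over already-seen documents.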
iterate through each document in Solr
Dear Solr Users,

Does anyone know the best way to iterate through each document in a Solr index with a billion entries?

I tried to use select?q=*:*&start=xx&rows=500 to get 500 docs each time, changing the start value, but it got very slow after getting through about 10 million docs.

Thanks,
Ming-
Re: facet.method enum vs fc
Joel,

Thanks for your kind reply. The problem is solved with sharding and facet.method=enum. I am curious about the difference between enum and fc that makes enum work where fc does not. Do you know something about this?

Thank you!

Regards,
Ming

On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote:

Faceting on a high-cardinality string field, like url, on a 120 million record index is going to be very memory intensive. You will very likely need to shard the index to get the performance that you need.

In Solr 4.2 you can make the url field a disk-based DocValue and shift the memory from Solr to the file system cache. But to run efficiently this is still going to take a lot of memory in the OS file cache.

On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]

--
Joel Bernstein
Professional Services
LucidWorks
Re: Updating clusterstate from the zookeeper
Right. I am wondering if/how we can download a specific file from ZooKeeper, modify it, and then upload it to overwrite the original. Anyone?

Thanks,
Ming

On Fri, Apr 19, 2013 at 10:53 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

I would like to know the answer to this as well.

Michael Della Bitta

On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:

Hello,

After creating a distributed collection on several different servers, I sometimes have to deal with failing servers (cores appear not available = grey) or failing cores (down / unable to recover = brown / red). In case I wish to delete such an erroneous collection (through the collection API), only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json.

Is there any way to edit the clusterstate.json explicitly? If not, how do I update it so the collection above gets deleted?

Cheers,
Manu
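A minimal sketch of the download/modify/upload cycle being asked about, using the plain ZooKeeper Java client; the connect string is hypothetical, and hand-editing clusterstate.json is only safe while the Solr nodes are stopped:

  import org.apache.zookeeper.ZooKeeper;
  import org.apache.zookeeper.data.Stat;

  public class EditClusterState {
    public static void main(String[] args) throws Exception {
      ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, null);
      Stat stat = new Stat();
      // download: read the znode's current content and version
      byte[] data = zk.getData("/clusterstate.json", false, stat);
      String json = new String(data, "UTF-8");
      // ... modify the JSON here, e.g. remove the dead collection's entry ...
      // upload: passing the read version guards against concurrent changes
      zk.setData("/clusterstate.json", json.getBytes("UTF-8"), stat.getVersion());
      zk.close();
    }
  }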
Re: facet.method enum vs fc
20G is allocated to Solr already.

Ming

On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
I am doing faceting on an index of 120M documents, on the field of url [...]

I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr?

- Toke Eskildsen
facet.method enum vs fc
I am doing faceting on an index of 120M documents, on the field of url, using the following two queries. Note that the only difference between the two queries is that the first one uses the default facet.method, and the second one uses facet.method=enum. (Each document in the index contains a review we extracted from the internet, with multiple fields; the url field holds the link to the original web page. The matching document set is about 5.3 million.)

  http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0

  http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0&facet.method=enum

The first method gives me an out-of-memory error (ERROR 500: Java heap space java.lang.OutOfMemoryError: Java heap space), but the second one runs fine, though very slowly (163 seconds).

According to the wiki and the Solr documentation, the default facet.method=fc uses less memory than facet.method=enum, doesn't it?

Thanks,
Ming
Re: facet.method enum vs fc
Does Solr 3.6 have facet.method=fcs? I tried anyway, and got ERROR 500: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded.

On Wed, Apr 17, 2013 at 12:38 PM, Timothy Potter thelabd...@gmail.com wrote:

What are your results when using facet.method=fcs?

On Wed, Apr 17, 2013 at 12:06 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
Re: tokenizer of solr
Jack,

Thanks so much for this info. It's awesome.

Ming

On Thu, Apr 11, 2013 at 7:32 PM, Jack Krupansky j...@basetechnology.com wrote:

In that case, use the types="wdfftypes.txt" attribute of WDF and map @ and _ to ALPHA as shown in:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-- Jack Krupansky

-----Original Message----- From: Mingfeng Yang, Subject: Re: tokenizer of solr [...]
tokenizer of solr
Dear Solr users and developers,

I am trying to index some documents, some of which are Twitter messages, and we have a problem when indexing retweets. Say a Twitter user named jpc_108 posts a tweet, and then someone retweets his message, so now @jpc_108 becomes part of the tweet text body.

It seems that before indexing, Solr's tokenizer factory turns @jpc_108 into jpc and 108, so when we search for jpc_108, it's not there anymore. Is there any way we can keep jpc_108 when it appears as @jpc_108?

Thanks,
Ming-
Re: tokenizer of solr
Looks like it's due to the word delimiter filter. Does anyone know whether the protected-words file supports regular expressions or not?

Ming

On Thu, Apr 11, 2013 at 4:58 PM, Jack Krupansky j...@basetechnology.com wrote:

Try the whitespace tokenizer.

-- Jack Krupansky

-----Original Message----- From: Mingfeng Yang, Subject: tokenizer of solr [...]
update some fields vs replace the whole document
Generally speaking, which has better performance for Solr?

1. updating some fields or adding new fields into a document, or
2. replacing the whole document.

As I understand it, updating fields needs to search for the corresponding doc first and then replace the field values, while replacing the whole document is just like adding a new document. Is that right?
Re: update some fields vs replace the whole document
Then what's the difference between adding a new document vs. replacing/overwriting a document?

Ming-

On Fri, Mar 8, 2013 at 2:07 PM, Upayavira u...@odoko.co.uk wrote:

With an atomic update, you need to retrieve the stored fields in order to build up the full document to insert back. In either case, you'll have to locate the previous version and mark it deleted before you can insert the new version.

I bet that the amount of time spent retrieving stored fields is matched by the time saved by not having to transmit those fields over the wire, although I'd be very curious to see someone actually test that.

Upayavira

On Fri, Mar 8, 2013, at 09:51 PM, Mingfeng Yang wrote: [...]
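To make the two options concrete, a minimal SolrJ sketch; the field names and endpoint are hypothetical, and atomic updates require Solr 4.0+ with all fields stored:

  import java.util.Collections;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class UpdateVsReplace {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

      // Full replace: a new document with the same unique key overwrites the old one.
      SolrInputDocument full = new SolrInputDocument();
      full.addField("id", "doc1");
      full.addField("title", "new title");
      server.add(full);

      // Atomic update: only 'title' is sent; Solr re-reads the remaining
      // stored fields internally to rebuild and re-index the full document.
      SolrInputDocument partial = new SolrInputDocument();
      partial.addField("id", "doc1");
      partial.addField("title", Collections.singletonMap("set", "newer title"));
      server.add(partial);

      server.commit();
      server.shutdown();
    }
  }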
pivot facet with solrcloud (solr 4.1)
It looks like pivot faceting with SolrCloud does not work (I am using Solr 4.1). The query below returns no pivot facet results unless I add shards=shard1:

  http://localhost:8995/solr/collection1/select?q=*%3A*&facet=true&facet.mincount=1&facet.pivot=source_domain,author&rows=1&wt=json&facet.limit=5

When will this JIRA (https://issues.apache.org/jira/browse/SOLR-2894) be implemented?

Thanks,
Ming-
solrcloud data directory structure
I see the items under my SolrCloud data directory of a replica node as:

  drwxr-xr-x 2 solr solr    42 Feb 22 18:19 index
  drwxr-xr-x 2 solr solr 12288 Feb 23 01:00 index.20130222181947835
  -rw-r--r-- 1 solr solr    78 Feb 22 18:25 index.properties
  -rw-r--r-- 1 solr solr   209 Feb 22 18:25 replication.properties
  drwxr-xr-x 2 solr solr    99 Feb 23 01:00 tlog

The index.<timestamp> directory is always there, but in the old Solr master/slave replication setup, the index.<timestamp> directory becomes index after replication is done. What's the reason? Is it because in SolrCloud the replica node is always replicating?

Thanks,
Ming
Re: How to change the index dir in Solr 4.1
How about passing -Dsolr.data.dir=/your/data/dir on the command line to java when you start the Solr service?

On Thu, Feb 21, 2013 at 9:05 AM, chamara chama...@gmail.com wrote:

Yes, that is what I am doing now. I thought this solution was not elegant for a deployment. Is there any other way to do this from solrconfig.xml?
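For example, with the stock Jetty-based example layout (the data path here is hypothetical):

  java -Dsolr.data.dir=/var/solr/data -jar start.jar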
Re: can i install new SOLR 4.1 as slaver(3.3 Master)
I cannot give an affirmative answer, but I suspect it would cause problems, as the index formats in 3.3 and 4.1 are slightly different.

Why don't you upgrade to 4.1? All you need to do is:

1. Install Solr 4.1.
2. Copy all related config files from 3.3, and back up the index data folder.
3. Shut down Solr 3.3.
4. Start Solr 4.1 with solr.data.dir pointing to the old directory.

On Thu, Feb 21, 2013 at 10:54 AM, michaelweica m...@hipdigital.com wrote:

Hi, our Solr master version is 3.3. Can I install a new box with Solr 4.1 as a slave and replicate from the master data?

Thanks
Re: RequestHandler init failure
Chris,

My config file did include the section for loading the related plugin.

Ming

On Tue, Feb 19, 2013 at 10:42 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Found it by myself. It's here
: http://mirrors.ibiblio.org/maven2/org/apache/solr/solr-dataimporthandler/4.1.0/
:
: Download and move the jar file to the solr-webapp/webapp/WEB-INF/lib directory, and the errors are all gone.

You don't need to move/copy/add any jars into the Solr webapp (where they will be blown away if/when you upgrade the webapp). All you need to do is load the jar as a plugin:

https://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins
https://wiki.apache.org/solr/SolrConfigXml#lib

-Hoss
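For reference, the plugin-loading approach Hoss points to is a lib directive in solrconfig.xml; a sketch with a hypothetical relative path:

  <lib dir="../../dist/" regex="solr-dataimporthandler-.*\.jar" />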
RequestHandler init failure
When trying to use SolrEntityProcessor to do a data import from another Solr index (Solr 4.1), I added the following to solrconfig.xml:

  <requestHandler name="/data" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

and created a new file data-config.xml with:

  <dataConfig>
    <document>
      <entity name="sep" processor="SolrEntityProcessor"
              url="http://wolf:1Xnbdoq@myserver:8995/solr/"
              query="*:*"
              fl="id,md5_text,title,text"/>
    </document>
  </dataConfig>

I got the following errors:

  org.apache.solr.common.SolrException: RequestHandler init failure
          at org.apache.solr.core.SolrCore.init(SolrCore.java:794)
          at org.apache.solr.core.SolrCore.init(SolrCore.java:607)
          at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:949)
          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
          at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
          at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:619)
  Caused by: org.apache.solr.common.SolrException: RequestHandler init failure
          at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:168)
          at org.apache.solr.core.SolrCore.init(SolrCore.java:731)
          ... 13 more
  Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'
          at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:438)
          at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:507)
          at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:581)
          at org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:154)
          ... 14 more
  Caused by: java.lang.ClassNotFoundException: org.apache.solr.handler.dataimport.DataImportHandler
          at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
          at java.security.AccessController.doPrivileged(Native Method)
          at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
          at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
          at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
          at java.lang.Class.forName0(Native Method)
          at java.lang.Class.forName(Class.java:247)
          at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:422)
          ... 17 more
  Feb 18, 2013 7:24:43 PM org.apache.solr.common.SolrException log
  SEVERE: null:org.apache.solr.common.SolrException: Unable to create core: collection1
          at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
          at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)

I assume it's because the jar file for the DataImportHandler is not included in the default Solr 4.1 distribution. Where can I find it?

Thanks
Ming
Re: RequestHandler init failure
Found it by myself. It's here:
http://mirrors.ibiblio.org/maven2/org/apache/solr/solr-dataimporthandler/4.1.0/

Download and move the jar file to the solr-webapp/webapp/WEB-INF/lib directory, and the errors are all gone.

Ming

On Mon, Feb 18, 2013 at 11:52 AM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]
fastest way to rebuild Solr index
I have a few Solr indexes, each with 20-200 million documents, which were built by querying multiple PostgreSQL databases. If I rebuild the indexes the same way, it will take a few months, because the PostgreSQL queries are slow.

Now I need to make the following changes to all indexes:

1. delete a couple of fields from the Solr index
2. add a couple of new fields
3. change the type of one field from string to int

Luckily, all fields were indexed and stored. My plan is to query an old index, get the values of all fields, and then add them to the new index.

Are there any faster ways to build the new indexes in my case?

Thanks,
Ming
Re: fatest way to rebuild Solr index
Shawn,

Awesome. Exactly what I was looking for. Thanks!

Ming

On Thu, Feb 14, 2013 at 12:00 PM, Shawn Heisey s...@elyograg.org wrote:

On 2/14/2013 12:46 PM, Mingfeng Yang wrote: [...]

Using the DataImportHandler with SolrEntityProcessor is probably your best bet. I believe you would want to avoid updating the source index while using this.

http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

Thanks,
Shawn
Traditional replication behind SolrCloud
Our application of Solr is somewhat atypical. We constantly feed Solr lots of documents grabbed from the internet, and NRT searching is not required. A typical search returns millions of results, and query response needs to be as fast as possible.

In a SolrCloud environment, indexing requests are constantly distributed to all leaders and replicas, and I think that may hurt query performance, since the replicas are doing indexing and searching at the same time.

I am thinking about setting up traditional replication behind each shard of SolrCloud, with the replication interval set to a few minutes, to minimize the impact of indexing on system resources. Or is there already some way to enforce the traditional type of replication in the replicas of SolrCloud?

Thanks,
Ming
Re: How to migrate SolrCloud shards to different servers?
An experiment found that stopping all shards, removing the zoo_data (assuming your ZooKeeper is used only for this particular SolrCloud; otherwise, be cautious), and then starting the instances in order works fine.

Ming

On Sat, Jan 26, 2013 at 5:31 AM, Per Steffensen st...@designware.dk wrote:

Hi

We have actually tested this and found that the following will do it:

* Shut down all Solr nodes; make sure the ZKs are still running.
* For each replica (shard instance), move its data folder to the new server (if it is not already available to it through some shared storage).
* For each replica (shard instance), also move solr.xml.
* Extract clusterstate.json from ZK into a file. Modify that file so that hosts/IPs and ports are correct according to the new setup. Replace clusterstate.json in ZK with the modified content of the clusterstate.json file.
* Start the new Solr nodes.

Good luck!

Regards,
Per Steffensen

On 1/26/13 6:56 AM, Mingfeng Yang wrote: [...]
Re: Distributed search
In your case, since there are no concurrent queries, adding replicas won't help much with response speed. However, breaking your index into a few shards does help query performance. I recently broke an index with 30 million documents (30G) into 4 shards, and the boost is pretty impressive (roughly 2-5x faster for a complicated query).

Ming

On Mon, Jan 28, 2013 at 10:54 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

Does adding replicas (on additional servers) help to improve search performance? It is known that each query goes to all the shards. It's clear that if we have massive load, then multiple cores serving the same shard are very useful. But what happens if I'll never have concurrent queries (one query in the system at any time), but I want these single queries to return faster? Will a bigger replication factor contribute?

Especially, will a complicated query (with a large number of queried fields) go to multiple cores *of the same shard*? (E.g. core1 searching for term1 in field1, and core2 searching for term2 in field2.) And what about a query on a single field which contains a lot of terms?

Thanks in advance.
secure Solr server
Before Solr 4.0, I secured Solr by enabling password protection in Jetty. However, password protection makes SolrCloud not work. We use EC2 now, and we need the web admin interface of Solr to be accessible (with a password) from anywhere.

How do you protect your Solr server from unauthorized access?

Thanks,
Ming
maxScore field in SolrCloud response
We are migrating our Solr index from a single index to multiple shards with SolrCloud. I noticed that when I query SolrCloud (all shards, or just one of the shards), the response has a maxScore field, but a query against the single index does not include this field. In both cases we are using Solr 4.0. Why is there such a difference?

Ming
How to migrate SolrCloud shards to different servers?
Right now I have an index with four shards on a single EC2 server, each running on a different port. Now I'd like to migrate three shards to independent servers. What should I do to accomplish this process safely? Can I just:

1. shut down all four Solr instances,
2. copy three shards (indexes) to different servers,
3. launch 4 Solr instances on 4 different servers, each with -DzkHost specified, pointing to the ZooKeeper servers?

In my impression, ZooKeeper remembers which shards are leaders, so what I plan above might not elect the three new servers as leaders. If so, what's the correct way to do it?

Thanks,
Ming
Re: How to migrate SolrCloud shards to different servers?
Hi Mark,

When I did testing with SolrCloud, I found the following:

1. I started 4 shards on the same host, on ports 8983, 8973, 8963, and 8953.
2. I indexed some data.
3. I shut down all 4 shards.
4. I started the 4 shards again, all pointing to the same data directories and using the same configuration, except that now they use ports 8983, 8973, 7633, and 7648.
5. Now Solr has problems loading all cores properly.

Therefore, I had the impression that ZooKeeper may have a memory of which hosts correspond to which shards; if I change the host info, it may get confused. I could not find any related documentation or discussion about this issue.

Thanks,
Ming

On Fri, Jan 25, 2013 at 5:52 PM, Mark Miller markrmil...@gmail.com wrote:

You could do it that way. I'm not sure why you are worried about the leaders. That shouldn't matter.

You could also start up new Solrs on the new machines as replicas of the cores you want to move; then, once they are active, unload the cores on the old machine, stop the Solr instances, and remove the stuff left on the filesystem.

- Mark

On Jan 25, 2013, at 7:42 PM, Mingfeng Yang mfy...@wisewindow.com wrote: [...]