Re: search for a number within a range, where range values are mentioned in documents
During data import, could you update each record with min and max fields? These would be equal in the case of a single, non-range value. I know this is a data pre-processing step rather than a Solr solution, but it should work. Failing that, I've seen a reference in the docs to a compound value field (in the context of points, i.e. point = lat,lon), which would be a nice way to store your range fields, although I still think you will need to pre-process your data. cheers lee

On 15 December 2010 18:22, Jonathan Rochkind rochk...@jhu.edu wrote: I'm not sure you're right that it will result in an out-of-memory error if the range is too large. I don't think it will; I think it'll be fine as far as memory goes, because of how Lucene works. Or do you actually have reason to believe it was causing you memory issues? Or do you just mean memory issues in your transformer, not actually in Solr? Using Trie fields should also make it fine as far as CPU time goes. Using a trie int field with a non-zero precision should likely be helpful in this case. It _will_ increase the on-disk size of your indexes. I'm not sure if there's a better approach, I can't think of one, but maybe someone else knows one.

On 12/15/2010 12:56 PM, Arunkumar Ayyavu wrote: Hi! I have a typical case wherein an attribute (in a DB record) can contain different ranges of numeric values. Let us say the range values in this attribute for record1 are (2-4,5000-8000,45000-5,454,231,1000). As you can see, this attribute can also contain isolated numeric values such as 454, 231 and 1000. Now, I want to return record1 if the user searches for 20001 or 5003 or 231 or 5. Right now, I'm exploding the range values (within a transformer) and indexing record1 for each of the values within a range. But this could result in an out-of-memory error if the range is too large. Could you help me figure out a better way of addressing this type of query using Solr. Thanks a ton.
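A rough SolrJ sketch of how lee's min/max pre-processing idea could be queried. The field names (record_id, range_min, range_max) and the one-Solr-document-per-range layout are assumptions for illustration, not something from the original schema; range_min and range_max would be numeric (trie) fields populated during import, with min equal to max for isolated values such as 454 or 231.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RangeLookup {
    public static void main(String[] args) throws Exception {
        // Assumes each range (or single value) was flattened at import time into
        // one Solr document carrying record_id, range_min and range_max.
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        int userValue = 5003;
        SolrQuery q = new SolrQuery("*:*");
        // Match documents whose [range_min, range_max] interval contains the value.
        q.addFilterQuery("range_min:[* TO " + userValue + "] AND range_max:[" + userValue + " TO *]");
        q.setFields("record_id");

        QueryResponse rsp = solr.query(q);
        System.out.println("Matching records: " + rsp.getResults());
    }
}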
Re: Memory use during merges (OOM)
How long does it take to reach this OOM situation? Is it possible for you to try a merge with each setting in turn, and evaluate what impact they each have? That is, indexing speed and memory consumption? It might be interesting to watch garbage collection too while it is running, with jstat, as that could be your speed bottleneck. Upayavira

On Wed, 15 Dec 2010 18:52 -0500, Burton-West, Tom tburt...@umich.edu wrote: Hello all, Are there any general guidelines for determining the main factors in memory use during merges? We recently changed our indexing configuration to speed up indexing, but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of the indexWriter log. The changes increased the indexing throughput by almost an order of magnitude (from about 600 documents per hour to about 6000 documents per hour; our documents are about 800K). We are trying to determine which of the changes to tweak to avoid the OOM but still keep the benefit of the increased indexing throughput. Is it likely that the change to ramBufferSizeMB is the culprit, or could it be the mergeFactor change from 10 to 20? Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr? Are there rules of thumb for the memory needed in terms of the number or size of segments? Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB. Tom Burton-West

Changes to indexing configuration:
mergeScheduler before: serialMergeScheduler after: concurrentMergeScheduler
mergeFactor before: 10 after: 20
ramBufferSizeMB before: 32 after: 320

excerpt from indexWriter.log
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 20 to 40: add this merge
...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
tom
RE: Dataimport performance
Check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e
This approach of not using sub-entities really improved our load time. Ephraim Ofir

-Original Message- From: Robert Gründler [mailto:rob...@dubture.com] Sent: Wednesday, December 15, 2010 4:49 PM To: solr-user@lucene.apache.org Subject: Re: Dataimport performance

I've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec. Without join and without subquery: 500k in 2 min 30 sec. With subquery and with left join: 320k in 6 min 30 sec. So the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the SQL data for the index, so I fear I can't really get rid of the join / subquery. -robert

On Dec 15, 2010, at 15:43, Tim Heckman wrote: 2010/12/15 Robert Gründler rob...@dubture.com: The data-config.xml looks like this (only 1 entity):

<entity name="track" query="select t.id as id, t.title as title, l.title as label from track t left join label l on (l.id = t.label_id) where t.deleted = 0" transformer="TemplateTransformer">
  <field column="title" name="title_t" />
  <field column="label" name="label_t" />
  <field column="id" name="sf_meta_id" />
  <field column="metaclass" template="Track" name="sf_meta_class"/>
  <field column="metaid" template="${track.id}" name="sf_meta_id"/>
  <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
  <entity name="artists" query="select a.name as artist from artist a left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${track.id}">
    <field column="artist" name="artists_t" />
  </entity>
</entity>

So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import. For my own purposes, I wrote a custom data import using SolrJ to improve the performance (from 3 hours to 10 minutes). Just as a test, how long does it take if you comment out the artists entity?
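For reference, a minimal sketch of the kind of custom SolrJ import Tim describes, pushing the denormalization into a single SQL query instead of one sub-entity query per track. The JDBC URL and credentials are placeholders, GROUP_CONCAT is a MySQL-ism that would need adapting for other databases, and the field names are taken from the data-config above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TrackImporter {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/music", "user", "pass");

        // One denormalizing query instead of one sub-query per track.
        String sql = "select t.id, t.title, l.title as label, "
                   + "group_concat(a.name separator '|') as artists "
                   + "from track t "
                   + "left join label l on l.id = t.label_id "
                   + "left join track_artist ta on ta.track_id = t.id "
                   + "left join artist a on a.id = ta.artist_id "
                   + "where t.deleted = 0 group by t.id";

        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql);
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("sf_unique_id", "Track_" + rs.getLong("id"));
            doc.addField("title_t", rs.getString("title"));
            doc.addField("label_t", rs.getString("label"));
            String artists = rs.getString("artists");
            if (artists != null) {
                for (String artist : artists.split("\\|")) {
                    doc.addField("artists_t", artist);
                }
            }
            batch.add(doc);
            // Send documents in batches rather than one HTTP request per row.
            if (batch.size() == 1000) { solr.add(batch); batch.clear(); }
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();
        con.close();
    }
}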
PHPSolrClient
First of all, it's a very nice piece of work. I am just getting my feet wet with Solr in general, so I'm not even sure how a document is NORMALLY deleted. The library PHPDocs say 'add', 'get', 'delete', but does anyone know about 'update'? (Obviously one can read-delete-modify-create.) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Thank you!
I feel the same way about this group and the Postgres group. VERY helpful people. All of us helping each other. Dennis Gearon Signature Warning

- Original Message From: Adam Estrada estrada.a...@gmail.com Subject: Thank you! I just want to say that this list has been invaluable to a newbie like me ;-) I posted a question earlier today and literally 10 minutes later I got an answer that helped me solve my problem. This is proof that there is an experienced and energetic community behind this FOSS group of projects, and I really appreciate everyone who has put up with my otherwise trivial questions! More importantly, thanks to all of the contributors who make the whole thing possible! I attended the Lucene Revolution conference in Boston this year and the information that I was able to take away from the whole thing has made me and my vocation a lot more valuable. Keep up the outstanding work in the discovery of useful information from a sea of bleh ;-) Kindest regards, Adam
Re: PHPSolrClient
Hi Dennis, Not particular to the client you use (solr-php-client) for sending documents: think of an update as an overwrite. This means that if you update a particular document, the previous version indexed is lost. Therefore, when updating a document, make sure that all the fields to be indexed and retrieved are present in the update. For an update to occur, only the uniqueKey id (as specified in your schema.xml) has to be the same as the document you want to update. In short, an update is like an add (and performed the same way), except that the added document was previously indexed. It simply gets replaced by the update. Hope that helps, -- Tanguy

2010/12/16 Dennis Gearon gear...@sbcglobal.net: First of all, it's a very nice piece of work. I am just getting my feet wet with Solr in general, so I'm not even sure how a document is NORMALLY deleted. The library PHPDocs say 'add', 'get', 'delete', but does anyone know about 'update'? (Obviously one can read-delete-modify-create.) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
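A small SolrJ sketch of the same point (the mechanics are identical from the PHP client): an 'update' is just another add carrying the same uniqueKey, and it must re-send every field you still want stored. The field names here are hypothetical.

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdateByReadd {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Re-adding a document with the same uniqueKey ("id" here, assumed)
        // replaces the previously indexed version wholesale, so every field
        // must be sent again; there is no partial update.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("title", "Revised title");
        doc.addField("body", "Full body text, re-sent in its entirety");

        solr.add(doc);
        solr.commit();
    }
}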
indexing a lot of XML dokuments
Hi users, I'm searching for a way to index a lot of XML documents as fast as possible. I have more than 1 million docs on Server 1 and a Solr multicore setup on Server 2 with Tomcat. I don't know how I can do it easily and quickly, and I can't find an idea in the wiki. Maybe you have some ideas? King
Re: Memory use during merges (OOM)
RAM usage for merging is tricky. First off, merging must hold open a SegmentReader for each segment being merged. However, it's not necessarily a full segment reader; for example, merging doesn't need the terms index nor norms. But it will load deleted docs. But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Furthermore, if the deletions you (by Term/Query) do in fact result in deleted documents (ie they were not false deletions), then the merging allocates an int[maxDoc()] for each SegmentReader that has deletions. Finally, if you have multiple merges running at once (see CSM.setMaxMergeCount) that means RAM for each currently running merge is tied up. So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions. If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. Mike On Wed, Dec 15, 2010 at 6:52 PM, Burton-West, Tom tburt...@umich.edu wrote: Hello all, Are there any general guidelines for determining the main factors in memory use during merges? We recently changed our indexing configuration to speed up indexing but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of the indexwriter log. The changes increased the indexing though-put by almost an order of magnitude. (about 600 documents per hour to about 6000 documents per hour. Our documents are about 800K) We are trying to determine which of the changes to tweak to avoid the OOM, but still keep the benefit of the increased indexing throughput Is it likely that the changes to ramBufferSizeMB are the culprit or could it be the mergeFactor change from 10-20? Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr? Are there rules of thumb for the memory needed in terms of the number or size of segments? Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB. Tom Burton-West - Changes to indexing configuration: mergeScheduler before: serialMergeScheduler after: concurrentMergeScheduler mergeFactor before: 10 after : 20 ramBufferSizeMB before: 32 after: 320 excerpt from indexWriter.log Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: findMerges: 40 segments Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 7.23609 to 7.98609: 20 segments Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 0 to 20: add this merge Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 5.44878 to 6.19878: 20 segments Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 20 to 40: add this merge ... Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: applyDeletes Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted docIDs and 0 deleted queries on 40 segments. 
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit exception flushing deletes Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit OutOfMemoryError inside updateDocument tom
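A hedged sketch of the "avoid false deletions" advice using the plain Lucene API: only issue the delete half of updateDocument when the id term can actually match something. Note the check is only as fresh as the IndexReader passed in, so this assumes the reader is reopened often enough to see recent adds; the "id" field name is an assumption.

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class GuardedUpdate {
    // Skip the buffered-delete bookkeeping Mike describes when the term
    // being "replaced" cannot exist in the index.
    static void addOrUpdate(IndexWriter writer, IndexReader reader,
                            String id, Document doc) throws Exception {
        Term idTerm = new Term("id", id);
        if (reader.docFreq(idTerm) > 0) {
            writer.updateDocument(idTerm, doc);   // real replacement
        } else {
            writer.addDocument(doc);              // brand-new doc, no delete needed
        }
    }
}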
Re: Results from More then One Cors?
OK, it worked great at the beginning, but now I get a big error :-(
HTTP Status 500 - null java.lang.NullPointerException
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:462)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
Determining core name from a result?
Hi all, I've been bashing my head against the wall for a few hours now, trying to get mlt (more-like-this) queries working across multiple cores. I've since seen a JIRA issue and documentation saying that multicore doesn't yet support mlt queries. Oops! Anyway, to get around this, I was planning to send the mlt query just to the specific core that a particular result came from, but I can't see a way to obtain that information from the results. If I figure it out by hand, I can get a MLT query to produce similar documents from that core which is probably good enough for the time being. Does anyone know how, after performing a multi-core search to retrieve a single document, I can then find out which core that result came from? I'm using Solr branch_3x. Many thanks Mark -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Google like search
Hi all, thanks for your suggestions. I got the result I expected. Cheers, Satya
Re: PHPSolrClient
As Tanguy says, simply re-adding a document with the same uniqueKey will automatically delete/re-add the doc. But I wanted to add a caution about your phrase read-delete-modify-create: you only get back what you *stored*. So generally the update is done from the original source rather than from the index. It's simple: if the uniqueKey is there, just add the doc. Best Erick

On Thu, Dec 16, 2010 at 4:14 AM, Dennis Gearon gear...@sbcglobal.net wrote: First of all, it's a very nice piece of work. I am just getting my feet wet with Solr in general, so I'm not even sure how a document is NORMALLY deleted. The library PHPDocs say 'add', 'get', 'delete', but does anyone know about 'update'? (Obviously one can read-delete-modify-create.) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Determining core name from a result?
How are you querying the core to begin with? On Dec 16, 2010, at 6:46 AM, Mark Allan wrote: Hi all, I've been bashing my head against the wall for a few hours now, trying to get mlt (more-like-this) queries working across multiple cores. I've since seen a JIRA issue and documentation saying that multicore doesn't yet support mlt queries. Oops! Anyway, to get around this, I was planning to send the mlt query just to the specific core that a particular result came from, but I can't see a way to obtain that information from the results. If I figure it out by hand, I can get a MLT query to produce similar documents from that core which is probably good enough for the time being. Does anyone know how, after performing a multi-core search to retrieve a single document, I can then find out which core that result came from? I'm using Solr branch_3x. Many thanks Mark -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Grant Ingersoll http://www.lucidimagination.com/
Re: Thank you!
Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have done it without this site. Smiley and Pugh's book was useful, but this forum was invaluable. I don't have as many questions now, but each new venture, Geospatial searching, replication and redundancy, performance tuning, brings me back again and again. This and stackoverflow.com have to be two of the most useful destinations on the internet for developers. Communities are so much more relevant than reference materials, and the consistent activity in this community is impressive. -- View this message in context: http://lucene.472066.n3.nabble.com/Thank-you-tp2096329p2098512.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Determining core name from a result?
Hi Grant, Thanks for your reply. I'm using solrj to connect via http, which eventually sends this query:
http://localhost:8984/solr/core0/select/?q=id:022-80633905&version=2&start=0&rows=1&fl=*&indent=on&shards=localhost:8984/solr/core0,localhost:8984/solr/core1,localhost:8984/solr/core2,localhost:8984/solr/core3,localhost:8984/solr/core4
I subsequently send the MLT query, which ends up looking like:
http://localhost:8984/solr/core0/mlt/?q=id:022-80633905&version=2&start=0&rows=5&fl=id&indent=on&mlt.fl=description&mlt.match.include=false&mlt.minwl=3&mlt.mintf=1&mlt.mindf=1&shards=localhost:8984/solr/core0,localhost:8984/solr/core1,localhost:8984/solr/core2,localhost:8984/solr/core3,localhost:8984/solr/core4
If I run that query in a browser, the response returned is:
<response> <responseHeader> <status>0</status> <QTime>3</QTime> </responseHeader> <null name='response'/> </response>
Now, because I know the document with id 022-80633905 went into core 1, I get the correct results if I change the first part of the URL to http://localhost:8984/solr/core1/mlt but doing so requires my app (not just me!) to know which core the result came from. Thanks Mark

On 16 Dec 2010, at 1:44 pm, Grant Ingersoll wrote: How are you querying the core to begin with?

On Dec 16, 2010, at 6:46 AM, Mark Allan wrote: Hi all, I've been bashing my head against the wall for a few hours now, trying to get mlt (more-like-this) queries working across multiple cores. I've since seen a JIRA issue and documentation saying that multicore doesn't yet support mlt queries. Oops! Anyway, to get around this, I was planning to send the mlt query just to the specific core that a particular result came from, but I can't see a way to obtain that information from the results. If I figure it out by hand, I can get a MLT query to produce similar documents from that core which is probably good enough for the time being. Does anyone know how, after performing a multi-core search to retrieve a single document, I can then find out which core that result came from? I'm using Solr branch_3x. Many thanks Mark -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
STUCK Threads at org.apache.lucene.document.CompressionTools.decompress
Hello guys, I am getting threads stuck forever at * org.apache.lucene.document.CompressionTools.decompress*. I am using Weblogic 10.02, with solr deployed as ear and no work manager specifically configured for this instance. Only doing simple queries at this node (q=itemId:9 or q:skuId:9). My index has 3Giga. Now i send the thread dump of the stuck threads. Does anyone ever had this kind of problem? '[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms user time=186506940.ms at java.util.zip.Inflater.inflateFast(Native Method) at java.util.zip.Inflater.inflateBytes(Inflater.java:360) at java.util.zip.Inflater.inflate(Inflater.java:218) at java.util.zip.Inflater.inflate(Inflater.java:235) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427) at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42) at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3402) at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321) at weblogic.security.service.SecurityManager.runAs(Unknown Source) at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140) at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046) at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1398) at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200) at weblogic.work.ExecuteThread.run(ExecuteThread.java:172) 'weblogic.time.TimeEventGenerator' Id=20, TIMED_WAITING on lock=weblogic.time.common.internal.timeta...@f051231a, total cpu time=60.ms user time=60.ms at java.lang.Object.wait(Native Method) at weblogic.time.common.internal.TimeTable.snooze(TimeTable.java:286) at weblogic.time.common.internal.TimeEventGenerator.run(TimeEventGenerator.java:117) at java.lang.Thread.run(Thread.java:595) 'JMAPI event thread' Id=21, RUNNABLE on lock=, total cpu time=1220.ms user time=880.ms 'weblogic.timers.TimerThread' Id=22, TIMED_WAITING on lock=weblogic.timers.internal.timerthr...@f050f3e4, total cpu time=1390.ms user time=1080.ms at java.lang.Object.wait(Native Method) at weblogic.timers.internal.TimerThread$Thread.run(TimerThread.java:265) '[STUCK] ExecuteThread: '4' for queue: 
'weblogic.kernel.Default (self-tuning)'' Id=74, RUNNABLE on lock=, total cpu time=180761590.ms user time=180706770.ms at java.util.zip.Inflater.inflateFast(Native Method) at java.util.zip.Inflater.inflateBytes(Inflater.java:360) at java.util.zip.Inflater.inflate(Inflater.java:218) at java.util.zip.Inflater.inflate(Inflater.java:235) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:383) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427) at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at
Why does Solr commit block indexing?
Hi, See the log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler: <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>. When a commit() runs, it blocks indexing (all incoming update requests are blocked until the commit operation is finished); at the end of the log we notice a 4 minute gap during which none of the Solr clients trying to add data receive any attention. This is a bit annoying as it leads to timeout exceptions on the client side. Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments. I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server should continue to receive update requests (maybe at a slower rate than normal). But it looks like that is not the case. Is this normal behaviour? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru
Re: indexing a lot of XML dokuments
I have been very successful in following this example: http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example Adam

On Thu, Dec 16, 2010 at 5:44 AM, Jörg Agatz joerg.ag...@googlemail.com wrote: Hi users, I'm searching for a way to index a lot of XML documents as fast as possible. I have more than 1 million docs on Server 1 and a Solr multicore setup on Server 2 with Tomcat. I don't know how I can do it easily and quickly, and I can't find an idea in the wiki. Maybe you have some ideas? King
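If the XML has already been parsed into fields on the client side (otherwise the DIH XPath example above is the way to go), here is a hedged SolrJ sketch using StreamingUpdateSolrServer to push documents to the remote core with several background threads. The URL, core name and field names are placeholders, and the queue size and thread count would need tuning.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Queue size 1000, 4 background threads pushing to the remote core;
        // tune both so the Solr server (Server 2) stays CPU-bound.
        SolrServer solr = new StreamingUpdateSolrServer(
                "http://server2:8080/solr/core0", 1000, 4);

        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "field values extracted from one of the XML files");
            solr.add(doc);          // returns quickly; requests are batched in the background
        }
        solr.commit();              // one commit at the end, not per document
    }
}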
Multicore Search broken
Hello users, I have created a multicore Solr instance with Tomcat 6. I created two cores, mail and index2. At first, mail and index2 had the same config; after that, I changed the mail config and indexed 30 XML documents. Now when I search across both cores:
http://localhost:8080/solr/mail/select?q=*:*&shards=localhost:8080/solr/mail,localhost:8080/solr/index2
I get an error:
__
HTTP Status 500 - null java.lang.NullPointerException
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:462)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
__
When I search in one of the cores on its own, it works:
http://localhost:8080/solr/mail/select?q=*:* = 30 results
http://localhost:8080/solr/index2/select?q=*:* = one result
Does someone have an idea what is wrong?
Re: how to config DataImport Scheduling
I also have the same problem. I configured the dataimport.properties file as shown in http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example but no change occurs. Can anyone help me? -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-config-DataImport-Scheduling-tp2032000p2097768.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: STUCK Threads at org.apache.lucene.document.CompressionTools.decompress
What are you trying to do? It sounds like you're storing fields compressed, is that true (i.e. defining compressed=true in your field defs)? If so, why? It may be costing you more than you benefit. A quick test would be to stop returning anything except the score by specifying fl=score. Or at least stop returning the largest compressed fields... Make sure you've set enableLazyFieldLoading in solrconfig.xml appropriately. If there's no joy here, please post your field definitions and an example or two (with debugQuery=on) of offending queries. Best Erick On Thu, Dec 16, 2010 at 9:31 AM, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Hello guys, I am getting threads stuck forever at * org.apache.lucene.document.CompressionTools.decompress*. I am using Weblogic 10.02, with solr deployed as ear and no work manager specifically configured for this instance. Only doing simple queries at this node (q=itemId:9 or q:skuId:9). My index has 3Giga. Now i send the thread dump of the stuck threads. Does anyone ever had this kind of problem? '[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms user time=186506940.ms at java.util.zip.Inflater.inflateFast(Native Method) at java.util.zip.Inflater.inflateBytes(Inflater.java:360) at java.util.zip.Inflater.inflate(Inflater.java:218) at java.util.zip.Inflater.inflate(Inflater.java:235) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427) at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42) at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3402) at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321) at weblogic.security.service.SecurityManager.runAs(Unknown Source) at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140) at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046) at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1398) at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200) at weblogic.work.ExecuteThread.run(ExecuteThread.java:172) 'weblogic.time.TimeEventGenerator' Id=20, TIMED_WAITING on lock=weblogic.time.common.internal.timeta...@f051231a, total cpu 
time=60.ms user time=60.ms at java.lang.Object.wait(Native Method) at weblogic.time.common.internal.TimeTable.snooze(TimeTable.java:286) at weblogic.time.common.internal.TimeEventGenerator.run(TimeEventGenerator.java:117) at java.lang.Thread.run(Thread.java:595) 'JMAPI event thread' Id=21, RUNNABLE on lock=, total cpu time=1220.ms user time=880.ms 'weblogic.timers.TimerThread' Id=22, TIMED_WAITING on lock=weblogic.timers.internal.timerthr...@f050f3e4, total cpu time=1390.ms user time=1080.ms at java.lang.Object.wait(Native Method) at weblogic.timers.internal.TimerThread$Thread.run(TimerThread.java:265) '[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'' Id=74, RUNNABLE on lock=, total cpu time=180761590.ms user time=180706770.ms at java.util.zip.Inflater.inflateFast(Native Method) at java.util.zip.Inflater.inflateBytes(Inflater.java:360) at java.util.zip.Inflater.inflate(Inflater.java:218) at java.util.zip.Inflater.inflate(Inflater.java:235) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:383) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229) at
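A small SolrJ sketch of the fl=score test Erick suggests, to see whether fetching (and inflating) stored fields is where these threads spend their time. The URL and port are placeholders for wherever the WebLogic-deployed Solr is listening, and itemId:7288407 stands in for one of the offending queries.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FlScoreTest {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:7001/solr");

        // Same query that gets stuck, but return only the score so no stored
        // (possibly compressed) fields have to be fetched and decompressed.
        SolrQuery q = new SolrQuery("itemId:7288407");
        q.setFields("score");

        long start = System.currentTimeMillis();
        solr.query(q);
        System.out.println("fl=score took " + (System.currentTimeMillis() - start) + " ms");
    }
}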
Re: Determining core name from a result?
: Subject: Determining core name from a result? FYI: some people may be confused because of terminology -- I think what you are asking is how to know which *shard* a document came from when doing a distributed search. This isn't currently supported; there is an open issue tracking it... https://issues.apache.org/jira/browse/SOLR-705 -Hoss
Re: Query performance issue while using EdgeNGram
A couple of observations:
1) Your regex at query time is interesting. You're using KeywordTokenizer, so input of search me becomes searchme before it goes through the parser. Is this your intent?
2) Why are you using EdgeNGrams for auto-suggest? The TermsComponent is an easier, more efficient solution unless you have some special needs; see here: http://wiki.apache.org/solr/TermsComponent Are you trying to suggest *terms* or complete queries? Because if it's just on a term basis, TermsComponent seems much simpler. Jay's example on the Lucid web site (if that's where you started down this path) is for implementing *query* selection.
3) I'd think about checking your caches. I'm not real comfortable with a min gram size of 1 and then warming the whole alphabet. See the admin stats page and look for evictions. You're also bloating the size of your index pretty significantly because of the huge number of unique terms you'll be generating.
4) Optimizing is not all that useful unless you've deleted a bunch of documents, despite the name. What it does do is force a complete reload of the underlying index/caches, so possibly you're seeing resource contention here because of that. *After* the index is warmed, do you see performance differences between optimized and un-optimized indexes? If not, think about only optimizing during off hours.
Best Erick

On Thu, Dec 16, 2010 at 2:47 AM, Shanmugavel SRD srdshanmuga...@gmail.com wrote: While using auto suggest with EdgeNGramFilterFactory in SOLR 1.4.1, we are having a performance issue with query response time. For example, even though 'p' is in auto warming, if I search for 'people' immediately after optimization is completed, the search on 'people' takes 11-15 secs to respond. But a subsequent search on 'people' responds in less than 1 sec. I want to understand why it takes 11 secs to respond and how to reduce it to 1 sec. These are the configurations below. Could anyone suggest what I am missing here? 1) Added query warming 2) Decreased mergeFactor to '3' 3) Increased HashDocSet maxSize to '7000' (which is 1432735 * 0.005) 4) Optimized after the data import. Data is indexed from a CSV file; optimize is called immediately after data import.
No of docs : 1432735 solrconfig.xml indexDefaults useCompoundFilefalse/useCompoundFile mergeFactor3/mergeFactor ramBufferSizeMB32/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout lockTypesingle/lockType /indexDefaults mainIndex useCompoundFilefalse/useCompoundFile ramBufferSizeMB32/ramBufferSizeMB mergeFactor3/mergeFactor maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength unlockOnStartupfalse/unlockOnStartup /mainIndex updateHandler class=solr.DirectUpdateHandler2 maxPendingDeletes10/maxPendingDeletes /updateHandler query maxBooleanClauses1024/maxBooleanClauses filterCache class=solr.LRUCache size=16384 initialSize=4096 autowarmCount=4096/ queryResultCache class=solr.LRUCache size=16384 initialSize=4096 autowarmCount=4096/ documentCache class=solr.LRUCache size=5000 initialSize=5000 / enableLazyFieldLoadingtrue/enableLazyFieldLoading queryResultWindowSize50/queryResultWindowSize queryResultMaxDocsCached200/queryResultMaxDocsCached HashDocSet maxSize=7000 loadFactor=0.75/ listener event=newSearcher class=solr.QuerySenderListener arr name=queries lst str name=qa/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qb/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qc/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qd/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qe/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qf/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qg/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qh/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qi/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qj/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qk/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=ql/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qm/strstr name=qttypeahead/strstr name=start0/strstr name=rows100/str/lst lst str name=qn/strstr name=qttypeahead/strstr
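For comparison, a hedged SolrJ sketch of a TermsComponent-based suggestion request, as an alternative to the EdgeNGram field. It assumes a /terms request handler has been registered in solrconfig.xml and that suggest_field is the field to pull terms from; both names are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsSuggest {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Prefix-based suggestions straight from the term dictionary,
        // no n-gram expansion of the index required.
        SolrQuery q = new SolrQuery();
        q.set("qt", "/terms");
        q.set("terms", "true");
        q.set("terms.fl", "suggest_field");
        q.set("terms.prefix", "peo");
        q.set("terms.limit", "10");

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse());   // raw NamedList with the matching terms
    }
}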
Re: Determining core name from a result?
Oops! Sorry, I thought shard and core were one and the same and the terms could be used interchangeably - I've got a multicore setup which I'm able to search across by using the shards parameter. I think you're right, that *is* the question I was asking. Thanks for letting me know it's not supported yet. I guess the easiest thing for me to do right now is to add another field to each document saying which core it was inserted into. Thanks again Mark

On 16 Dec 2010, at 3:46 pm, Chris Hostetter wrote: : Subject: Determining core name from a result? FYI: some people may be confused because of terminology -- I think what you are asking is how to know which *shard* a document came from when doing a distributed search. This isn't currently supported; there is an open issue tracking it... https://issues.apache.org/jira/browse/SOLR-705 -Hoss -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
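A minimal SolrJ sketch of that workaround: stamp each document with the name of the core it is sent to, so the value can be read back from distributed results and used to route the follow-up MLT request. The core_name field is hypothetical and would need to be declared as a stored string field in the schema.

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithCoreName {
    // Write the originating core into the document itself, since distributed
    // search does not report which shard a hit came from (SOLR-705).
    static void addToCore(String coreName, SolrInputDocument doc) throws Exception {
        CommonsHttpSolrServer core =
                new CommonsHttpSolrServer("http://localhost:8984/solr/" + coreName);
        doc.addField("core_name", coreName);
        core.add(doc);
    }
}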
Case Insensitive sorting while preserving case during faceted search
Hi, I am trying to do a facet search and sort the facet values too. First I tried with 'solr.TextField' as the field type, but this does not return sorted facet values. After referring to the FAQ (http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F), I changed it to 'solr.StrField', and that did work. But the sorting was not always correct, e.g. 'ALPHA' was sorted above 'Abacus'. Then I followed the example schema.xml and created a copyField of type 'alphaOnlySort':

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

But this gave another problem if the data contains any non-alpha characters (they were replaced). Hence I removed the PatternReplaceFilterFactory from the above definition and it worked well. But the sorted facet values don't have their case preserved anymore. How can I get around this? Thank You. Regards, Shan -- View this message in context: http://lucene.472066.n3.nabble.com/Case-Insensitive-sorting-while-preserving-case-during-faceted-search-tp2099248p2099248.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: STUCK Threads at org.apache.lucene.document.CompressionTools.decompress
2010/12/16 Erick Erickson erickerick...@gmail.com What are you trying to do? It sounds like you're storing fields compressed, is that true (i.e. defining compressed=true in your field defs)? If so, why? It may be costing you more than you benefit. No compressed fields in my schema A quick test would be to stop returning anything except the score by specifying fl=score. Or at least stop returning the largest compressed fields... Make sure you've set enableLazyFieldLoading in solrconfig.xml appropriately. lazy loading set TRUE If there's no joy here, please post your field definitions and an example or two (with debugQuery=on) of offending queries. The only type of query I do in this instance: q=itemId:7288407 (obviously, id may vary) debug result: lst name=debug str name=rawquerystringitemId:7288407/str str name =querystringitemId:7288407/str str name=parsedqueryitemId:7288407 /str str name=parsedquery_toStringitemId:#8;#0;#0;Þ㙗/str lst name= explain str name=7288407 11.873255 = (MATCH) fieldWeight(itemId:#8;#0;#0;Þ㙗 in 187), product of: 1.0 = tf(termFreq(itemId:#8;#0;#0;Þ㙗)=1) 11.873255 = idf(docFreq=4, maxDocs=263733) 1.0 = fieldNorm(field=itemId, doc=187)/str /lst str name=QParserLuceneQParser/str lst name=timing double name=time 26.0/double lst name=prepare double name=time3.0/double lst name=org.apache.solr.handler.component.QueryComponent double name=time 1.0/double /lst lst name= org.apache.solr.handler.component.FacetComponent double name=time0.0 /double /lst lst name= org.apache.solr.handler.component.MoreLikeThisComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.HighlightComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.StatsComponent double name=time0.0 /double /lst lst name= org.apache.solr.handler.component.SpellCheckComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.DebugComponent double name=time0.0 /double /lst /lst lst name=process double name=time21.0 /double lst name=org.apache.solr.handler.component.QueryComponent double name=time0.0/double /lst lst name= org.apache.solr.handler.component.FacetComponent double name=time0.0 /double /lst lst name= org.apache.solr.handler.component.MoreLikeThisComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.HighlightComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.StatsComponent double name=time0.0 /double /lst lst name= org.apache.solr.handler.component.SpellCheckComponent double name=time 0.0/double /lst lst name= org.apache.solr.handler.component.DebugComponent double name=time21.0 /double /lst /lst Best Erick On Thu, Dec 16, 2010 at 9:31 AM, Alexander Ramos Jardim alexander.ramos.jar...@gmail.com wrote: Hello guys, I am getting threads stuck forever at * org.apache.lucene.document.CompressionTools.decompress*. I am using Weblogic 10.02, with solr deployed as ear and no work manager specifically configured for this instance. Only doing simple queries at this node (q=itemId:9 or q:skuId:9). My index has 3Giga. Now i send the thread dump of the stuck threads. Does anyone ever had this kind of problem? 
'[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'' Id=19, RUNNABLE on lock=, total cpu time=187228990.ms user time=186506940.ms at java.util.zip.Inflater.inflateFast(Native Method) at java.util.zip.Inflater.inflateBytes(Inflater.java:360) at java.util.zip.Inflater.inflate(Inflater.java:218) at java.util.zip.Inflater.inflate(Inflater.java:235) at org.apache.lucene.document.CompressionTools.decompress(CompressionTools.java:108) at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:607) at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:368) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:229) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427) at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at
Re: Multicore Search broken
I have tried some things and now have news. When I search with http://localhost:8080/solr/mail/select?q=*:*&shards=localhost:8080/solr/mail,localhost:8080/solr/mail it works, so it looks like it is not a problem with Java or anything like that. I have an idea: is it possible that it's the differences between the configs? Please, if you have an idea, let me know...
Re: Why does Solr commit block indexing?
Unfortunately, (I think?) Solr currently commits by closing the IndexWriter, which must wait for any running merges to complete, and then opening a new one. This is really rather silly because IndexWriter has had its own commit method (which does not block ongoing indexing nor merging) for quite some time now. I'm not sure why we haven't switched over already... there must be some trickiness involved. Mike On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru renaud.del...@deri.org wrote: Hi, See log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler: mergeScheduler class=org.apache.lucene.index.ConcurrentMergeScheduler/ When a commit() runs, it blocks indexing (all imcoming update requests are blocked until the commit operation is finished) ... at the end of the log we notice a 4 minute gap during which none of the solr cients trying to add data receive any attention. This is a bit annoying as it leads to timeout exception on the client side. Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server still continue to receive update requests (maybe at a slower rate than normal). But it looks like it is not the case. Is it a normal behaviour ? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru
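For illustration, a plain-Lucene sketch of the commit() Mike refers to. This is not how Solr itself is currently wired; it just shows that IndexWriter.commit() makes the changes so far durable without closing the writer, so adds can keep flowing. The index path and analyzer choice are arbitrary.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NonBlockingCommit {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // commit() flushes and syncs what has been indexed so far, and readers
        // opened afterwards will see it, all without closing the writer.
        writer.commit();

        doc = new Document();
        doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);   // not blocked by the earlier commit

        writer.close();
    }
}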
RE: Dataimport performance
We have ~50 long-running SQL queries that need to be joined and denormalized. Not all of the queries are to the same db, and some data comes from fixed-width data feeds. Our current search engine (that we are converting to SOLR) has a fast disk-caching mechanism that lets you cache all of these data sources and then it will join them locally prior to indexing. I'm in the process of developing something similar for DIH that uses the Berkley db to do the same thing. Its good enough that I can do nightly full re-indexes of all our data while developing the front-end, but it is still very rough. Possibly I would like to get this refined enough to eventually submit as a jira ticket / patch as it seems this is a somewhat common problem that needs solving. Even with our current search engine, the join denormalize step is always the longest-running part of the process. However, I have it running fairly fast by partitioning the data by a modulus of the primary key and then running several jobs in parallel. The trick is not to get I/O bound. Things run fast if you can set it up to maximize CPU. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Ephraim Ofir [mailto:ephra...@icq.com] Sent: Thursday, December 16, 2010 3:04 AM To: solr-user@lucene.apache.org Subject: RE: Dataimport performance Check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e This approach of not using sub entities really improved our load time. Ephraim Ofir -Original Message- From: Robert Gründler [mailto:rob...@dubture.com] Sent: Wednesday, December 15, 2010 4:49 PM To: solr-user@lucene.apache.org Subject: Re: Dataimport performance i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec Without join and without subquery: 500k in 2 min 30. With subquery and with left join: 320k in 6 Min 30 so the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery. -robert On Dec 15, 2010, at 15:43 , Tim Heckman wrote: 2010/12/15 Robert Gründler rob...@dubture.com: The data-config.xml looks like this (only 1 entity): entity name=track query=select t.id as id, t.title as title, l.title as label from track t left join label l on (l.id = t.label_id) where t.deleted = 0 transformer=TemplateTransformer field column=title name=title_t / field column=label name=label_t / field column=id name=sf_meta_id / field column=metaclass template=Track name=sf_meta_class/ field column=metaid template=${track.id} name=sf_meta_id/ field column=uniqueid template=Track_${track.id} name=sf_unique_id/ entity name=artists query=select a.name as artist from artist a left join track_artist ta on (ta.artist_id = a.id) where ta.track_id=${track.id} field column=artist name=artists_t / /entity /entity So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import. For my own purposes, I wrote a custom data import using SolrJ to improve the performance (from 3 hours to 10 minutes). Just as a test, how long does it take if you comment out the artists entity?
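A rough sketch of the partition-by-modulus idea James describes, assuming a JDBC source and SolrJ on the indexing side; the table, column names, connection details and partition count are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartitionedLoader {
    static final int PARTITIONS = 4;   // one worker per partition, tune to CPU count

    public static void main(String[] args) throws Exception {
        List<Thread> workers = new ArrayList<Thread>();
        for (int p = 0; p < PARTITIONS; p++) {
            final int partition = p;
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        load(partition);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) t.join();
        // One commit once every partition has finished.
        new CommonsHttpSolrServer("http://localhost:8983/solr").commit();
    }

    static void load(int partition) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection con = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
        // Each worker takes the slice of rows whose primary key falls in its modulus class.
        PreparedStatement ps = con.prepareStatement(
                "select id, title from item where mod(id, ?) = ?");
        ps.setInt(1, PARTITIONS);
        ps.setInt(2, partition);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getLong("id"));
            doc.addField("title", rs.getString("title"));
            solr.add(doc);
        }
        con.close();
    }
}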
Re: PHPSolrClient
So just use add and overwrite. OK, thanks Dennis Gearon Signature Warning - - Original Message From: Tanguy Moal tanguy.m...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, December 16, 2010 1:33:36 AM Subject: Re: PHPSolrClient Hi Dennis, Not particular to the client you use (solr-php-client) for sending documents, think of update as an overwrite. This means that if you update a particular document, the previous version indexed is lost. Therefore, when updating a document, make sure that all the fields to be indexed and retrieved are present in the update. For an update to occur, only the uniqueKey id (as specified in your schema.xml) has to be the same as the document you want to update. Shortly, an update is like an add, (and performed the same way) except that the added document was previously indexed. It simple gets replaced by the update. Hope that helps, -- Tanguy 2010/12/16 Dennis Gearon gear...@sbcglobal.net: First of all, it's a very nice piece of work. I am just getting my feet wet with Solr in general. So I 'am not even sure how a document is NORMALLY deleted. The library PHPDocs say 'add', 'get' 'delete', But does anyone know about 'update'? (obviously one can read-delete-modify-create) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
RE: Memory use during merges (OOM)
Hello we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index. Could anyone explain why that is bad? I didn't really understand the conclusion below. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Thursday, December 16, 2010 2:51 AM To: solr-user@lucene.apache.org Subject: Re: Memory use during merges (OOM) RAM usage for merging is tricky. First off, merging must hold open a SegmentReader for each segment being merged. However, it's not necessarily a full segment reader; for example, merging doesn't need the terms index nor norms. But it will load deleted docs. But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Furthermore, if the deletions you (by Term/Query) do in fact result in deleted documents (ie they were not false deletions), then the merging allocates an int[maxDoc()] for each SegmentReader that has deletions. Finally, if you have multiple merges running at once (see CSM.setMaxMergeCount) that means RAM for each currently running merge is tied up. So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions. If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. Mike
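A minimal Lucene-level sketch of the add-versus-update distinction Mike describes, assuming the Lucene 2.9.x API that Solr 1.4 ships with; the directory path and field names are illustrative only:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddVersusUpdate {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/index")),
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));

        // Document known to be new: a plain add buffers no delete, so merging
        // does not have to load the terms index on its account.
        writer.addDocument(doc);

        // Document that may already exist: updateDocument is delete-by-term plus add.
        // Even when the term matches nothing (a "false deletion"), applying the
        // buffered delete forces the terms index of each segment to be loaded.
        writer.updateDocument(new Term("id", "doc-1"), doc);

        writer.close();
    }
}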
Re: Thank you!
If I ever make it, Wikipedia, Stack Overflow, PHP, Symfony, Doctrine, and Apache are all going to get donations. I already sent $20 to Wikipedia; they're hurting now. Dennis Gearon - Original Message From: kenf_nc ken.fos...@realestate.com To: solr-user@lucene.apache.org Sent: Thu, December 16, 2010 6:11:24 AM Subject: Re: Thank you! Hear hear! In the beginning of my journey with Solr/Lucene I couldn't have done it without this site. Smiley and Pugh's book was useful, but this forum was invaluable. I don't have as many questions now, but each new venture (geospatial searching, replication and redundancy, performance tuning) brings me back again and again. This and stackoverflow.com have to be two of the most useful destinations on the internet for developers. Communities are so much more relevant than reference materials, and the consistent activity in this community is impressive.
Re: Memory use during merges (OOM)
It's not that it's bad, it's just that Lucene must do extra work to check if these deletes are real or not, and that extra work requires loading the terms index which will consume additional RAM. For most apps, though, the terms index is relatively small and so this isn't really an issue. But if your terms index is large this can explain the added RAM usage. One workaround for large terms index is to set the terms index divisor that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor). Mike
Re: bulk commits
What is it that you are trying to commit? a On Thu, Dec 16, 2010 at 1:03 PM, Dennis Gearon gear...@sbcglobal.net wrote: What have people found as the best way to do bulk commits, either from the web or from a file on the system? Dennis Gearon
Re: bulk commits
This is how I import a lot of data from a csv file. There are close to 100k records in there. Note that you can either pre-define the column names using the fieldnames param like I did here *or* include header=true which will automatically pick up the column header if your file has it.
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
This seems to load everything into some kind of temporary location before it's actually committed. If something goes wrong there is a rollback feature that will undo anything that happened before the commit. As far as batching a bunch of files, I copied and pasted the following into Cygwin and it worked just fine.
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xai.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaj.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xak.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xal.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl
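For completeness, the same kind of CSV load can be driven from SolrJ instead of curl. This is only a sketch against a Solr 1.4-era SolrJ client; the file path and field names simply mirror the hypothetical ones used in the curl commands above:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvLoadSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Upload the file to the CSV request handler registered at /update/csv.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("C:/tmp/cities1000.csv"));
        req.setParam("separator", ",");
        req.setParam("fieldnames",
                "id,name,asciiname,lat,lng,countrycode,population,"
              + "elevation,gtopo30,timezone,modificationdate,cat");
        req.setParam("overwrite", "true");
        // Commit once the stream has been consumed.
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
    }
}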
Query Problem
Hi all, I have the following problem. I have this set of data (View data (Pastebin) http://pastebin.com/jKbUhjVS ). If I do a search for *SectionName:Programas_Home* I have no results: Returned Data (Pastebin) http://pastebin.com/wnPdHqBm If I do a search for *Programas_Home* I have only 1 result: Result Returned (Pastebin) http://pastebin.com/fMZkLvYK If I do a search for SectionName:Programa* I have 1 result: Result Returned (Pastebin) http://pastebin.com/kLLnVp4b This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin), and this is my *solrconfig* (Pastebin). I don't understand why searching for SectionName:Programas_Home isn't returning any results at all... Can someone shed some light on this? -- __ Ezequiel. Http://www.ironicnet.com
RE: Memory use during merges (OOM)
Thanks Mike, But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete + add. One workaround for large terms index is to set the terms index divisor that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor). I always get confused about the two different divisors and their names in the solrconfig.xml file. We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor: <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory"> <int name="termInfosIndexDivisor">8</int> </indexReaderFactory> The other one is termIndexInterval, which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr. Are we setting the right one to reduce RAM usage during merging? So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions. Does an optimize do something differently? Tom
RE: Memory use during merges (OOM)
Thanks Mike! When you say 'term index of the segment readers', are you referring to the term vectors? In our case our index of 8 million docs holds pretty 'skinny' docs containing searchable product titles and keywords, with the rest of the doc only holding Ids for faceting upon. Docs typically only have unique terms per doc, with a lot of overlap of the terms across categories of docs (all similar products). I'm thinking that our unique terms are low vs the size of our index. The way we spin out deletes and adds should keep the terms loaded all the time. Seems like once in a couple weeks a propagation happens which kills the slave farm with OOMs. We are bumping the heap up a couple gigs every time this happens and hoping it goes away at this point. That is why I jumped into this discussion, sorry for butting in like that. you guys are discussing very interesting settings I had not considered before. Rob
Re: Memory use during merges (OOM)
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Mike, But, if you are doing deletions (or updateDocument, which is just a delete + add under-the-hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete +add. OK so you should do the .updateDocument not .addDocument. One workaround for large terms index is to set the terms index divisor .that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor). I always get confused about the two different divisors and their names in the solrconfig.xml file We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor indexReaderFactory name=IndexReaderFactory class=org.apache.solr.core.StandardIndexReaderFactory int name=termInfosIndexDivisor8/int /indexReaderFactory The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr. Are we setting the right one to reduce RAM usage during merging? It's even more confusing! There are three settings. First tells IW how frequent the index terms are (default is 128). Second tells IndexReader whether to sub-sample these on load (default is 1, meaning load all indexed terms; but if you set it to 2 then 2*128 = every 256th term is loaded). Third, IW has the same setting (subsampling) to be used whenever it internally must open a reader (eg to apply deletes). The last two are really the same setting, just that one is passed when you open IndexReader yourself, and the other is passed whenever IW needs to open a reader. But, I'm not sure how these settings are named in solrconfig.xml. So I think the gist is... the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions Does an optimize do something differently? No, optimize is the same deal. But, because it's a big merge (especially the last one), it's the highest RAM usage of all merges. Mike
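A Lucene-level sketch of the three settings Mike lists, assuming the Lucene 2.9.x API that Solr 1.4 ships with; the path and the numbers are illustrative, not recommendations:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TermsIndexSettings {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // 1) How often a term is written to the terms index (.tii) at index time:
        //    by default every 128th term.
        writer.setTermIndexInterval(128);

        // 2) Sub-sampling used by readers the writer opens internally, for example
        //    to apply deletes during merging: 2 means only every 2*128th indexed
        //    term is loaded into RAM.
        writer.setReaderTermsIndexDivisor(2);

        // 3) The same sub-sampling for a reader you open yourself.
        IndexReader reader = IndexReader.open(dir, null, true, 2);

        reader.close();
        writer.close();
    }
}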
Re: Memory use during merges (OOM)
Actually the terms index is something different. If you don't use CFS, go and look at the size of *.tii in your index directory -- those are the terms index. The terms index picks a subset of the terms (by default every 128th term) to hold in RAM (plus some metadata) in order to make seeking to a specific term faster. Unfortunately they are held in a RAM-intensive way, but in the upcoming 4.0 release we've greatly reduced that. Mike
Re: Memory use during merges (OOM)
On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom tburt...@umich.edu wrote: I always get confused about the two different divisors and their names in the solrconfig.xml file This one (for the writer) isn't configurable by Solr. Want to open an issue? We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory"> <int name="termInfosIndexDivisor">8</int> </indexReaderFactory> The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr. Are we setting the right one to reduce RAM usage during merging? When you write the terms, it creates a terms dictionary and a terms index. The termsIndexInterval (default 128) controls how many terms go into the index, for example every 128th term. The divisor just samples this at runtime... e.g. with your divisor of 8 it's only reading every 8th term from the index [or: every 8*128th term is read into RAM, another way to see it]. Your setting isn't being applied to the reader IW uses during merging... it's only for readers Solr opens from directories explicitly. I think you should open a JIRA issue!
Re: Dataimport performance
Hi, LuSqlV2 beta comes out in the next few weeks, and is designed to address this issue (among others). LuSql original (http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql now moved to: https://code.google.com/p/lusql/) is a JDBC-->Lucene high performance loader. You may have seen my posts on this list suggesting LuSql as a high performance alternative to DIH, for a subset of use cases. LuSqlV2 has evolved into a full extract-transform-load (ETL) high performance engine, focusing on many of the issues of interest to the Lucene/SOLR community. It has a pipelined, pluggable, multithreaded architecture. It is basically: pluggable source --> 0 or more pluggable filters --> pluggable sink. Source plugins implemented: JDBC, Lucene, SOLR (SolrJ), BDB, CSV, RMI, Java Serialization. Sink plugins implemented: JDBC, Lucene, SOLR (SolrJ), BDB, XML, RMI, Java Serialization, Tee, NullSink [I am working on a memcached Sink]. A number of different filters are implemented (i.e. get a PDF file from the filesystem based on an SQL field and convert it to text, etc.), including: BDBJoinFilter, JDBCJoinFilter -- This particular problem is one of the unit tests I have: given a simple database of: 1- table Name 2- table City 3- table nameCityJoin 4- table Job 5- table nameJobJoin, run a JDBC-->BDB LuSql instance for each of City+nameCityJoin and Job+nameJobJoin; then run a JDBC-->SolrJ on table Name, adding 2 BDBJoinFilters, each of which takes the BDB generated earlier and does the join (you just tell the filters which field from the JDBC-generated record to use against the BDB key). So your use case is a larger example of this. Also of interest: - Java RMI (Remote Method Invocation): both an RMISink(Server) and RMISource(Client) are implemented. This means you can set up N machines which are doing something, and have one or more clients (on their own machines) that are pulling this data and doing something with it. For example, JDBC-->PDFToTextFilter-->RMI (converting PDF files to text based on the contents of a SQL database, with text files in the file system): basically doing some heavy lifting, and then start up an RMI-->SolrJ (or Lucene) instance which is a client to the N PDF-converting machines, doing only the Lucene/SOLR indexing. The client does a pull when it needs more data. You can have N servers x M clients! Oh, string fields longer than 1024 are automatically gzipped by the RMI Sink(Server), to reduce network (at the cost of cpu: selectable). I am looking into RMI alternatives, like Thrift and ProtoBuf, for my next Sources/Sinks to implement. Another example is the reverse use case: when the indexing is more expensive than getting the data. Example: one JDBC-->RMISink(Server) instance, N RMISource(Client)-->Lucene instances; this allows multiple Lucenes to be fed from a single JDBC source, across machines. - TeeSink: the Tee sink hides N sinks, so you can split the pipeline into multiple Sinks. I've used it to send the same content to Lucene as well as BDB in one fell swoop. Can you say index and content store in one step? I am working on cleaning up the code, writing docs (I made the mistake of making great docs for LuSqlV1, so I have work to do...!), and making a couple more tests. I will announce the beta on this and the Lucene list. If you have any questions, please contact me.
Thanks, Glen Newton http://zzzoot.blogspot.com -- Old LuSql benchmarks: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James james.d...@ingrambook.com wrote: We have ~50 long-running SQL queries that need to be joined and denormalized. Not all of the queries are to the same db, and some data comes from fixed-width data feeds. Our current search engine (that we are converting to SOLR) has a fast disk-caching mechanism that lets you cache all of these data sources and then it will join them locally prior to indexing. I'm in the process of developing something similar for DIH that uses the Berkley db to do the same thing. Its good enough that I can do nightly full re-indexes of all our data while developing the front-end, but it is still very rough. Possibly I would like to get this refined enough to eventually submit as a jira ticket / patch as it seems this is a somewhat common problem that needs solving. Even with our current search engine, the join denormalize step is always the longest-running part of the process. However, I have it running fairly fast by partitioning the data by a modulus of the primary key and then running several jobs in parallel. The trick is not to get I/O bound. Things run fast if you can set it up to maximize CPU. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Ephraim Ofir [mailto:ephra...@icq.com] Sent: Thursday, December 16, 2010 3:04 AM To: solr-user@lucene.apache.org Subject: RE: Dataimport performance Check out
Re: bulk commits
That easy, huh? Heck, this gets better and better. BTW, how about escaping? Dennis Gearon
Re: bulk commits
On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote: That easy, huh? Heck, this gets better and better. BTW, how about escaping? The CSV escaping? It's configurable to allow for loading different CSV dialects. http://wiki.apache.org/solr/UpdateCSV By default it uses double quote encapsulation, like excel would. The bottom of the wiki page shows how to configure tab separators and backslash escaping like MySQL produces by default. -Yonik http://www.lucidimagination.com
Re: Query Problem
Ezequiel: Nice job of including relevant details, by the way. Unfortunately I'm puzzled too. Your SectionName is a string type, so it should be placed in the index as-is. Be a bit cautious about looking at returned results (as I see in one of your xml files) because the returned values are the verbatim, stored field, NOT what's tokenized, and the tokenized data is what's searched. That said, your SectionName should not be tokenized at all because it's a string type. Take a look at the admin page, schema browser, and see what the values for SectionName look like (these will be the tokenized values). They should be exactly Programas_Home, complete with underscore, case changes, etc. Is that the case? Another place that might help is the admin/analysis page. Check the debug boxes and input your steps, and it'll show you what transformations are applied. But a quick look leaves me completely baffled. Sorry I can't be more help. Erick
RE: Memory use during merges (OOM)
Your setting isn't being applied to the reader IW uses during merging... it's only for readers Solr opens from directories explicitly. I think you should open a jira issue! Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied? <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory"> <int name="termInfosIndexDivisor">8</int> </indexReaderFactory> I understand the trade-offs for doing this during searching, but not the trade-offs for doing this during merging. Is the use during merging similar to the use during searching? i.e. some process has to look up data for a particular term as opposed to having to iterate through all the terms? (Haven't yet dug into the merging/indexing code.) Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor
Re: Memory use during merges (OOM)
On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom tburt...@umich.edu wrote: Your setting isn't being applied to the reader IW uses during merging... it's only for readers Solr opens from directories explicitly. I think you should open a jira issue! Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied? Yes. I'm not really sure (especially given the name=) if you can, or if it was planned, to have multiple IR factories in Solr, e.g. a separate one for spellchecking. So I'm not sure if we should (hackishly) steal this parameter from the IR factory (it is common to all IRFactories, not just StandardIRFactory) and apply it to IW... but we could at least expose the divisor param separately in the IW config so you have some way of setting it. <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory"> <int name="termInfosIndexDivisor">8</int> </indexReaderFactory> I understand the trade-offs for doing this during searching, but not the trade-offs for doing this during merging. Is the use during merging similar to the use during searching? i.e. some process has to look up data for a particular term as opposed to having to iterate through all the terms? (Haven't yet dug into the merging/indexing code.) It needs it for applying deletes... As a workaround (if you are reindexing), maybe instead of using the terms index divisor = 8 you could set the terms index interval = 1024 (8 * 128)? This will solve your merging problem and have the same perf characteristics as divisor=8, except you can't go back down like you can with the divisor without reindexing with a smaller interval... If you've already tested that performance with the divisor of 8 is acceptable, or in your case maybe necessary!, it sort of makes sense to 'bake it in' by setting your divisor back to 1 and your interval = 1024 instead...
Re: Memory use during merges (OOM)
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless luc...@mikemccandless.com wrote: If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. FWIW, if you're going to add a batch of documents you know aren't already in the index, you can use the overwrite=false parameter for that Solr update request. -Yonik http://www.lucidimagination.com
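A sketch of Yonik's suggestion from SolrJ (again assuming a Solr 1.4-era client); with overwrite=false the add carries no delete-by-uniqueKey, so no false deletion is buffered. The core URL and field values are illustrative only:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class AddWithoutOverwrite {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "new-doc-1");   // known not to be in the index yet
        doc.addField("title", "a brand new document");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        // Skip the implicit delete that a normal add performs for the uniqueKey.
        req.setParam("overwrite", "false");
        req.process(server);

        server.commit();
    }
}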
Re: Query Problem
I'll check the Tokenizer to see if that's the problem. The results of the Analysis page for SectionName:Programas_Home: Query Analyzer org.apache.solr.schema.FieldType$DefaultAnalyzer {} -- term position: 1, term text: Programas_Home, term type: word, source start,end: 0,14, payload: (empty). So it's not having problems with that... Also in the debug output you can see that the parsed query is correct... So I don't know where to look... I know nothing about stemming or tokenizing, but I will look into whether that has anything to do with it. If anyone can help me out, please do :D -- __ Ezequiel. Http://www.ironicnet.com
Re: Query Problem
OK, what version of Solr are you using? I can take a quick check to see what behavior I get. Erick
Re: Query Problem
The jars are named like *1.4.1*, so I suppose it's version 1.4.1. Thanks! -- __ Ezequiel. Http://www.ironicnet.com
Re: Jquery Autocomplete Json formatting ?
Installed Firebug. Now getting the following error: 4139 matches.call( document.documentElement, [test!='']:sizzle ); Though my Solr server is running on port 8983, I am not using any server to run this jQuery; it's just an HTML file in my home folder that I am opening in my Firefox browser. - Kumar Anurag
Re: Faceted Search Slows Down as index gets larger
I am sorry for raising up this thread after 6 months. But we still have problems with faceted search on full-text fields. We try to get the most frequent words in a text field created within 1 hour. The faceted search takes too much time: even though the matching number of documents (created_at within 1 HOUR) stays constant (10-20K), the query gets slower as the total number of documents increases (now 20M). Solr throws exceptions and does not respond. We have to restart and delete old docs. (3G RAM) Index is around 2.2 GB. And we store the data in solr as well. The documents are small. $response = $solr->search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1, array( 'facet' => 'true', 'facet.field' => $field, 'facet.mincount' => 1, 'facet.method' => 'enum', 'facet.enum.cache.minDf' => 100 )); Yonik had suggested distributed search. But I am not sure if we set every configuration correctly, for example the solr caches, if they are related to faceted searching. We use the default values: <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/> Any help is appreciated. On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote: We try to provide real-time search. So the index is changing almost every minute. We commit for every 100 documents received. The facet search is executed every 5 mins. OK, that's the problem - pretty much every facet search is rebuilding the facet cache, which takes most of the time (and facet.fc is more expensive than facet.enum in this regard). One strategy is to use distributed search... have some big cores that don't change often, and then small cores for the new stuff that changes rapidly. -Yonik http://www.lucidimagination.com -- Furkan Kuru
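For reference, here is roughly the same request expressed with the 1.4-era SolrJ API, with the hourly window moved into a filter query so it can be cached in the filterCache. This is only a sketch; the server URL and the "text" field name are placeholders, not taken from the thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HourlyTermFacets {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("created_at:[NOW-1HOUR TO NOW]"); // cached in the filterCache
            q.setRows(0);                                      // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("text");                           // placeholder full-text field
            q.setFacetMinCount(1);
            q.set("facet.method", "enum");
            q.set("facet.enum.cache.minDf", 100);
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetField("text").getValues());
        }
    }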
Re: how to config DataImport Scheduling
I also have the same problem. I configured the dataimport.properties file as shown in http://wiki.apache.org/solr/DataImportHandler#dataimport.properties_example but no change occurs, can anyone help me? What version of solr are you using? This seems to be a new feature, so it won't work on solr 1.4.1.
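Until that scheduler is available in your version, the usual workaround is a cron job or a small scheduled task that simply hits the DIH delta-import URL. Below is a minimal sketch of that idea in Java; the handler path, core URL and interval are assumptions, adjust them to your setup:

    import java.io.InputStream;
    import java.net.URL;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DeltaImportPoller {
        public static void main(String[] args) {
            final String deltaUrl = "http://localhost:8983/solr/dataimport"
                + "?command=delta-import&clean=false&commit=true";
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    try {
                        // Fire the delta-import; DIH runs the actual import asynchronously.
                        InputStream in = new URL(deltaUrl).openStream();
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }, 0, 30, TimeUnit.MINUTES);
        }
    }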
Re: Jquery Autocomplete Json formatting ?
I think this could be down to the same-origin rule applied to ajax requests. You're not allowed to pull content from two different servers :-( The good news is solr supports jsonp, which is a neat trick around this. Try this (pasted from another thread, with the callback wired through json.wrf so jQuery treats it as a JSONP request):

    var queryString = "*:*";
    $.getJSON("http://[server]:[port]/solr/select/?json.wrf=?", {
        q: queryString,
        version: "2.2",
        start: 0,
        rows: 10,
        indent: "on",
        wt: "json",
        fl: "field1"
    }, callbackFunctionToDoSomethingWithOurData);

and the callback function:

    function callbackFunctionToDoSomethingWithOurData(solrData) {
        // do stuff with your nice data
    }

cheers lee c On 16 December 2010 23:18, Anurag anurag.it.jo...@gmail.com wrote: Installed Firebug Now getting the following error 4139 matches.call( document.documentElement, [test!='']:sizzle ); Though my solr server is running on port8983, I am not using any server to run this jquery, its just an html file in my home folder that i am opening in my firefox browser. - Kumar Anurag -- View this message in context: http://lucene.472066.n3.nabble.com/Jquery-Autocomplete-Json-formatting-tp2101346p2101595.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Problem
OK, it works perfectly for me on a 1.4.1 instance. I've looked over your files a couple of times and see nothing obvious (but you'll never find anyone better at overlooking the obvious than me!). Tokenizing and stemming are irrelevant in this case because your type is string, which is an untokenizedtype so you don't need to go there. The way your query parses and analyzes backs this up, so you're getting to the right schema definition. Which may bring us to whether what's in the index is what you *think* is in there. I'm betting not. Either you changed the schema and didn't re-index (say changed index=false to index=true), you didn't commit the documents after indexing or other such-like, or changed the field type and didn't reindex. So go into /solr/admin. Click on schema browser, click on fields. Along the left you should see SectionName, click on that. That will show you the #indexed# terms, and you should see, exactly, Programas_Home in there, just like in your returned documents. Let us know if that's in fact what you do see. It's possible you're being mislead by the difference between seeing the value in a returned document (the stored value) and what's searched on (the indexed token(s)). And I'm assuming that some asterisks in your mails were really there for bolding and you are NOT doing wildcard searches for, for instance, *SectionName:Programas_Home*. But we're at a point where my 1.4.1 instance produces the results you're expecting, at least as I understand them so I don't think it's a problem with Solr, but some change you've made is producing results you don't expect but are correct. Like I said, look at the indexed terms. If you see Programas_Home in the admin console after following the steps above, then I don't know what to suggest Best Erick On Thu, Dec 16, 2010 at 5:12 PM, Ezequiel Calderara ezech...@gmail.comwrote: The jars are named like *1.4.1* . So i suppose its the version 1.4.1 Thanks! On Thu, Dec 16, 2010 at 6:54 PM, Erick Erickson erickerick...@gmail.com wrote: OK, what version of Solr are you using? I can take a quick check to see what behavior I get Erick On Thu, Dec 16, 2010 at 4:44 PM, Ezequiel Calderara ezech...@gmail.com wrote: I'll check the Tokenizer to see if that's the problem. The results of Analysis Page for SectionName:Programas_Home Query Analyzer org.apache.solr.schema.FieldType$DefaultAnalyzer {} term position 1 term text Programas_Home term type word source start,end 0,14 payload So it's not having problems with that... Also in the debug you can see that the parsed query is correct... So i don't know where to look... I know nothing about Stemming or tokenizing, but i will look if that has anything to do. If anyone can help me out, please do :D On Thu, Dec 16, 2010 at 5:55 PM, Erick Erickson erickerick...@gmail.com wrote: Ezequiel: Nice job of including relevant details, by the way. Unfortunately I'm puzzled too. Your SectionName is a string type, so it should be placed in the index as-is. Be a bit cautious about looking at returned results (as I see in one of your xml files) because the returned values are the verbatim, stored field NOT what's tokenized, and the tokenized data is what's searched.. That said, you SectionName should not be tokenized at all because it's a string type. Take a look at the admin page, schema browser and see what values for SectionName look (these will be the tokenized values. They should be exactly Programas_Name, complete with underscore, case changes, etc. Is that the case? 
Another place that might help is the admin/analysis page. Check the debug boxes and input your steps and it'll show you what the transformations are applied. But a quick look leaves me completely baffled. Sorry I can't be more help Erick On Thu, Dec 16, 2010 at 2:07 PM, Ezequiel Calderara ezech...@gmail.com wrote: Hi all, I have the following problems. I have this set of data (View data (Pastebin) http://pastebin.com/jKbUhjVS ) If i do a search for: *SectionName:Programas_Home* i have no results: Returned Data (PasteBin) http://pastebin.com/wnPdHqBm If i do a search for: *Programas_Home* i have only 1 result: Result Returned (Pastebin) http://pastebin.com/fMZkLvYK if i do a search for: SectionName:Programa* i have 1 result: Result Returned (Pastebin) http://pastebin.com/kLLnVp4b This is my *schema* http://pastebin.com/PQM8uap4 (Pastebin) and this is my *solrconfig* http://%3c/?xml version=1.0 encoding=UTF-8 ?(PasteBin) I don't understand why when searching for SectionName:Programas_Home isn't returning any results at all... Can someone send some light on this? -- __ Ezequiel.
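A quick way to run the same checks programmatically is sketched below with SolrJ: query the string field with the exact term plus debugQuery, and facet on the field to list the indexed terms, which is essentially what the schema browser shows. The server URL is a placeholder and this assumes SectionName is indexed:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SectionNameCheck {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // 1) Exact query against the string field, with debug output to see how it parsed.
            SolrQuery q = new SolrQuery("SectionName:\"Programas_Home\"");
            q.set("debugQuery", true);
            QueryResponse rsp = server.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
            System.out.println(rsp.getDebugMap().get("parsedquery"));

            // 2) Facet on the field to list the indexed terms (what the schema browser shows).
            SolrQuery terms = new SolrQuery("*:*");
            terms.setRows(0);
            terms.setFacet(true);
            terms.addFacetField("SectionName");
            System.out.println(server.query(terms).getFacetField("SectionName").getValues());
        }
    }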
Re: Faceted Search Slows Down as index gets larger
Another thing you can try is trunk. This specific case has been improved by an order of magnitude recenty. The case that has been sped up is initial population of the filterCache, or when the filterCache can't hold all of the unique values, or when faceting is configured to not use the filterCache much of the time via facet.enum.cache.minDf. -Yonik http://www.lucidimagination.com On Thu, Dec 16, 2010 at 6:39 PM, Furkan Kuru furkank...@gmail.com wrote: I am sorry for raising up this thread after 6 months. But we have still problems with faceted search on full-text fields. We try to get most frequent words in a text field that is created in 1 hour. The faceted search takes too much time even the matching number of documents (created_at within 1 HOUR) is constant (10-20K) as the total number of documents increases (now 20M) the query gets slower. Solr throws exceptions and does not respond. We have to restart and delete old docs. (3G RAM) Index is around 2.2 GB. And we store the data in solr as well. The documents are small. $response = $solr-search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1, array( 'facet' = 'true', 'facet.field'= $field, 'facet.mincount' = 1, 'facet.method' = 'enum', 'facet.enum.cache.minDf' = 100 )); Yonik had suggested distributed search. But I am not sure if we set every configuration correctly. For example the solr caches if they are related with faceted searching. We use default values: filterCache class=solr.FastLRUCache size=512 initialSize=512 autowarmCount=0/ queryResultCache class=solr.LRUCache size=512 initialSize=512 autowarmCount=0/ Any help is appreciated. On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote: We try to provide real-time search. So the index is changing almost in every minute. We commit for every 100 documents received. The facet search is executed every 5 mins. OK, that's the problem - pretty much every facet search is rebuilding the facet cache, which takes most of the time (and facet.fc is more expensive than facet.enum in this regard). One strategy is to use distributed search... have some big cores that don't change often, and then small cores for the new stuff that changes rapidly. -Yonik http://www.lucidimagination.com -- Furkan Kuru
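If you do try the distributed-search route Yonik describes, the query side is just a shards parameter listing the cores. A rough SolrJ sketch, with host names, core names and the "text" field as placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class DistributedFacetQuery {
        public static void main(String[] args) throws Exception {
            // Send the request to any core; it aggregates results from the listed shards.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/recent");
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("created_at:[NOW-1HOUR TO NOW]");
            q.setRows(0);
            q.setFacet(true);
            q.addFacetField("text"); // placeholder full-text field
            // One big, rarely-changing core plus a small core for the fresh documents.
            q.set("shards", "bighost:8983/solr/archive,localhost:8983/solr/recent");
            System.out.println(server.query(q).getFacetField("text").getValues());
        }
    }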
Re: bulk commits
One very important thing I forgot to mention is that you will have to increase the JAVA heap size for larger data sets. Set JAVA_OPT to something acceptable. Adam On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote: That easy, huh? Heck, this gets better and better. BTW, how about escaping? The CSV escaping? It's configurable to allow for loading different CSV dialects. http://wiki.apache.org/solr/UpdateCSV By default it uses double quote encapsulation, like excel would. The bottom of the wiki page shows how to configure tab separators and backslash escaping like MySQL produces by default. -Yonik http://www.lucidimagination.com Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Adam Estrada estrada.adam.gro...@gmail.com To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org Sent: Thu, December 16, 2010 10:58:47 AM Subject: Re: bulk commits This is how I import a lot of data from a cvs file. There are close to 100k records in there. Note that you can either pre-define the column names using the fieldnames param like I did here *or* include header=true which will automatically pick up the column header if your file has it. curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 This seems to load everything in to some kind of temporary location before it's actually committed. If something goes wrong there is a rollback feature that will undo anything that happened before the commit. As far as batching a bunch of files, I copied and pasted the following in to Cygwin and it worked just fine. 
curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xag.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xah.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl
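The same batch can also be scripted instead of pasting each curl line. Here is a sketch in plain Java that posts each CSV chunk and only commits with the last one; the chunk directory and header=true are assumptions (use fieldnames as in the curl lines above if your chunks have no header row), and stream.file requires remote streaming to be enabled in solrconfig.xml:

    import java.io.File;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CsvBatchLoader {
        public static void main(String[] args) throws Exception {
            String base = "http://localhost:8983/solr/update/csv"
                + "?separator=%2C&header=true&overwrite=true"
                + "&stream.contentType=text/plain;charset=utf-8";
            File[] chunks = new File("C:/tmp/csv-chunks").listFiles();
            for (int i = 0; i < chunks.length; i++) {
                boolean last = (i == chunks.length - 1);
                // Commit only with the final chunk instead of after every file.
                String url = base + "&commit=" + last
                    + "&stream.file=" + chunks[i].getAbsolutePath();
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                System.out.println(chunks[i].getName() + " -> HTTP " + conn.getResponseCode());
                conn.disconnect();
            }
        }
    }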
Re: facet.pivot for date fields
i guess one last call for help .. i am assuming for people who wrote or have used the pivot faceting .. this should be a yes no question .. are date fields supported ? On Wed, Dec 15, 2010 at 12:58 PM, Adeel Qureshi adeelmahm...@gmail.comwrote: Thanks Pankaj - that was useful to know. I havent used the query stuff before for facets .. so that was good to know .. but the problem is still there because I want the hierarchical counts which is exactly what facet.pivot does .. so e.g. i want to count for fieldC within fieldB and even fieldB within fieldA .. that kind of stuff .. for string based fields .. facet.pivot does exactly that and does it very well .. but it doesnt seems to work for date ranges .. so in this case I want counts to be broken down by fieldA and fieldB and then fieldB counts for monthly ranges .. I understand that I might be able to use facet.query to construct several queries to get these counts .. e.g. *facet.query=fieldA:someValue AND fieldB:someValue AND fieldC:[NOW-1YEAR TO NOW]* .. but there could be thousand of possible combinations for fieldA and fieldB which will require as many facet.queries which I am assuming is not the way to go .. it might be confusing what I have explained above so the simple question still is if there is a way to get date range counts included in facet.pivot Adeel On Tue, Dec 14, 2010 at 10:53 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Adeel, You can make use of facet.query attribute to make the Faceting work across a range of dates. Here i am using the duration, just replace the field with a field date and Range values as the DATE in SOLR Format. so your query parameter will be like this ( you can pass multiple parameter of facet.query name) http//blasdsdfsd/q?=asdfasdfacet.query=itemduration:[0 To 49]facet.query=itemduration:[50 To 99]facet.query=itemduration:[100 To 149] Hope, it helps. / Pankaj Bhatt. On Wed, Dec 15, 2010 at 2:01 AM, Adeel Qureshi adeelmahm...@gmail.com wrote: It doesnt seems like pivot facetting works on dates .. I was just curious if thats how its supposed to be or I am doing something wrong .. if I include a datefield in the pivot list .. i simply dont get any facet results back for that datefield Thanks Adeel
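For what it's worth, the closest workaround I can see is to keep facet.pivot for the string fields and add the date buckets as facet.query ranges alongside it. That only gives overall date counts, not counts nested under each fieldA/fieldB pair, which is exactly the limitation discussed above. A sketch with SolrJ, assuming a build that already supports facet.pivot and using the placeholder field names from this thread:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PivotPlusDateRanges {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.setFacet(true);
            // Hierarchical counts for the string fields only.
            q.set("facet.pivot", "fieldA,fieldB");
            // Date buckets as plain facet queries, not tied to the pivot tree.
            q.addFacetQuery("fieldC:[NOW-1MONTH TO NOW]");
            q.addFacetQuery("fieldC:[NOW-1YEAR TO NOW-1MONTH]");
            QueryResponse rsp = server.query(q);
            System.out.println(rsp.getFacetQuery()); // the date bucket counts
        }
    }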
Re: bulk commits
Thanks Adam! Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Thu, 12/16/10, Adam Estrada estrada.a...@gmail.com wrote: From: Adam Estrada estrada.a...@gmail.com Subject: Re: bulk commits To: solr-user@lucene.apache.org Date: Thursday, December 16, 2010, 6:18 PM One very important thing I forgot to mention is that you will have to increase the JAVA heap size for larger data sets. Set JAVA_OPT to something acceptable. Adam On Thu, Dec 16, 2010 at 3:27 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote: That easy, huh? Heck, this gets better and better. BTW, how about escaping? The CSV escaping? It's configurable to allow for loading different CSV dialects. http://wiki.apache.org/solr/UpdateCSV By default it uses double quote encapsulation, like excel would. The bottom of the wiki page shows how to configure tab separators and backslash escaping like MySQL produces by default. -Yonik http://www.lucidimagination.com Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Adam Estrada estrada.adam.gro...@gmail.com To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org Sent: Thu, December 16, 2010 10:58:47 AM Subject: Re: bulk commits This is how I import a lot of data from a cvs file. There are close to 100k records in there. Note that you can either pre-define the column names using the fieldnames param like I did here *or* include header=true which will automatically pick up the column header if your file has it. curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 This seems to load everything in to some kind of temporary location before it's actually committed. If something goes wrong there is a rollback feature that will undo anything that happened before the commit. As far as batching a bunch of files, I copied and pasted the following in to Cygwin and it worked just fine. 
curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,catstream.file=C :\tmp\cities1000.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xab.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xac.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xad.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xae.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl http://localhost:8983/solr/update/csv?commit=trueseparator=%2Cfieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdatestream.file=C :\tmp\xaf.csvoverwrite=truestream.contentType=text/plain;charset=utf-8 curl
Got error when range query and highlight
Hello all, I got an error as follows when I do a range query search ([1 TO *]) on an numeric field and highlight is set on another text field. 2010/12/15 10:58:55 org.apache.solr.common.SolrException log Fatal: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:153) at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:144) at org.apache.lucene.search.MultiTermQuery$ScoringBooleanQueryRewrite.rewrite(MultiTermQuery.java:110) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:382) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:178) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:111) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:111) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at jp.co.spectrum.insight.hooserver.core.solrext.dispatcher.HooDispatchFilter.doFilter(Unknown Source) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at jp.co.spectrum.insight.hooserver.core.solrext.dispatcher.HooDispatchFilter.execute(Unknown Source) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489) at java.lang.Thread.run(Unknown Source) Could anyone give me any suggest? Ouyang https://sites.google.com/a/spectrum.co.jp/openinsight/
Re: Got error when range query and highlight
I got an error as follows when I do a range query search ([1 TO *]) on an numeric field and highlight is set on another text field. Are you using hl.highlightMultiTerm=true? Pasting your search URL can give more hints. Adding hl.requireFieldMatch=true should probably solve your problem.
Re: Got error when range query and highlight
Thank you for reply. Are you using hl.highlightMultiTerm=true? Pasting your search URL can give more hints. Yes, I used the hl.highlightMultiTerm=true , my search query is as follows : start=0rows=10facet.mincount=1facet.field=authornavfacet.field=contentsearchkeywordnavfacet.field=contentstypenavfacet.field=copyrightnavfacet.field=docdatetimenavfacet.field=downloadpathnavfacet.field=filenamenavfacet.field=folderchecksumnavfacet.field=folderpathnavfacet.field=groupidnavfacet.field=kcmeta%2Fbookmark%2Fcountnavfacet.field=kcmeta%2Fcomment%2Fcountnavfacet.field=kcmeta%2Fenterprisetag%2Fcountnavfacet.field=kcmeta%2Fenterprisetag%2Fvaluenavfacet.field=kcmeta%2Fusertag%2Fcountnavfacet.field=kcmeta%2Fusertag%2Fvaluenavfacet.field=kcmeta%2Fview%2Fcountnavfacet.field=kcmeta%2Fvote%2Fcountnavfacet.field=lastmodifiernavfacet.field=mimetypenavfacet.field=orgidnavfacet.field=originalcontentstypenavfacet.field=processingtimenavfacet.field=roleidnavfacet.field=sizenavfacet.field=sourcenavfacet.field=titlenavfacet.field=useridnavfacet=truehl=truehl.fl=bodyhl.fl=titlehl.simple.pre=%3Cspan+class%3D%22highlight-solr%22%3Ehl.simple.post=%3C%2Fspan%3Eq=%28%28kcmeta%2Fview%2Fcount%3A%5B+1+TO+*+%5D+AND+contentstype%3AA9000B0001*%29+AND+%28issecure%3A%220%22+OR+userid%3A%22c6305dc4%5C-cbba%5C-bf48%5C-97d5%5C-dcfe6f2430ef%22%29%29facet.sort=counthl.highlightMultiTerm=true Adding hl.requireFieldMatch=true should probably solve your problem. Yes, adding hl.requireFieldMatch=true can solve my problem, but in my solution , I have a content field indexing all fields' contents to support full text search, but I also have another 2 fields title and body which support highlight, when I do search on content, I expect the title and body can be high-lighted. So using the hl.requireFieldMatch=true may be not work. Ouyang https://sites.google.com/a/spectrum.co.jp/openinsight/
Re: Got error when range query and highlight
Adding hl.requireFieldMatch=true should probably solve your problem. Yes, adding hl.requireFieldMatch=true can solve my problem, but in my solution, I have a content field indexing all fields' contents to support full text search, and I also have another 2 fields, title and body, which support highlighting. When I do a search on content, I expect the title and body to be high-lighted. So using hl.requireFieldMatch=true may not work. So you can increase the number of max boolean clauses in solrconfig.xml. Default is 1024. Or you can use hl.highlightMultiTerm=false. By the way I couldn't see a full-text query in your URL. It is better to move your non full-text queries into filter queries. This can solve your problem. For example fq=kcmeta/view/count:[ 1 TO * ]&fq=contentstype:A9000B0001*&fq=(issecure:0 OR userid:c6305dc4\-cbba\-bf48\-97d5\-dcfe6f2430ef) You can define multiple fq's. And you can benefit from caching. http://wiki.apache.org/solr/CommonQueryParameters#fq http://wiki.apache.org/solr/FilterQueryGuidance
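A rough SolrJ version of that advice, keeping only the real full-text terms in q and pushing the range and wildcard clauses into filter queries so the highlighter never has to rewrite them into thousands of boolean clauses. The field names are taken from this thread; the query text and server URL are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class HighlightWithFilters {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("content:(some search words)"); // placeholder terms
            // Range and wildcard clauses become filter queries, so highlighting
            // only sees the plain term query and TooManyClauses is avoided.
            q.addFilterQuery("kcmeta/view/count:[1 TO *]");
            q.addFilterQuery("contentstype:A9000B0001*");
            q.setHighlight(true);
            q.addHighlightField("title");
            q.addHighlightField("body");
            q.setHighlightSimplePre("<span class=\"highlight-solr\">");
            q.setHighlightSimplePost("</span>");
            System.out.println(server.query(q).getHighlighting());
        }
    }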
Solr (and mabye Java?) version numbering systems
I've inferred from a bunch of posts that Solr 1.4 is actually the upcoming 4.x release? And the numbering systems on other Java products don't seem to match what's really out there, i.e. Eclipse and Sun Java. So what IS the Solr versioning number system? Can anyone give a (maybe possible) chronological list? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
A schema inside a Solr Schema (Schema in a can)
Is it possible to put name value pairs of any type in a native Solr Index field type? Like JSON/XML/YML? The reason that I ask, since you asked, is I want my main index schema to be a base object, and another multivalue column to be the attributes of base object inherited descendants. Is there any other way to do this? What are the limitations in searching and indexing documents with multivalue fields? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
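One common way to get something like this is to keep the base-object fields fixed and push the per-subclass attributes into dynamic fields; multiValued fields then just hold repeated flat values. Below is a sketch with SolrJ, where the attr_*_s / attr_*_i dynamicField patterns and the field names are assumptions about the schema, not something Solr provides out of the box:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BaseObjectWithAttributes {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "base-42");
            doc.addField("type", "SomeDescendant");
            // Subclass attributes as dynamic fields (assumes attr_*_s / attr_*_i
            // dynamicField declarations in schema.xml); each stays individually queryable.
            doc.addField("attr_color_s", "red");
            doc.addField("attr_weight_i", 12);
            // A multiValued field holds repeated values, but they are flat: you can ask
            // "does any value match", not "match this name AND that value in the same pair".
            doc.addField("tags", "inherited");
            doc.addField("tags", "descendant");
            server.add(doc);
            server.commit();
        }
    }

Storing a raw JSON/XML blob in a stored-only field works for retrieval, but the pairs inside it are not individually searchable unless you also index them into fields like the above.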
Re: Best practice for Delta every 2 Minutes.
I think it will not because default configuration can only have 2 newSearcher threads but the delay will be more and more long. The newer newSearcher will wait these 2 ealier one to finish. 2010/12/1 Jonathan Rochkind rochk...@jhu.edu: If your index warmings take longer than two minutes, but you're doing a commit every two minutes -- you're going to run into trouble with overlapping index preperations, eventually leading to an OOM. Could this be it? On 11/30/2010 11:36 AM, Erick Erickson wrote: I don't know, you'll have to debug it to see if it's the thing that takes so long. Solr should be able to handle 1,200 updates in a very short time unless there's something else going on, like you're committing after every update or something. This may help you track down performance with DIH http://wiki.apache.org/solr/DataImportHandler#interactive http://wiki.apache.org/solr/DataImportHandler#interactiveBest Erick On Tue, Nov 30, 2010 at 9:01 AM, stockiist...@shopgate.com wrote: how do you think is the deltaQuery better ? XD -- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html Sent from the Solr - User mailing list archive at Nabble.com.
Testing Solr
Hi All, I built solr successfully and I am thinking of testing it with nearly 300 pdf files, 300 docs, 300 excel files, and so on, with about 300 files of each type. Is there any dummy data available for testing solr? Otherwise I need to download each and every file individually. Another question: are there any benchmarks of solr? Regards, satya
Re: Best practice for Delta every 2 Minutes.
We now face the same situation and want to implement it like this: we add new documents to a RAMDirectory and search two indices, the index on disk and the RAM index. Regularly (e.g. every hour) we flush the RAMDirectory to disk and make a new segment, to prevent errors. Before adding to the RAMDirectory, we write the document into a log file, and after flushing we delete the corresponding lines from the log file. If the program crashes, we will redo the log and add the documents back into the RAMDirectory. Has anyone done similar work? 2010/12/1 Li Li fancye...@gmail.com: you may implement your own MergePolicy to keep one large index and merge all other small ones, or simply set merge factor to 2 and keep the largest index from being merged by setting maxMergeDocs less than the number of docs in the largest one. So there is one large index and a small one. when adding a few docs, they will be merged into the small one. and you can, e.g. weekly optimize the index and merge all indices into one index. 2010/11/30 stockii st...@shopgate.com: Hello. index is about 28 Million documents large. When i start a delta-import it looks at modified. but delta import takes too long. solr needs over an hour for the delta. thats my query. all sessions from the last hour should be updated and all changed. i think its normal that solr needs a long time for the queries. how can i optimize this ? deltaQuery=SELECT id FROM sessions WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 10 HOUR ) AND NOW() OR modified BETWEEN '${dataimporter.last_index_time}' AND DATE_ADD( NOW(), INTERVAL - 1 HOUR ) -- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992714.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Best practice for Delta every 2 Minutes.
BTW, what is a Delta (in this context, not an equipment line or a rocket, please :-) Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Thu, 12/16/10, Li Li fancye...@gmail.com wrote: From: Li Li fancye...@gmail.com Subject: Re: Best practice for Delta every 2 Minutes. To: solr-user@lucene.apache.org Date: Thursday, December 16, 2010, 10:54 PM I think it will not because default configuration can only have 2 newSearcher threads but the delay will be more and more long. The newer newSearcher will wait these 2 ealier one to finish. 2010/12/1 Jonathan Rochkind rochk...@jhu.edu: If your index warmings take longer than two minutes, but you're doing a commit every two minutes -- you're going to run into trouble with overlapping index preperations, eventually leading to an OOM. Could this be it? On 11/30/2010 11:36 AM, Erick Erickson wrote: I don't know, you'll have to debug it to see if it's the thing that takes so long. Solr should be able to handle 1,200 updates in a very short time unless there's something else going on, like you're committing after every update or something. This may help you track down performance with DIH http://wiki.apache.org/solr/DataImportHandler#interactive http://wiki.apache.org/solr/DataImportHandler#interactiveBest Erick On Tue, Nov 30, 2010 at 9:01 AM, stockiist...@shopgate.com wrote: how do you think is the deltaQuery better ? XD -- View this message in context: http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Testing Solr
There are websites with data sets out there. 'Data sets' may not be the right search terms, but it's something like that. Exactly what you want, I couldn't guess otherwise? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Thu, 12/16/10, satya swaroop satya.yada...@gmail.com wrote: From: satya swaroop satya.yada...@gmail.com Subject: Testing Solr To: solr-user@lucene.apache.org Date: Thursday, December 16, 2010, 10:55 PM Hi All, I built solr successfully and i am thinking to test it with nearly 300 pdf files, 300 docs, 300 excel files,...and so on of each type with 300 files nearly Is there any dummy data available to test for solr,Otherwise i need to download each and every file individually..?? Another question is there any Benchmarks of solr...?? Regards, satya