RE: Solr replication
Hi Bill,

I have some questions regarding the SOLR collection distribution.

1) Is it possible to add index operations on the slave server using SOLR collection distribution and still have the master server updated with these changes?

2) I have a requirement of having more than one Solr instance (with a corresponding data directory for each Solr core). Is it possible to maintain different Solr cores and still achieve SOLR collection distribution for all of these cores independently? If yes, then how?

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 14, 2008 9:40 PM
To: [EMAIL PROTECTED]
Subject: Re: Solr replication

Yes, you need the same changes in scripts.conf on the slave server, but you don't need the post-commit hook enabled on the slave server. The post-commit hook is used to create snapshots: you will see a new snapshot in the data directory every time you do a commit on the master server. There is no need to create snapshots on the slave server, as the slave copies the snapshots from the master.

The scripts are designed to run under Unix/Linux. They use symbolic links and Unix/Linux commands like scp, ssh, rsync, and cp. I don't know much about Windows, so I don't know for sure whether all the Unix/Linux facilities used by the scripts are available there.

Bill

On 1/14/08, Dilip.TS [EMAIL PROTECTED] wrote:

Hi Bill,

I'm trying to use the Solr collection distribution
and have made the following changes:

1) Changes made on the master server on Linux, in the scripts.conf file:

user=
solr_hostname=localhost
solr_port=8983
rsyncd_port=18983
data_dir=/usr/solr/data/data_tenantID_1
webapp_name=solr
master_host=192.168.168.50
master_data_dir=/usr/solr/data/data_tenantID_1
master_status_dir=/usr/solr/logs

2) Enabled the postCommit hook in solrconfig.xml:

<!-- A postCommit event is fired after every commit or optimize command -->
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">/usr/solr/bin/snapshooter</str>
  <str name="dir">/usr/solr/bin</str>
  <bool name="wait">true</bool>
  <!--<arr name="args"><str>-u jetty-6.1.6</str><str>-d /opt/solr/data</str></arr>-->
  <arr name="env" />
</listener>

I ran the embedded Solr folder, added a document to it, and did a search for a word on the same server. I saw the following in the console:

INFO: query parser default operator is OR
Jan 14, 2008 3:37:38 PM org.apache.solr.schema.IndexSchema readSchema
INFO: unique key field: id
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore init
INFO: Opening new SolrCore at //usr//solr/, dataDir=//usr//solr//data//data_tenantID_1
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]firstSearcher]
Jan 14, 2008 3:37:38 PM org.apache.solr.core.SolrCore parseListener
INFO: Searching for listeners: //[EMAIL PROTECTED]newSearcher]
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created xslt: org.apache.solr.request.XSLTResponseWriter
Jan 14, 2008 3:37:39 PM org.apache.solr.request.XSLTResponseWriter init
INFO: xsltCacheLifetimeSeconds=5
Jan 14, 2008 3:37:39 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created standard: org.apache.solr.handler.StandardRequestHandler
. . . .
INFO: Opening [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.core.SolrCore registerSearcher
INFO: Registered new searcher [EMAIL PROTECTED] main
Jan 14, 2008 3:37:39 PM org.apache.solr.update.UpdateHandler parseEventListeners
INFO: added SolrEventListener for postCommit: org.apache.solr.core.RunExecutableListener{exe=/usr/solr/bin/snapshooter,dir=/usr/solr/bin,wait=true,env=[]}
Jan 14, 2008 3:37:39 PM org.apache.solr.update.DirectUpdateHandler2$CommitTracker init
INFO: AutoCommit: disabled

In the console above I see the postCommit RunExecutableListener command being called after a commit. This is the scenario for an add/search done on the same master server on Linux.

1) I would like to know whether we require similar entries in scripts.conf, and the postCommit hook enabled in solrconfig.xml, on the slave server too. If yes, should those entries be identical to the master's, or different?

2) Also, can we have a Linux machine acting as the master server while the slave runs on a Windows machine?

Thanks in advance.

Regards,
Dilip

-----Original Message-----
From: Bill Au [mailto:[EMAIL PROTECTED]]
Sent: Saturday, December 15, 2007 1:08 AM
To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
Problem with dismax handler when searching Solr along with field
When I search with a plain query, for example http://localhost:8983/solr/select/?q=category&qt=dismax, it gives results. But when I want to search on the basis of a field name, like http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax, it does not give results, whereas http://localhost:8983/solr/select/?q=maincategory:Cars returns results for cars from the field maincategory.
--
View this message in context: http://www.nabble.com/Problem-with-dismax-handler-when-searching-Solr-along-with-field-tp14878239p14878239.html
Sent from the Solr - User mailing list archive at Nabble.com.
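As an aside, the URLs above only work if each parameter is separated with `&`; a `&` lost in transit turns `q=category&qt=dismax` into a single mangled `q` value. A minimal sketch of building such a URL safely (the localhost base URL is just the example from the post, not a requirement):

```python
from urllib.parse import urlencode

def solr_select_url(base, **params):
    """Build a Solr select URL with &-separated, URL-encoded parameters."""
    return base + "?" + urlencode(params)

# Plain dismax query: parameters are joined with '&' and encoded for us.
url = solr_select_url("http://localhost:8983/solr/select/",
                      q="category", qt="dismax")
```

Note that special characters such as the `:` in `maincategory:Cars` get percent-encoded by `urlencode`, which keeps the query string unambiguous.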
Indexing two sets of details
Hi,

In the web application we are developing we have two sets of details: personal details and resume details. We allow 5 different resumes per user, but we want the personal details to remain the same across all 5 resumes. The problem is that when the personal details change, we have to update all 5 resumes.

I was thinking that if we index the personal-details fields separately, we only have to change/update those fields. But then there is the problem of searching for users using fields from both the personal details and the resume: I would have to combine both searches manually, and what if one search gives more results than the other?

I would really appreciate any suggestion on how to tackle this problem.

Thanks,
--
Gavin Selvaratnam,
Project Leader

hSenid Mobile Solutions
Phone: +94-11-2446623/4
Fax: +94-11-2307579
Web: http://www.hSenidMobile.com

Make it happen

Disclaimer: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed. The content and opinions contained in this email are not necessarily those of hSenid Software International. If you have received this email in error please contact the sender.
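The "manually combine both searches" step the poster describes amounts to an intersection on a shared user key. A hedged sketch of that combine step (the `user_id` field name and the dict-shaped hits are hypothetical, standing in for whatever the two Solr result sets return):

```python
def combine_by_user(personal_hits, resume_hits):
    """Intersect two result sets on a shared user id, merging their fields.

    personal_hits / resume_hits: lists of dicts, one per matching document.
    Only users present in BOTH result sets survive, so a mismatch in result
    counts between the two searches is handled naturally.
    """
    personal = {h["user_id"]: h for h in personal_hits}
    combined = []
    for hit in resume_hits:
        p = personal.get(hit["user_id"])
        if p is not None:  # user matched both the personal and resume search
            combined.append({**p, **hit})
    return combined
```

The trade-off, as the poster suspects, is that paging and relevance ranking become awkward once combination happens client-side.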
Re: Solr in a distributed multi-machine high-performance environment
Look at http://issues.apache.org/jira/browse/SOLR-303. Please note that it is still work in progress, so you may not be able to use it immediately.

On Jan 16, 2008 10:53 AM, Srikant Jakilinki [EMAIL PROTECTED] wrote:

Hi All,

There is a requirement in our group of indexing and searching several million documents (TREC) in real time with millisecond responses. For the moment we prefer scale-out (throw more commodity machines) approaches rather than scale-up (faster disks, more RAM). This is in turn inspired by the "Scale-out vs. Scale-up" paper (mail me if you want a copy), in which it was shown that this kind of distribution scales better and is more resilient.

So, are there any resources available (wiki, tutorials, slides, README, etc.) that throw light on and guide newbies through running Solr in a multi-machine scenario? I have gone through the mailing lists and the site but could not really find any answers or hands-on material. An ad hoc guideline to get things working with 2 machines might be enough, but for the sake of thinking out loud and soliciting responses from the list, here are my questions:

1) Solr that has to handle a fairly large index which has to be split up over multiple disks (using multicore?) - Space is not a problem since we can use NFS, but that is not recommended as we would only exploit 1 processor
2) Solr that has to handle a large collective index which has to be split up over multiple machines - The index is ever increasing (TB scale) and dynamic, and all of it has to be searchable at any point
3) Solr that has to exploit multiple machines because we have plenty of them in a tightly coupled P2P scenario - Machines are not a problem, but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6)?
4) Solr that has to distribute load over several machines - The index(es) could be shared, though, say by using a distributed filesystem (Hadoop?)
In each of the above cases (we might use all of these strategies for various use cases) the application should use Solr as a strict backend and named service (IP or host:port) so that we can expose the application (and the service) to the web or an intranet. Machine failures should be tolerated too. Also, does Solr manage load balancing out of the box if it is configured to work with multiple machines?

Maybe it is superfluous, but are Solr and/or Nutch the only ways to use Lucene in a multi-machine environment? Or is there some hidden document/project somewhere that makes it possible by exposing a regular Lucene process over the network using RMI or something? It is my understanding (I could be wrong) that Nutch and, to some extent, Solr do not perform well when there is a lot of indexing activity in parallel with search. Batch processing is also an option, and perhaps we can use Nutch/Solr there. Even so, we need multi-machine directions.

I am sure that multi-machine setups make possible a lot of other approaches which might solve the goal better and that others have practical experience with, so any advice and tips are also very welcome. We intend to document things and do some benchmarking along the way in the open spirit. Really sorry for the length, but I hope some answers are forthcoming.

Cheers,
Srikant

--
Regards,
Shalin Shekhar Mangar.
Re: Solr replication
My answers inline...

On Jan 16, 2008 3:51 AM, Dilip.TS [EMAIL PROTECTED] wrote:

Hi Bill, I have some questions regarding the SOLR collection distribution. 1) Is it possible to add index operations on the slave server using SOLR collection distribution and still have the master server updated with these changes?

No. The replication process is one way only, from the master to the slave. The idea behind it is that the slave servers are for query only, and the number of slaves can be increased or decreased according to traffic load.

2) I have a requirement of having more than one Solr instance (with a corresponding data directory for each Solr core). Is it possible to maintain different Solr cores and still achieve SOLR collection distribution for all of these cores independently? If yes, then how?

Does each Solr instance have its own Solr home? If so, you can use replication within each instance by simply adjusting the parameters in scripts.conf for each instance. Even if they all share a single Solr home, the replication-related scripts all have command-line options to override the values set in scripts.conf:

http://wiki.apache.org/solr/SolrCollectionDistributionScripts

So you can invoke the scripts for each instance by setting the data directory on the command line.

Regards,
Dilip
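Concretely, Bill's per-instance suggestion amounts to giving each instance its own scripts.conf that differs mainly in the port and data directory. A sketch for a hypothetical second instance, reusing the keys from Dilip's file (ports, paths, and the tenantID_2 name are illustrative only):

```
# scripts.conf for a second instance/core (values illustrative)
user=
solr_hostname=localhost
solr_port=8984
rsyncd_port=18984
data_dir=/usr/solr/data/data_tenantID_2
webapp_name=solr
master_host=192.168.168.50
master_data_dir=/usr/solr/data/data_tenantID_2
master_status_dir=/usr/solr/logs
```

Alternatively, per the wiki page above, the same values can be passed as command-line options when invoking the scripts for each instance.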
Re: Indexing very large files.
All,

I just found a thread about this in the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. I have discovered the following: by using a text file of 3,443,464 bytes or less, I get no error.

At 3,443,465 bytes:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.<init>(String.java:208)
        at java.lang.StringBuilder.toString(StringBuilder.java:431)
        at org.junit.Assert.format(Assert.java:321)
        at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
        at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
        at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
        at java.lang.Throwable.toString(Throwable.java:344)
        at java.lang.String.valueOf(String.java:2615)
        at java.io.PrintWriter.print(PrintWriter.java:546)
        at java.io.PrintWriter.println(PrintWriter.java:683)
        at java.lang.Throwable.printStackTrace(Throwable.java:510)
        at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
        at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
        at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
        at junit.framework.TestResult.addError(TestResult.java:38)
        at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
        at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
        at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:37)
        at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:93)
        at org.junit.internal.runners.TestMethodRunner.addFailure(TestMethodRunner.java:104)
        at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:87)
        at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
        at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
        at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
        at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
        at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
        at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
        at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
        at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
        at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)

At 3,443,466 bytes (or more):

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
        at java.lang.StringBuilder.append(StringBuilder.java:120)
        at org.junit.Assert.format(Assert.java:321)
        at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
        at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
        at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
        at java.lang.Throwable.toString(Throwable.java:344)
        at java.lang.String.valueOf(String.java:2615)
        at java.io.PrintWriter.print(PrintWriter.java:546)
        at java.io.PrintWriter.println(PrintWriter.java:683)
        at java.lang.Throwable.printStackTrace(Throwable.java:510)
        at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
        at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
        at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
        at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
        at junit.framework.TestResult.addError(TestResult.java:38)
        at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
        at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
        at
Cache size and Heap size
Hello,

I have relatively large RAM (10 GB) on my server, which is running Solr. I increased the cache settings and started to see OutOfMemory exceptions, especially on faceted search. Does anybody have suggestions on how the cache settings relate to memory consumption? What are optimal settings, and how can they be calculated?

Thank you for any advice,
Gene
conceptual issues with solr
Hi here,

It seems that Lucene accepts any kind of XML document but Solr accepts only flat name/value pairs inside a document to be indexed. You'll find below what I'd like to do. Thanks for help of any kind!

Phil

I need to index products (hotels) which have a price per date, then search them by date or date range and price range. Is there a way to do that with Solr? At the moment I have a document for each hotel:

<add>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="category">4 stars</field>
    ...
  </doc>
</add>

I would need to add my date/price values like this, but it is forbidden in Solr indexing:

<date value="30/01/2008" price="200"/>
<date value="31/01/2008" price="150"/>

Otherwise I could define a default field (being an integer) and have as many fields as dates, like this:

<field name="30/01/2008">200</field>
<field name="31/01/2008">150</field>

Indexing would accept it, but I think I would not be able to search or sort by date. The only solution I have found so far is to create a document for each date/price:

<add>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="date">30/01/2008</field>
    <field name="price">200</field>
  </doc>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="date">31/01/2008</field>
    <field name="price">150</field>
  </doc>
</add>

Then I'll have many documents for one hotel, and in order to search by date range I would need more documents, like this:

<field name="date-range">28/01/2008 to 31/01/2008</field>
<field name="date-range">29/01/2008 to 31/01/2008</field>
<field name="date-range">30/01/2008 to 31/01/2008</field>

Since I need to index much other information about a hotel (address, telephone, amenities, etc.) I wouldn't like to duplicate too much information, and I think it would not be scalable to search first in a dates index and then in a hotels index to retrieve the hotel information. Any idea?
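The one-document-per-date/price layout described above can at least be generated mechanically, so the duplication is a storage cost rather than a maintenance cost. A minimal sketch (field names follow the poster's example; the helper itself is hypothetical):

```python
from xml.sax.saxutils import escape


def hotel_docs(hotel, prices_by_date):
    """Emit one Solr <doc> per (hotel, date) with that day's price,
    duplicating the shared hotel fields into each document.

    hotel: dict of shared fields (id, name, ...).
    prices_by_date: dict mapping a date string to its price.
    Values are escaped for XML element content (&, <, >).
    """
    docs = []
    for date, price in prices_by_date.items():
        fields = {**hotel, "date": date, "price": price}
        body = "".join(
            '<field name="%s">%s</field>' % (escape(str(k)), escape(str(v)))
            for k, v in fields.items()
        )
        docs.append("<doc>%s</doc>" % body)
    return "<add>%s</add>" % "".join(docs)


xml = hotel_docs(
    {"id": "1", "name": "Hotel Opera", "category": "4 stars"},
    {"30/01/2008": 200, "31/01/2008": 150},
)
```

Note this leaves the poster's real concerns (date-range search, and whether `id` can repeat across documents) open, since those depend on the schema.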
Re: Indexing very large files.
I don't think this is a StringBuilder limitation, but rather that your Java JVM doesn't start with enough memory, i.e. -Xmx. In raw Lucene, I've indexed 240M files.

Best
Erick

On Jan 16, 2008 10:12 AM, David Thibault [EMAIL PROTECTED] wrote:

All, I just found a thread about this in the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. By using a text file of 3,443,464 bytes or less, I get no error. At 3,443,465 bytes: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space ...
Re: Indexing very large files.
P.S. Lucene by default limits the maximum field length to 10K tokens, so you have to bump that for large files.

Erick

On Jan 16, 2008 11:04 AM, Erick Erickson [EMAIL PROTECTED] wrote:

I don't think this is a StringBuilder limitation, but rather that your Java JVM doesn't start with enough memory, i.e. -Xmx. In raw Lucene, I've indexed 240M files.

Best
Erick

On Jan 16, 2008 10:12 AM, David Thibault [EMAIL PROTECTED] wrote:

All, I just found a thread about this in the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. By using a text file of 3,443,464 bytes or less, I get no error. At 3,443,465 bytes: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space ...
Re: Indexing very large files.
I think your PS might do the trick. My JVM doesn't seem to be the issue, because I've set it to -Xmx512m -Xms256m. I will track down the Solr config parameter you mentioned and try that. Thanks for the quick response! Dave

On 1/16/08, Erick Erickson [EMAIL PROTECTED] wrote:
P.S. Lucene by default limits the maximum field length to 10K tokens, so you have to bump that for large files. Erick

On Jan 16, 2008 11:04 AM, Erick Erickson [EMAIL PROTECTED] wrote:
I don't think this is a StringBuilder limitation, but rather that your JVM doesn't start with enough memory, i.e. -Xmx. In raw Lucene, I've indexed 240M files. Best, Erick

On Jan 16, 2008 10:12 AM, David Thibault [EMAIL PROTECTED] wrote:
All, I just found a thread about this in the mailing list archives because I'm troubleshooting the same problem. The kicker is that it doesn't take such large files to kill the StringBuilder. I have discovered the following: using a text file of 3,443,464 bytes or less, I get no error.

At 3,443,465 bytes:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.String.<init>(String.java:208)
    at java.lang.StringBuilder.toString(StringBuilder.java:431)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at java.lang.String.valueOf(String.java:2615)
    at java.io.PrintWriter.print(PrintWriter.java:546)
    at java.io.PrintWriter.println(PrintWriter.java:683)
    at java.lang.Throwable.printStackTrace(Throwable.java:510)
    at org.apache.tools.ant.util.StringUtils.getStackTrace(StringUtils.java:96)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.getFilteredTrace(JUnitTestRunner.java:856)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.formatError(XMLJUnitResultFormatter.java:280)
    at org.apache.tools.ant.taskdefs.optional.junit.XMLJUnitResultFormatter.addError(XMLJUnitResultFormatter.java:255)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$4.addError(JUnitTestRunner.java:988)
    at junit.framework.TestResult.addError(TestResult.java:38)
    at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:96)
    at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:37)
    at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:93)
    at org.junit.internal.runners.TestMethodRunner.addFailure(TestMethodRunner.java:104)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:87)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)

At 3,443,466 bytes (or more):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
    at java.lang.StringBuilder.append(StringBuilder.java:120)
    at org.junit.Assert.format(Assert.java:321)
    at org.junit.ComparisonFailure$ComparisonCompactor.compact(ComparisonFailure.java:80)
    at org.junit.ComparisonFailure.getMessage(ComparisonFailure.java:37)
    at java.lang.Throwable.getLocalizedMessage(Throwable.java:267)
    at java.lang.Throwable.toString(Throwable.java:344)
    at ...
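For what it's worth, the usual way to avoid holding a whole file in one String/StringBuilder is to process it in fixed-size chunks, so memory use stays at the buffer size regardless of file size. A minimal sketch (the helper names are hypothetical, not Solr or Lucene API):

```java
import java.io.*;

public class ChunkedReader {
    // Read in fixed-size chunks; memory stays bounded no matter how big
    // the input is, unlike reading the whole file into one String.
    static long countChars(Reader in) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n; // hand each chunk to the consumer here
        }
        return total;
    }

    // Convenience wrapper so callers don't deal with the checked exception.
    static long countString(String s) {
        try {
            return countChars(new StringReader(s));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // demo on an in-memory reader; real use would wrap a FileReader
        System.out.println(countString("a large file's worth of text")); // prints 28
    }
}
```

Whether this helps depends on where the String is being built; if a library insists on one giant String, only more heap will do.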
Re: Indexing very large files.
I tried raising maxFieldLength under mainIndex as well as indexDefaults, and still no luck. I'm trying to upload a text file that is about 8 MB in size. I think the following stack trace still points to some sort of overflowed String issue. Thoughts?

Solr returned an error: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:107)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:977)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:270)
    at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)
java.io.IOException: Server returned HTTP response code: 500 for URL: http://solr:8080/solr/update
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:134)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(TestClassMethodsRunner.java:35)
    at org.junit.internal.runners.TestClassRunner$1.runUnprotected(TestClassRunner.java:42)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestClassRunner.run(TestClassRunner.java:52)
    at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main
Re: Indexing very large files.
The PS really wasn't related to your OOM, and raising that shouldn't have changed the behavior. All that happens if you go beyond 10,000 tokens is that the rest gets thrown away. But we're beyond my real knowledge level about Solr, so I'll defer to others. A very quick-n-dirty test of whether you're actually allocating the memory to the process you *think* you are would be to bump it ridiculously higher. I'm completely unclear about what process gets the increased memory relative to the server. [EMAIL PROTECTED]

On Jan 16, 2008 11:33 AM, David Thibault [EMAIL PROTECTED] wrote:
I tried raising maxFieldLength under mainIndex as well as indexDefaults, and still no luck. I'm trying to upload a text file that is about 8 MB in size. I think the following stack trace still points to some sort of overflowed String issue. Thoughts?

Solr returned an error: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:107)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:977)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:270)
    at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)
java.io.IOException: Server returned HTTP response code: 500 for URL: http://solr:8080/solr/update
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:134)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
    at org.junit.internal.runners.TestClassMethodsRunner.invokeTestMethod(TestClassMethodsRunner.java:71)
    at org.junit.internal.runners.TestClassMethodsRunner.run(
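Erick's quick-n-dirty check can be made explicit from inside the JVM: Runtime.maxMemory() (standard Java API) reports the heap ceiling the process actually got, so you can confirm whether your -Xmx setting reached the right process. A small sketch:

```java
// Confirm that the -Xmx setting actually reached this JVM:
// Runtime.maxMemory() reports the heap ceiling in bytes.
public class HeapCheck {
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // run with e.g. `java -Xmx512m HeapCheck` and expect roughly 512 here
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```

Dropping this into the server webapp (or a JSP) shows the servlet container's heap, which may differ from the heap of a client-side test runner.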
Re: Indexing very large files.
This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the java command. I set my heap to 200 Meg and do it this way with Tomcat 6:

JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh

wunder

On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote:
java.lang.OutOfMemoryError: Java heap space
Re: Indexing very large files.
Nice signature... =)

On 1/16/08, Erick Erickson [EMAIL PROTECTED] wrote:
The PS really wasn't related to your OOM, and raising that shouldn't have changed the behavior. All that happens if you go beyond 10,000 tokens is that the rest gets thrown away. But we're beyond my real knowledge level about Solr, so I'll defer to others. A very quick-n-dirty test of whether you're actually allocating the memory to the process you *think* you are would be to bump it ridiculously higher. I'm completely unclear about what process gets the increased memory relative to the server. [EMAIL PROTECTED]

On Jan 16, 2008 11:33 AM, David Thibault [EMAIL PROTECTED] wrote:
I tried raising maxFieldLength under mainIndex as well as indexDefaults, and still no luck. I'm trying to upload a text file that is about 8 MB in size. I think the following stack trace still points to some sort of overflowed String issue. Thoughts?

Solr returned an error: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.lucene.index.FieldsWriter.addDocument(FieldsWriter.java:98)
    at org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:107)
    at org.apache.lucene.index.IndexWriter.buildSingleDocSegment(IndexWriter.java:977)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:965)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:947)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:270)
    at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:166)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
    at java.lang.Thread.run(Thread.java:619)
java.io.IOException: Server returned HTTP response code: 500 for URL: http://solr:8080/solr/update
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1170)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:134)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.junit.internal.runners.TestMethodRunner.executeMethodBody(TestMethodRunner.java:99)
    at org.junit.internal.runners.TestMethodRunner.runUnprotected(TestMethodRunner.java:81)
    at org.junit.internal.runners.BeforeAndAfterRunner.runProtected(BeforeAndAfterRunner.java:34)
    at org.junit.internal.runners.TestMethodRunner.runMethod(TestMethodRunner.java:75)
    at org.junit.internal.runners.TestMethodRunner.run(TestMethodRunner.java:45)
Re: Indexing very large files.
Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat), but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8 MB file, but it's crashing while uploading a 32 MB file now. I just bumped Tomcat to 1024 MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there. Best, Dave

On 1/16/08, Walter Underwood [EMAIL PROTECTED] wrote:
This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the java command. I set my heap to 200 Meg and do it this way with Tomcat 6: JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh wunder On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote: java.lang.OutOfMemoryError: Java heap space
Re: Solr in a distributed multi-machine high-performance environment
Thanks for that, Shalin. Looks like I have to wait and keep track of developments. Forgetting about indexes that cannot fit on a single machine (distributed search), any links on having Solr run in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later), and I need a 2-machine setup for that. Thanks, Srikant

Shalin Shekhar Mangar wrote:
Look at http://issues.apache.org/jira/browse/SOLR-303 Please note that it is still work in progress, so you may not be able to use it immediately. -- Find out how you can get spam free email. http://www.bluebottle.com/tag/3
Re: Cache size and Heap size
I'm using Tomcat. I set Max Size = 5 GB and I checked in a profiler that it actually uses the whole memory. There is no significant memory use by other applications. The whole change was that I increased the size of the cache to:

LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED])

I know this is a lot and I'm going to decrease it; I was just experimenting. But I need some guidelines for how to calculate the right size of the cache. Thank you, Gene

- Original Message From: Daniel Alheiros [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 10:48:50 AM Subject: Re: Cache size and Heap size

Hi Gene. Have you set your app server / servlet container to allocate some of this memory to be used? You can define the maximum and minimum heap size by adding/replacing some parameters on the app server initialization: -Xmx1536m -Xms1536m. Which app server / servlet container are you using? Regards, Daniel Alheiros

On 16/1/08 15:23, Evgeniy Strokin [EMAIL PROTECTED] wrote:
Hello... I have relatively large RAM (10 GB) on my server which is running Solr. I increased the cache settings and started to see OutOfMemory exceptions, especially on facet search. Does anybody have suggestions on how cache settings relate to memory consumption? What are optimal settings? How can they be calculated? Thank you for any advice, Gene
Re: Solr in a distributed multi-machine high-performance environment
Solr provides a few scripts to create a multi-machine deployment. One box is set up as the master (used primarily for writes) and the others as slaves. Slaves are added as per application requirements. The index is transferred using rsync. Look at http://wiki.apache.org/solr/CollectionDistribution for details. You can put the slaves behind a load balancer or share the slaves among your front-end servers to measure performance.

On Jan 17, 2008 12:39 AM, Srikant Jakilinki [EMAIL PROTECTED] wrote:
Thanks for that, Shalin. Looks like I have to wait and keep track of developments. Forgetting about indexes that cannot fit on a single machine (distributed search), any links on having Solr run in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later), and I need a 2-machine setup for that. Thanks, Srikant

Shalin Shekhar Mangar wrote:
Look at http://issues.apache.org/jira/browse/SOLR-303 Please note that it is still work in progress, so you may not be able to use it immediately. -- Find out how you can get spam free email. http://www.bluebottle.com/tag/3

-- Regards, Shalin Shekhar Mangar.
Re: Solr in a distributed multi-machine high-performance environment
On 16-Jan-08, at 11:09 AM, Srikant Jakilinki wrote:
Thanks for that, Shalin. Looks like I have to wait and keep track of developments. Forgetting about indexes that cannot fit on a single machine (distributed search), any links on having Solr run in a 2-machine environment? I want to measure how much improvement there will be in performance with the addition of machines for computation (space later), and I need a 2-machine setup for that.

If you are looking for automatic replication and load-balancing across multiple machines, Solr does not provide that. The typical strategy is as follows: index half the documents on one machine and half on the other, execute both queries simultaneously (using threads, for instance), and combine the results. You should observe a speed-up. -Mike
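The split-index strategy described above can be sketched as below. The per-shard "search" is a stand-in function here (a real setup would issue an HTTP query to each Solr instance), and the Hit type is hypothetical; the point is the parallel dispatch and score-ordered merge:

```java
import java.util.*;
import java.util.concurrent.*;

public class TwoShardSearch {
    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Stand-in for querying one machine; in practice this would be an HTTP
    // request to the Solr instance holding half the documents.
    static List<Hit> searchShard(List<Hit> shardIndex) {
        return new ArrayList<>(shardIndex);
    }

    // Query both shards simultaneously, then merge by descending score.
    static List<Hit> searchBoth(List<Hit> shardA, List<Hit> shardB, int topN) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<List<Hit>> fa = pool.submit(() -> searchShard(shardA));
            Future<List<Hit>> fb = pool.submit(() -> searchShard(shardB));
            List<Hit> all = new ArrayList<>(fa.get());
            all.addAll(fb.get());
            all.sort((x, y) -> Double.compare(y.score, x.score)); // best first
            return all.subList(0, Math.min(topN, all.size()));
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Hit> shard1 = Arrays.asList(new Hit("d1", 0.9), new Hit("d2", 0.4));
        List<Hit> shard2 = Arrays.asList(new Hit("d3", 0.7));
        for (Hit h : searchBoth(shard1, shard2, 2))
            System.out.println(h.id + " " + h.score);
    }
}
```

One caveat with merging by raw score: the two shards compute scores from their own term statistics, so the scores are only approximately comparable.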
Re: Solr in a distributed multi-machine high-performance environment
On 15-Jan-08, at 9:23 PM, Srikant Jakilinki wrote:
2) Solr that has to handle a large collective index which has to be split up across multiple machines - the index is ever increasing (TB scale) and dynamic, and all of it has to be searched at any point

This will require significant development on your part. Nutch may be able to provide more of what you need out of the box.

3) Solr that has to exploit multiple machines because we have plenty of them in a tightly coupled P2P scenario - machines are not a problem, but will they be if they are of varied configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE 1.1 to 1.6)?

Solr requires Java 1.5; Lucene requires Java 1.4. Also, there is certainly no point mixing PIIIs and modern CPUs: trying to achieve the appropriate balance between machines of such disparate capability will take much more effort than will be gained from using them. -Mike
Re: Cache size and Heap size
On 16-Jan-08, at 11:15 AM, [EMAIL PROTECTED] wrote:
I'm using Tomcat. I set Max Size = 5 GB and I checked in a profiler that it actually uses the whole memory. There is no significant memory use by other applications. The whole change was that I increased the size of the cache to:
LRU Cache(maxSize=1048576, initialSize=1048576, autowarmCount=524288, [EMAIL PROTECTED])

An autowarmCount that large relative to maxSize certainly doesn't make sense.

I know this is a lot and I'm going to decrease it; I was just experimenting, but I need some guidelines for how to calculate the right size of the cache.

Each filter that matches more than ~3000 documents will occupy maxDocs/8 bytes of memory. Certain kinds of faceting require one entry per unique value in a field. The best way to tune this is to monitor your cache hit/expunge statistics for the filter cache (on the Solr admin statistics screen). -Mike
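Mike's maxDocs/8 rule makes it easy to sanity-check a filter cache size before deploying it. The index size below is purely illustrative (not Gene's actual document count); the cache entry count is the maxSize from the post:

```java
public class FilterCacheEstimate {
    // Worst-case filter cache footprint: each cached filter over a large
    // index costs roughly maxDocs/8 bytes (one bit per document).
    static long bytesForCache(long maxDocs, long entries) {
        return (maxDocs / 8) * entries;
    }

    public static void main(String[] args) {
        long maxDocs = 10_000_000L;  // assumed index size, for illustration
        long entries = 1_048_576L;   // the LRU maxSize from the post
        long gb = bytesForCache(maxDocs, entries) / (1024L * 1024 * 1024);
        // a cache of a million filters over a 10M-doc index could, in the
        // worst case, need on the order of a terabyte -- far beyond any heap
        System.out.println("worst case: ~" + gb + " GB");
    }
}
```

In practice most cached filters are smaller (sets under ~3000 docs are stored more compactly), so this is an upper bound, but it shows why a maxSize in the millions is unreasonable.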
Re: Problem with dismax handler when searching Solr along with field
On 16-Jan-08, at 3:15 AM, farhanali wrote:
When I search with a query such as
http://localhost:8983/solr/select/?q=category&qt=dismax
it gives results, but when I want to search on the basis of a field name, like
http://localhost:8983/solr/select/?q=maincategory:Cars&qt=dismax
it does not give results. However,
http://localhost:8983/solr/select/?q=maincategory:Cars
returns results for cars from the field maincategory. Does anyone have an idea?

The dismax handler does not allow you to use Lucene query syntax. The qf parameter must be used to select the fields to query (alternatively, you can provide a lucene-style query in an fq filter). See the documentation here: http://wiki.apache.org/solr/DisMaxRequestHandler -Mike
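For instance, either of these request shapes should work with dismax (the field name maincategory is taken from the post; the URLs are illustrative, not tested against this schema):

```
# dismax: the fields to search come from qf, not from the query string
http://localhost:8983/solr/select/?q=Cars&qt=dismax&qf=maincategory

# or keep the user query in q and apply the lucene-syntax field
# restriction as a separate fq filter
http://localhost:8983/solr/select/?q=Cars&qt=dismax&fq=maincategory:Cars
```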
IOException: read past EOF during optimize phase
I am using the embedded Solr API for my indexing process. I created a brand-new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. No instances of Solr are currently reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call:

protected void optimizeProducts() throws IOException {
    UpdateHandler updateHandler = m_SolrCore.getUpdateHandler();
    CommitUpdateCommand commitCmd = new CommitUpdateCommand(true);
    commitCmd.optimize = true;
    updateHandler.commit(commitCmd);
    log.info("Optimized index");
}

During the optimize phase, I get the following stack trace:

java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
    at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107)
    at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93)
    at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508)
    at ...

There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Big number of conditions of the search
I see... but I really need to run it on Solr; we have already indexed everything. I don't really want to construct a query with 1K OR conditions and send it to Solr to parse first and run after. Maybe there is a way to go directly to Lucene or Solr and run such a query from Java, passing an array of IDs or something like that? Could anybody give me some advice on how to do this in a better way? Thank you, Gene

- Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, January 11, 2008 12:26:14 AM Subject: Re: Big number of conditions of the search

Evgeniy - this sounds like a problem best suited for an RDBMS, really. You can run such an OR query, but you'll have to manually increase the max number of clauses allowed (in one of the configs) and make sure the JVM has plenty of memory. But again, this is best done in an RDBMS with some count(*) and GROUP BY selects. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Evgeniy Strokin [EMAIL PROTECTED] To: Solr User solr-user@lucene.apache.org Sent: Thursday, January 10, 2008 4:39:44 PM Subject: Big number of conditions of the search

Hello, I don't know how to formulate this right, so I'll give an example: I have 20 million documents with unique IDs indexed. I have a list of IDs stored somewhere. I need to run a query which will take the documents with IDs from my list and give me some statistics. For example: my documents are addresses with unique IDs. I have a list which contains 10 thousand IDs of some addresses. I need to find how many addresses from my list are in NJ. Or another scenario: give me all the states my addresses are from and how many addresses are in each state (only addresses from my list). So I was thinking I could run a facet search on the State field, but my query would be like this: ID:123 OR ID:23987 OR ID:294343 ... 10K such OR conditions in a row, which is ridiculous and not even possible, I think. Could somebody suggest a solution for this? Thank you, Gene
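One client-side workaround, sketched below with hypothetical helper names: split the ID list into batches small enough to stay under the boolean-clause limit, run one query (with facets) per batch, and sum the per-state counts yourself. The batch queries would still need maxBooleanClauses to be at least the batch size:

```java
import java.util.*;

public class IdBatcher {
    // Turn a long ID list into several smaller OR queries. The query
    // strings here only show the shape; field name "ID" is from the post.
    static List<String> batchQueries(List<String> ids, int batchSize) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            queries.add("ID:(" + String.join(" OR ", batch) + ")");
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("123", "23987", "294343");
        for (String q : batchQueries(ids, 2)) System.out.println(q);
        // ID:(123 OR 23987)
        // ID:(294343)
    }
}
```

Each query would be sent with facet=true&facet.field=State (in this example), and the returned counts accumulated in a client-side map. It is still O(list size) work per request round, which is why Otis's RDBMS suggestion may scale better.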
Re: Indexing very large files.
OK, I have now bumped my Tomcat JVM up to 1024 MB min and 1500 MB max. For some reason Walter's suggestion helped me get past the 8 MB file upload to Solr, but it's still choking on a 32 MB file. Is there a way to set per-webapp JVM settings in Tomcat, or is the overall Tomcat JVM sufficient to set? I can't see anything in the Tomcat manager to suggest that there are smaller memory limitations for Solr or any other webapp (all the demo webapps that Tomcat comes with are still there right now). Here's the trace I get when I try to upload the 32 MB file:

java.lang.OutOfMemoryError: Java heap space
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
    at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java:61)
    at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
    at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:395)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe(SimplePostTool.java:167)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData(SimplePostTool.java:125)
    at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile(SimplePostTool.java:87)
    at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile(Uploader.java:97)
    at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile(UploaderTest.java:95)

Any more thoughts on possible causes? Best, Dave

On 1/16/08, David Thibault [EMAIL PROTECTED] wrote:
Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat), but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8 MB file, but it's crashing while uploading a 32 MB file now. I just bumped Tomcat to 1024 MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there. Best, Dave

On 1/16/08, Walter Underwood [EMAIL PROTECTED] wrote:
This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the java command. I set my heap to 200 Meg and do it this way with Tomcat 6: JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh wunder On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote: java.lang.OutOfMemoryError: Java heap space
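Note the top of that trace: the OOM is in ByteArrayOutputStream on the client side, which suggests the whole POST body is being buffered in memory before it is sent. On a real HttpURLConnection, calling setChunkedStreamingMode(8192) before getOutputStream() (a standard Java API) avoids that buffering; combined with a fixed-buffer copy loop, memory stays flat regardless of file size. A sketch of the copy loop (not the actual SimplePostTool code):

```java
import java.io.*;

public class StreamingPipe {
    // Copy input to output in fixed 8 KB chunks instead of accumulating
    // the whole body in memory first.
    static long pipe(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // each chunk goes straight to the wire
            total += n;
        }
        out.flush();
        return total;
    }

    // Convenience wrapper for in-memory demo/testing.
    static long pipeBytes(byte[] data) {
        try {
            return pipe(new ByteArrayInputStream(data), new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pipeBytes(new byte[100000])); // prints 100000
    }
}
```

With chunked streaming the client cannot automatically retry on redirects or auth challenges, which is the usual trade-off of not buffering the body.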
RE: Indexing very large files.
I think you should try isolating the problem. It may turn out that the problem isn't really to do with Solr, but file uploading. I'm no expert, but that's what I'd try out in such situation. Cheers, Timothy Wonil Lee http://timundergod.blogspot.com/ http://www.google.com/reader/shared/16849249410805339619 -Original Message- From: David Thibault [mailto:[EMAIL PROTECTED] Sent: Thursday, 17 January 2008 8:30 AM To: solr-user@lucene.apache.org Subject: Re: Indexing very large files. OK, I have now bumped my tomcat JVM up to 1024MB min and 1500MB max. For some reason Walter's suggestion helped me get past the 8MB file upload to Solr but it's still choking on a 32MB file. Is there a way to set per-webapp JVM settings in tomcat, or is the overall tomcat JVM sufficient to set? I can't see anything in the tomcat manager to suggest that there are smaller memory limitations for solr or any other webapp (all the demo webapps that tomcat comes with are still there right now). Here's the trace I get when I try to upload the 32MB file: java.lang.OutOfMemoryError: Java heap space at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java :95) at sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java :61) at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java :336) at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java :395) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:191) at com.itstrategypartners.sents.solrUpload.SimplePostTool.pipe( SimplePostTool.java:167) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postData( SimplePostTool.java:125) at com.itstrategypartners.sents.solrUpload.SimplePostTool.postFile( SimplePostTool.java:87) at com.itstrategypartners.sents.solrUpload.Uploader.uploadFile( Uploader.java:97) at com.itstrategypartners.sents.solrUpload.UploaderTest.uploadFile( UploaderTest.java:95) Any more thoughts on possible causes? 
Best, Dave

On 1/16/08, David Thibault [EMAIL PROTECTED] wrote: Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat) but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8MB file, but it's crashing while uploading a 32MB file now. I just bumped Tomcat to 1024MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there. Best, Dave

On 1/16/08, Walter Underwood [EMAIL PROTECTED] wrote: This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the java command. I set my heap to 600 Meg and do it this way with Tomcat 6: JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh wunder

On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote: java.lang.OutOfMemoryError: Java heap space
Re: IOException: read past EOF during optimize phase
Kevin, Don't have the answer to EOF but I'm wondering why the index is moving. You don't need to do that as far as Solr is concerned. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Kevin Osborn [EMAIL PROTECTED] To: Solr solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 3:07:23 PM Subject: IOException: read past EOF during optimize phase

I am using the embedded Solr API for my indexing process. I created a brand new index with my application without any problem. I then ran my indexer in incremental mode. This process copies the working index to a temporary Solr location, adds/updates any records, optimizes the index, and then copies it back to the working location. There are currently no instances of Solr reading this index. Also, I commit after every 10 rows. The schema.xml and solrconfig.xml files have not changed. Here is my function call.

protected void optimizeProducts() throws IOException {
    UpdateHandler updateHandler = m_SolrCore.getUpdateHandler();
    CommitUpdateCommand commitCmd = new CommitUpdateCommand(true);
    commitCmd.optimize = true;
    updateHandler.commit(commitCmd);
    log.info("Optimized index");
}

So, during the optimize phase, I get the following stack trace:

java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:89)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
        at org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107)
        at org.apache.lucene.store.IndexInput.readString(IndexInput.java:93)
        at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:211)
        at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:119)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:323)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:206)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1835)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1195)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:508)
        at ...

There are no exceptions or anything else that appears to be incorrect during the adds or commits. After this, the index files are still non-optimized. I know there is not a whole lot to go on here. Anything in particular that I should look at?
Re: Spell checker index rebuild
Do you trust the spellchecker 100%? (Not looking at its source now.) I'd peek at the index with Luke (Luke I trust :)) and see if that term is really there first. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Doug Steigerwald [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 2:56:35 PM Subject: Spell checker index rebuild

Having another weird spell checker index issue. Starting off from a clean index and spell check index, I'll index everything in example/exampledocs. On the first rebuild of the spellchecker index using the query below, the word 'blackjack' exists in the spellchecker index. Great, no problems. Rebuild it again and the word 'blackjack' does not exist any more.

http://localhost:8983/solr/core0/select?q=blackjack&qt=spellchecker&cmd=rebuild

Any ideas? This is with a Solr trunk build from yesterday. doug
Re: Indexing very large files.
From your stack trace, it looks like it's your client running out of memory, right? SimplePostTool was meant as a command-line replacement for curl to remove that dependency, not as a recommended way to talk to Solr. -Yonik

On Jan 16, 2008 4:29 PM, David Thibault [EMAIL PROTECTED] wrote: OK, I have now bumped my Tomcat JVM up to 1024MB min and 1500MB max. For some reason Walter's suggestion helped me get past the 8MB file upload to Solr, but it's still choking on a 32MB file. Is there a way to set per-webapp JVM settings in Tomcat, or is the overall Tomcat JVM setting sufficient? I can't see anything in the Tomcat manager to suggest that there are smaller memory limits for Solr or any other webapp. [client-side OutOfMemoryError stack trace snipped] Any more thoughts on possible causes? Best, Dave
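For what it's worth, the buffering Yonik describes is visible in the trace: ByteArrayOutputStream accumulates the entire request body before anything goes over the wire. A client that pipes the file to the connection in fixed-size chunks never holds more than one buffer in memory. A minimal sketch (the class and method names here are mine, not Solr's):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamingPipe {
    // Copy the stream in 8 KB chunks instead of buffering the whole
    // payload; memory use stays constant no matter how big the file is.
    public static long pipe(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

With HttpURLConnection, combining this with setChunkedStreamingMode (or a fixed Content-Length) keeps the HTTP layer from buffering the body as well.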
Re: IOException: read past EOF during optimize phase
Kevin, Perhaps you want to look at how Solr can be used in a master-slave setup. This will separate your indexing from searching. Don't have the URL, but it's on zee Wiki. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Kevin Osborn [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 5:25:34 PM Subject: Re: IOException: read past EOF during optimize phase

It is more of a file structure thing for our application. We build in one place and do our index syncing in a different place. I doubt it is relevant to this issue, but figured I would include this information anyway. [earlier quoted messages snipped]
Re: Indexing very large files.
David, I bet you can quickly identify the source using YourKit or another Java profiler; the jmap command-line tool might also give you some direction. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: David Thibault [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 1:31:23 PM Subject: Re: Indexing very large files.

Walter and all, I had been bumping up the heap for my Java app (running outside of Tomcat) but I hadn't yet tried bumping up my Tomcat heap. That seems to have helped me upload the 8MB file, but it's crashing while uploading a 32MB file now. I just bumped Tomcat to 1024MB of heap, so I'm not sure what the problem is now. I suspect Walter was on to something, since it sort of fixed my problem. I will keep troubleshooting the Tomcat memory and go from there. Best, Dave

On 1/16/08, Walter Underwood [EMAIL PROTECTED] wrote: This error means that the JVM has run out of heap space. Increase the heap space. That is an option on the java command. I set my heap to 600 Meg and do it this way with Tomcat 6: JAVA_OPTS=-Xmx600M tomcat/bin/startup.sh wunder

On 1/16/08 8:33 AM, David Thibault [EMAIL PROTECTED] wrote: java.lang.OutOfMemoryError: Java heap space
Re: IOException: read past EOF during optimize phase
I did see that bug, which made me suspect Lucene. In my case, I tracked down the problem. It was my own application. I was using Java's FileChannel.transferTo function to copy my index from one location to another. One of the files is bigger than 2^31-1 bytes, so it was corrupted during the copy because I was just doing one pass. I now loop the copy function until the entire file is copied and everything works fine. DOH!

- Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, January 16, 2008 4:57:08 PM Subject: Re: IOException: read past EOF during optimize phase

This may be a Lucene bug... IIRC, I saw at least one other Lucene user with a similar stack trace. I think the latest Lucene version (2.3 dev) should fix it if that's the case. -Yonik [original message snipped]
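For reference, the fix Kevin describes -- looping FileChannel.transferTo until the whole file has been copied -- might look something like this (class and method names are illustrative; transferTo is free to copy fewer bytes than requested, which is exactly what bit the over-2GB segment file):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public class IndexFileCopier {
    // transferTo() may transfer fewer bytes than asked for (a single
    // call can be capped well below the file size on some platforms),
    // so keep calling it until the destination has the whole file.
    public static void copy(File src, File dst) throws IOException {
        try (FileInputStream in = new FileInputStream(src);
             FileOutputStream out = new FileOutputStream(dst)) {
            FileChannel source = in.getChannel();
            FileChannel sink = out.getChannel();
            long position = 0;
            long size = source.size();
            while (position < size) {
                position += source.transferTo(position, size - position, sink);
            }
        }
    }
}
```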
Logging in Solr
All, I'm new to Solr and Tomcat and I'm trying to track down some odd errors. How do I set up Tomcat to do fine-grained Solr-specific logging? I have looked around enough to know that it should be possible to do per-webapp logging in Tomcat 5.5, but the details are hard to follow for a newbie. Any suggestions would be greatly appreciated. Best, Dave
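One way to get per-webapp logging in Tomcat 5.5 is JULI: drop a logging.properties into the Solr webapp's WEB-INF/classes. A sketch, assuming Solr's loggers live under org.apache.solr (levels and paths here are illustrative, not a tested config):

```properties
# WEB-INF/classes/logging.properties (inside the Solr webapp)
handlers = org.apache.juli.FileHandler

org.apache.juli.FileHandler.level = FINE
org.apache.juli.FileHandler.directory = ${catalina.base}/logs
org.apache.juli.FileHandler.prefix = solr.

# fine-grained output from Solr itself
org.apache.solr.level = FINE
```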
Re: conceptual issues with solr
On Wed, 16 Jan 2008 16:54:56 +0100 Philippe Guillard [EMAIL PROTECTED] wrote:

Hi here, It seems that Lucene accepts any kind of XML document but Solr accepts only flat name/value pairs inside a document to be indexed. You'll find below what I'd like to do. Thanks for help of any kind! Phil

Hey Phil,

I need to index products (hotels) which have a price by date, then search them by date or date range and price range. Is there a way to do that with Solr?

yes - look at the data type definitions (somewhere in the wiki or the sample schema.xml) about data types for indexing dates and integers, etc. There are some caveats about using date data type fields (too much resolution can slow things down too much).

At the moment i have a document for each hotel:

<add>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="category">4 stars</field>
    ...
  </doc>
</add>

I would need to add my dates/price values like this, but it is forbidden in Solr indexing:

<date value="30/01/2008" price="200"/>
<date value="31/01/2008" price="150"/>

Otherwise i could define a default field (being an integer) and have as many fields as dates, like this:

<field name="30/01/2008">200</field>
<field name="31/01/2008">150</field>

Indexing would accept it, but i think i will not be able to search or sort by date for simple dates like that.

why not make use of dynamic fields ?
define, for example, bydate_* as a dynamic field, then you can do:

<field name="bydate_YYYYMMDD">...</field>

so, from your example:

<field name="bydate_20080130">200</field>
<field name="bydate_20080131">150</field>

The only solution i found at the moment is to create a document for each date/price:

<add>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="date">30/01/2008</field>
    <field name="price">200</field>
  </doc>
  <doc>
    <field name="url">http:///yyy</field>
    <field name="id">1</field>
    <field name="name">Hotel Opera</field>
    <field name="date">31/01/2008</field>
    <field name="price">150</field>
  </doc>
</add>

If the field 'id' is your schema's ID, then this wouldn't work, but sure, the approach would be valid, though a bit wasteful with respect to storing the metadata about the hotel. There was a thread some time ago in this list (a month or 2 ago) about clever uses of the field defined as ID in the schema.

then i'll have many documents for 1 hotel, and in order to search by date range i would need more documents like this:

<field name="date-range">28/01/2008 to 31/01/2008</field>
<field name="date-range">29/01/2008 to 31/01/2008</field>
<field name="date-range">30/01/2008 to 31/01/2008</field>

Since i need to index much other information about a hotel (address, telephone, amenities, etc...) i wouldn't like to duplicate too much information, and i think it would not be scalable to search first in a dates index then in a hotels index to retrieve hotel information. Any idea?

It strikes me you'd probably want a relational DB for this kind of thing B

_ {Beto|Norberto|Numard} Meijome

Unix is user friendly. However, it isn't idiot friendly. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
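The dynamicField declaration behind Norberto's suggestion would look roughly like this in schema.xml (a sketch; the pattern name and the field type are assumptions about your schema):

```xml
<!-- schema.xml: matches any field whose name starts with bydate_ -->
<dynamicField name="bydate_*" type="sint" indexed="true" stored="true"/>
```

Using the example schema's sortable-int type (sint) rather than the plain integer type is what makes a range query like bydate_20080130:[100 TO 300] behave correctly, giving you a price-range search for that day.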
Re: Solr schema filters
: For this exact example, use the WordDelimiterFilter exactly as : configured in the text fieldType in the example schema that ships : with solr. The trick is to then use some slop when querying. : : FT-50-43 will be indexed as FT, 50, 43 / 5043 (the last two tokens : are in the same position). : Now when querying, FT-5043 won't match without slop because there is : a 50 token in the middle of the indexed terms... so try FT-5043~1

FYI: this was the motivation for the qs param on dismax ...

http://localhost:8983/solr/select?debugQuery=true&qt=dismax&pf=&qf=text&q=FT-5043&qs=3

-Hoss
Re: DisMax Syntax
: I may be mistaken, but this is not equivalent to my query. In my query I have : matches for x1, matches for x2 without slop and/or boosting, and then a match : for x1 x2 (exact match) with slop (~) and a boost (^) in order to have : results with an exact match score better. : The total score is the sum of all the above. : Your query seems different

the structure of the query will look different in debugging, and the scores won't be exactly the same, but the concept is the same. -Hoss
Re: Fuzziness with DisMaxRequestHandler
: Is there any way to make the DisMaxRequestHandler a bit more forgiving with : user queries? I'm only getting results when the user enters a close to : perfect match. I'd like to allow near matches if possible, but I'm not sure : how to add something like this when special query syntax isn't allowed.

the principal goal of dismax was to keep the query string syntax as simple as possible, and move the mechanisms for controlling the query structure into other parameters. the idea of making queries fuzzy is an interesting one ... it's something I don't remember anyone ever asking about before, and I'd never really considered it (from a UI perspective I find did you mean style spellchecking to be a better approach than making a user's query implicitly fuzzy) but it seems like it would be pretty easy to add support for something ... one approach would be to add a numeric fuzz parameter that, if set, would make the DisMaxQueryParser return FuzzyQueries in place of TermQueries ... an alternate approach would be to allow per-field fuzziness by tweaking the qf syntax so instead of just fieldA^4, where 4 is the boost value, you could have fieldA^4~0.8, where 4 is the boost value and 0.8 is the fuzziness factor. I haven't thought about it hard enough to have an opinion about which would make more sense ... but the overall idea certainly seems like it could be a useful feature if someone wants to submit a patch. -Hoss
Re: Transactions and Solr Was: Re: Delte by multiple id problem
: Does anyone have more experience doing this kind of stuff and wants to share?

My advice: don't. I work with (or work with people who work with) about two dozen Solr indexes -- we don't attempt to update a single one of them in any sort of transactional way. Some of them are updated in real time (ie: as soon as the authoritative DB is updated by some code, the same code updates the Solr index); Some of them are updated in batch (ie: once every N minutes code checks a log of all logical objects modified/deleted from the DB and sends the adds/deletes to Solr); And some are only ever rebuilt from scratch every N hours (because the data in them isn't very time sensitive and rebuilding from scratch is easier than dealing with incremental or batch updates). But as I said: we never attempt to be transactional about it, for a few reasons:

1) why should it be part of the transaction? a Solr index is a denormalized/inverted index of data .. why should a tool (or any other process) be prevented from writing to an authoritative data store just because a non-authoritative copy of that data can't be updated? ... if you used MySQL with replication, would you really want to block all writes to the master just because there's a glitch in replicating to a slave?

2) why worry about it? It's really a non-issue. If an add or delete fails it's usually either developer error (ie: the code generating your add statements thinks there's a field that doesn't exist), a transient timeout (maybe because of a commit in progress) or network glitch (have the client retry once or twice), or in very rare instances the whole Solr index was completely jacked (either from disk failure, or OOM due to a huge spike in load) and we want to revert to a backup of the index in the short term and rebuild the index from scratch to play it safe.

3) why limit yourself?
you're going to want the ability to trigger arbitrary indexing of your data objects at any time -- if for no other reason than so when you decide to add a field to your index you can reindex them all -- so why make your index updating code inherently tied to your DB updating code?

As for your specific question along the lines of why can't we do a mix of adds and deletes all as part of one update message? the answer is because no one ever wrote any code to parse messages like that. BUT! ... that's not the question you really want to ask. the question you really want to ask is: *IF* someone wrote code to allow a mix of adds and deletes all as part of one update message, would it solve my problem of wanting to be able to modify my solr index transactionally? and the answer is No. Even if Solr accepted update messages that looked like this...

<update>
  <delete><id>42</id></delete>
  <add><doc><field name="id">7</field><field name="a">bb</field></doc></add>
  <add><doc><field name="id">666</field><field name="a"></field></doc></add>
</update>

...the low-level Lucene calls that it would be doing internally still aren't transactional, so the first delete and add might succeed, but if there was then some kind of internal error, or a timeout because the first add took a while (maybe it triggered a segment merge) and the second add didn't happen -- the first two commands would have still been executed, and there would be no way to roll back. In a nutshell: you would be no better off than if your client code had sent all three as separate update messages. -Hoss
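Hoss's point 2 -- have the client retry once or twice on transient timeouts -- is easy to wrap around whatever update client you use. A hypothetical helper (none of these names come from Solr):

```java
public class RetryWrapper {
    public interface Call {
        void run() throws Exception;
    }

    // Try the call up to maxAttempts times; back off a little longer
    // between attempts so a commit-in-progress has time to finish.
    public static void withRetries(Call call, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.run();
                return;                                   // success
            } catch (Exception e) {
                last = e;                                 // possibly transient: retry
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;                                       // exhausted all attempts
    }
}
```

In practice the Call body would be the HTTP POST of your add/delete message; anything that still fails after a couple of attempts is worth logging and investigating rather than retrying forever.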
Re: Restrict values in a multivalued field
: In my schema I have a multivalued field, and the values of that field are : stored and indexed in the index. I wanted to know if it's possible to : restrict the number of multiple values being returned from that field on a : search? And how? Because, let's say, if I have thousands of values in that : multivalued field, returning all of them would be a lot of load on the : system. So, I want to restrict it to send me only, say, 50 values out of the : thousands.

How would Solr pick which 50 to return? Why not index all thousand (so you can search on them) in an unstored field, and only store the 50 you want returned in a separate (unindexed) field. The index size will be exactly the same -- admittedly you'll have to send a bit more data over the wire for each doc you index, but that's probably a trivial amount (assuming the 50 values you want to store are representative of the thousands you index, you are talking about at most a 5% increase in the amount of data you send Solr on each add). -Hoss
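Hoss's split between "searchable" and "returnable" values comes down to two field declarations in schema.xml (field names here are hypothetical):

```xml
<!-- all thousands of values: searchable, never returned -->
<field name="values_search" type="string" indexed="true" stored="false" multiValued="true"/>
<!-- only the ~50 values you want in responses: returned, not searchable -->
<field name="values_display" type="string" indexed="false" stored="true" multiValued="true"/>
```

The client then sends the full value list in values_search and the short list in values_display with each add.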
Re: Fwd: Solr Text field
: searches. That is fine by me. But I'm still at the first question: : How do I conduct a wildcard search for ARIZONA on a solr.TextField? I tried

as I said: it really depends on what kind of index analyzer you have configured for the field -- the query analyzer isn't used at all when dealing with wildcard and prefix queries, so what you type in before the * must match the prefix of an actually indexed term that makes it into your index as a result of the index analyzer. If you add the debugQuery=true param to your queries, and compare the differences you see in the parsedquery_toString value between searching for field:AR*, field:Arizona, and field:ARIZONA, you'll see what I mean. If you take a look at the Luke request handler, which will show you the actual raw terms in your index (or the top N anyway), you can see what's really in there -- or -- if you use the analysis.jsp interface, it will show you what terms your analyzer will actually produce if you index the raw string ARIZONA ... whatever you see there is what you need to be searching for when you do your prefix queries. -Hoss
Re: batch indexing takes more time than shown on SOLR output -- something to do with IO?
: INFO: {add=[10485, 10488, 10489, 10490, 10491, 10495, 10497, 10498, ...(42 : more) : ]} 0 875 : : However, when timing this instruction on the client side (I use SolrJ -- : req.process(server)) I get totally different numbers (in the beginning the : client-side measured time is about 2 seconds on average, but after some time : it goes up to about 30-40 seconds, although the Solr-reported time : stays between 0.8-1.3 seconds)?

as Otis mentioned, that time is the raw processing of the request, not counting any network IO between the client and the server, or any time spent by the ResponseWriter formatting the response. You can get more accurate numbers about exactly how long the server spent doing all of these things from the access log of your servlet container (which should be recording the time only after every last byte is written back to the client). That said: there's really no reason for as big a discrepancy as you are describing, particularly on updates, where the ResponseWriter has almost nothing to do (30-40 seconds per update?!?!?!) I'm not very familiar with SolrJ, but are you by any chance using it in a way that sends a commit after every update command? (commits can get successively longer as your index gets bigger.)

: Does this have anything to do with costly IO-activity that is accounted for : in the SOLR output? If this is true, what tool do you recommend using to : monitor IO-activity?

Which IO-activity are you talking about? -Hoss
Re: FunctionQuery in a custom request handler
: How do I access the ValueSource for my DateField? I'd like to use a : ReciprocalFloatFunction from inside the code, adding it alongside others in the : main BooleanQuery.

The FieldType API provides a getValueSource method (so every FieldType picks its own best ValueSource implementation). -Hoss