Re: Newbie Design Questions
Hi, Yes, the XML is inside the DB in a CLOB. I would love to use XPath inside SqlEntityProcessor as it will save me tons of trouble with file-dumping (given that I am not able to post it). This is how I set up my DIH for DB import:

<dataSource driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@X" user="abc" password="***" batchSize="100"/>
<entity name="item" transformer="ClobTransformer" query="select xml_col from xml_table where xml_col IS NOT NULL">
  <entity dataSource="null" name="record" processor="XPathEntityProcessor" stream="false" url="${item.xml_col}" forEach="/record">
.. and so on

The problem with this is that it always fails with this error. I can see that the earlier SQL entity extraction and CLOB transformation are working, as the values show in the debug JSP (verbose mode with dataimport.jsp). However, no records are extracted for the entity. When I check the catalina.out file, it shows me the following error for entity name="record" (the XPath entity on top):

java.lang.NullPointerException at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)

I don't have the whole stack trace right now. If you need it I would be happy to recreate and post it.

Regards, Guna

On Jan 21, 2009, at 8:22 PM, Noble Paul നോബിള് नोब्ळ् wrote: On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju wrote: Thanks. Yes, the source of data is a DB. However the XML is also posted on updates via a publish framework, so I can just plug in an adapter here to listen for changes and post to SOLR. I was trying to use the XPathProcessor inside the SQLEntityProcessor and this did not work (using 1.3 - I did see support in 1.4). That is not a show stopper for me and I can just post them via the framework and use files for the first-time load.

XPathEntityProcessor works inside SqlEntityProcessor only if a db field contains XML. However, you can have a separate entity (at the root) to read from the db for deltas. Anyway, if your current solution works, stick to it.
I have seen a couple of answers on backup for crash scenarios. I just wanted to confirm: if I replace the index with the backed-up files, then I can simply start up Solr again and reindex the documents changed since the last backup? Am I right? The slaves will also automatically adjust to this.

Yes, you can replace an archived index and Solr should work just fine, but the docs added since the last snapshot was taken will be missing (of course :) )

Thanks, Guna

On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള് नोब्ळ् wrote: On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju wrote:

Hi All, We are considering SOLR for a large database of XMLs. I have some newbie questions - if there is a place I can go read about them do let me know and I will go read up :)

1. Currently we are able to pull the XMLs from a file system using FileDataSource. The DIH is convenient since I can map my XML fields using the XPathProcessor. This works for an initial load. However, after the initial load, we would like to 'post' changed XMLs to SOLR whenever the XML is updated in a separate system. I know we can post XMLs with 'add'; however, I was not sure how to do this and maintain the DIH mapping I use in data-config.xml. I don't want to save the file to disk and then call the DIH - I would prefer to directly post it. Do I need to use solrj for this?

What is the source of your new data? Is it a DB?

2. If my Solr schema.xml changes, do I HAVE to reindex all the old documents? Suppose in future we have newer XML documents that contain a new additional XML field. The old documents that are already indexed don't have this field and (so) I don't need to search on them with this field. However, the new ones need to be searchable on this new field. Can I just add this new field to the SOLR schema, restart the servers and just post the new documents, or do I need to reindex everything?

3. Can I back up the index directory?
So that in case of a disk crash I can restore this directory and bring Solr up. I realize that any documents indexed after this backup would be lost - I can however keep track of these outside and simply re-index documents 'newer' than that backup date. This question is really important to me in the context of using a Master Server with a replicated index. I would like to run this backup for the 'Master'.

The snapshot script can be used to take backups on commit.

4. In general, what happens when the Solr application is bounced? Is the index affected (anything maintained in memory)?

Regards, Guna

-- --Noble Paul -- --Noble Paul
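For readers hitting the same CLOB-plus-XPath wall: the pattern that Solr 1.4's DIH supports for this is a FieldReaderDataSource feeding the inner entity. A hedged sketch only, reusing the table and column names from the config above — the dataSource name, entity names, and field mappings are illustrative, so check them against the DIH wiki for your version:

```xml
<dataConfig>
  <dataSource driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@X" user="abc" password="***"/>
  <!-- reads the XML out of a field of the outer row, instead of fetching a URL -->
  <dataSource type="FieldReaderDataSource" name="xmlfield"/>
  <document>
    <entity name="item" transformer="ClobTransformer"
            query="select xml_col from xml_table where xml_col IS NOT NULL">
      <!-- ClobTransformer only converts fields marked clob="true" -->
      <field column="xml_col" clob="true"/>
      <entity name="record" dataSource="xmlfield"
              processor="XPathEntityProcessor"
              dataField="item.xml_col" forEach="/record">
        <!-- <field column="..." xpath="/record/..."/> mappings go here -->
      </entity>
    </entity>
  </document>
</dataConfig>
```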
Re: Using Threading while Indexing.
: I was trying to index three sets of documents having 2000 articles using : three threads of embedded solr server. But while indexing, it gives me an : exception "org.apache.lucene.store.LockObtainFailedException: Lock

something doesn't sound right here ... i'm no expert on embedding solr, i think perhaps you aren't embedding solr the recommended way ... if you were, then there would be only one SolrCore for your index, and only one IndexWriter -- all of your threads would then interact with this one (fully thread safe) SolrCore. It sounds like you are constructing separate objects (i forget which one it is you construct when embedding) in each thread and winding up with multiple SolrCores all competing for write access to the same index.

-Hoss
Re: Question about query syntax
: If I query for 'ferrar*' on my index, I will get 'ferrari' and 'red ferrari' : as a result. And that's fine. But if I try to query for 'red ferrar*', I : have to put it between double quotes as I want to guarantee that it will be used : as only one term, but the '*' is being ignored, as I don't get any result. : What should be the appropriate query for it?

when you add the double quotes you tell solr that the * should now be treated as a literal, and it's no longer a special character. it is possible to have query structures like what you are interested in, but i don't think it's possible to express it using the Lucene syntax.

-Hoss
Re: How to get XML response from CommonsHttpSolrServer through QueryResponse?
: Because I used server.setParser(new XMLResponseParser()), I get the : wt=xml parameter in the responseHeader, but the format of the : responseHeader is clearly no XML at all. I expect Solr does output XML, : but that the QueryResponse, when I print its contents, formats this as : the string above.

what you are seeing is the result of the toString method on the QueryResponse object generated by the parser -- the XML is gone at this point; the XMLResponseParser processed it to generate that object. that's really the main value add of using the SolrServer/ResponseParser APIs ... if you want the raw XML response there's almost no value in using those classes at all -- just use the HttpClient directly.

-Hoss
Re: Issue with dismaxrequestHandler for date fields
: Still search on any field (?q=searchTerm) gives the following error : "The request sent by the client was syntactically incorrect (Invalid Date : String:'searchTerm')."

because "searchTerm" isn't a valid date string

: Is this valid to define *_dt (i.e. date fields) in solrConfig.xml ?

if you really wanted to do a dismax search over some date fields, you could -- but only with date input. but you don't want to do a dismax query over date fields; based on your original question...

: > : > productPublicationDate_product_dt^1.0 : > productPublicationDate_product_dt[NOW-45DAYS TO NOW]^1.0 : >

...it seems that what you really want is to have a query clause matching docs from the last 45 days -- independent of what the searchTerm was. so do that with either an "fq" (filter query) or a bq (boost query), depending on your goal.

http://people.apache.org/~hossman/#xyproblem XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
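Hoss's fq suggestion can also be wired into the request handler config so every request to that handler gets the date restriction. A hedged sketch only — the field name comes from the thread above, but the handler name, class, and qf fields are illustrative placeholders:

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- text fields only in qf; dates don't belong here -->
    <str name="qf">name^1.0 description^0.5</str>
  </lst>
  <!-- appended to every query sent through this handler -->
  <lst name="appends">
    <str name="fq">productPublicationDate_product_dt:[NOW-45DAYS TO NOW]</str>
  </lst>
</requestHandler>
```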
Re: Passing analyzer to the queryparser plugin
: Is there a way to pass the analyzer to the query parser plugin

Solr uses a variant of the PerFieldAnalyzer -- you specify in the schema.xml what analyzer you want to use for each field. if you have some sort of *really* exotic situation, you can always design a custom QParser that looks at some query params to do something really interesting (it's the parser that decides how to use the analyzer). if you could explain what it is you are trying to do, we might be able to help you.

PS: please don't post duplicate copies of your questions to gene...@lucene ... solr-user is the appropriate place for questions like this.

-Hoss
Re: What can be the reason for stopping solr work after some time?
: i'm a newbie with solr. We have installed it together with ezfind on : EZ Publish web sites and it is working. But on one of the servers we : have this kind of problem. It works for example for 3 hours, and then at : one moment it stops working; searching and indexing do not work.

it's pretty hard to make any sort of guess as to what your problem might be without more information. is your java process still running? does it respond to any HTTP requests (ie: do the admin pages work?) what do the logs say?

-Hoss
Re: Newbie Design Questions
On Thu, Jan 22, 2009 at 7:02 AM, Gunaranjan Chandraraju wrote: > Thanks > > Yes, the source of data is a DB. However the XML is also posted on updates > via a publish framework. So I can just plug in an adapter here to listen for > changes and post to SOLR. I was trying to use the XPathProcessor inside the > SQLEntityProcessor and this did not work (using 1.3 - I did see support in > 1.4). That is not a show stopper for me and I can just post them via the > framework and use files for the first-time load.

XPathEntityProcessor works inside SqlEntityProcessor only if a db field contains XML. However, you can have a separate entity (at the root) to read from the db for deltas. Anyway, if your current solution works, stick to it.

> > I have seen a couple of answers on backup for crash scenarios. I just > wanted to confirm - if I replace the index with the backed-up files then I > can simply start up Solr again and reindex the documents changed since the > last backup? Am I right? The slaves will also automatically adjust to this.

Yes, you can replace an archived index and Solr should work just fine, but the docs added since the last snapshot was taken will be missing (of course :) )

> > Thanks > Guna > > > On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള് नोब्ळ् wrote: > >> On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju >> wrote: >>> >>> Hi All >>> We are considering SOLR for a large database of XMLs. I have some newbie >>> questions - if there is a place I can go read about them do let me know >>> and >>> I will go read up :) >>> >>> 1. Currently we are able to pull the XMLs from a file system using >>> FileDataSource. The DIH is convenient since I can map my XML fields >>> using >>> the XPathProcessor. This works for an initial load. However, after the >>> initial load, we would like to 'post' changed XMLs to SOLR whenever the >>> XML >>> is updated in a separate system.
I know we can post XMLs with 'add'; >>> however, >>> I was not sure how to do this and maintain the DIH mapping I use in >>> data-config.xml. I don't want to save the file to disk and then call >>> the DIH - I would prefer to directly post it. Do I need to use solrj for >>> this? >> >> What is the source of your new data? Is it a DB? >> >>> >>> 2. If my Solr schema.xml changes, do I HAVE to reindex all the old >>> documents? Suppose in future we have newer XML documents that contain a >>> new >>> additional XML field. The old documents that are already indexed don't >>> have this field and (so) I don't need to search on them with this field. >>> However, the new ones need to be searchable on this new field. Can I >>> just add this new field to the SOLR schema, restart the servers and just post >>> the new documents, or do I need to reindex everything? >>> >>> 3. Can I back up the index directory? So that in case of a disk crash I >>> can restore this directory and bring Solr up. I realize that any >>> documents >>> indexed after this backup would be lost - I can however keep track of >>> these >>> outside and simply re-index documents 'newer' than that backup date. >>> This >>> question is really important to me in the context of using a Master >>> Server >>> with a replicated index. I would like to run this backup for the 'Master'. >> >> The snapshot script can be used to take backups on commit. >>> >>> 4. In general, what happens when the Solr application is bounced? Is the >>> index affected (anything maintained in memory)? >>> >>> Regards >>> Guna >>> >> >> >> >> -- >> --Noble Paul > > -- --Noble Paul
Re: solr-duplicate post management
: what i need is, to log the existing urlid and new urlid (of course both will : not be same), when a .xml file of the same id (unique field) is posted. : : I want to make this by modifying the solr source. Which file do i need to : modify so that i could get the above details in the log? : : I tried with DirectUpdateHandler2.java (which removes the duplicate : entries), but efforts in vain.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's IndexWriter.updateDocument method when you have a uniqueKey and you aren't allowing duplicates -- this method doesn't give you any way to access the old document(s) that had that existing key.

The easiest way to make a change like what you are interested in might be an UpdateProcessor that does a lookup/search for the uniqueKey of each document about to be added to see if it already exists. that's probably about as efficient as you can get, and would be nicely encapsulated.

You might also want to take a look at SOLR-799, where some work is being done to create UpdateProcessors that can do "near duplicate" detection... http://wiki.apache.org/solr/Deduplication https://issues.apache.org/jira/browse/SOLR-799

-Hoss
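The lookup-before-add idea Hoss sketches can be illustrated without Solr at all. A toy sketch only — the class, map, and method names here are made up, and inside a real UpdateProcessor the map lookup would be an actual index search on the uniqueKey:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the index: uniqueKey -> urlid. When a duplicate key
// arrives, log both the existing and the new urlid, which is exactly the
// information the poster wants to capture.
public class DedupLogDemo {
    static final Map<String, String> seen = new HashMap<String, String>();

    // returns the previously stored urlid, or null if the id was new
    static String add(String id, String urlid) {
        String existing = seen.put(id, urlid); // last write wins, like Solr's overwrite
        if (existing != null) {
            System.out.println("duplicate id=" + id
                    + " existing urlid=" + existing + " new urlid=" + urlid);
        }
        return existing;
    }

    public static void main(String[] args) {
        add("doc1", "url-A");
        add("doc1", "url-B"); // logs url-A vs url-B
    }
}
```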
Re: Newbie Design Questions
Thanks. Yes, the source of data is a DB. However the XML is also posted on updates via a publish framework. So I can just plug in an adapter here to listen for changes and post to SOLR. I was trying to use the XPathProcessor inside the SQLEntityProcessor and this did not work (using 1.3 - I did see support in 1.4). That is not a show stopper for me and I can just post them via the framework and use files for the first-time load.

I have seen a couple of answers on backup for crash scenarios. I just wanted to confirm - if I replace the index with the backed-up files then I can simply start up Solr again and reindex the documents changed since the last backup? Am I right? The slaves will also automatically adjust to this.

Thanks, Guna

On Jan 20, 2009, at 9:37 PM, Noble Paul നോബിള് नोब्ळ् wrote: On Wed, Jan 21, 2009 at 5:15 AM, Gunaranjan Chandraraju wrote:

Hi All, We are considering SOLR for a large database of XMLs. I have some newbie questions - if there is a place I can go read about them do let me know and I will go read up :)

1. Currently we are able to pull the XMLs from a file system using FileDataSource. The DIH is convenient since I can map my XML fields using the XPathProcessor. This works for an initial load. However, after the initial load, we would like to 'post' changed XMLs to SOLR whenever the XML is updated in a separate system. I know we can post XMLs with 'add'; however, I was not sure how to do this and maintain the DIH mapping I use in data-config.xml. I don't want to save the file to disk and then call the DIH - I would prefer to directly post it. Do I need to use solrj for this?

What is the source of your new data? Is it a DB?

2. If my Solr schema.xml changes, do I HAVE to reindex all the old documents? Suppose in future we have newer XML documents that contain a new additional XML field. The old documents that are already indexed don't have this field and (so) I don't need to search on them with this field.
However, the new ones need to be searchable on this new field. Can I just add this new field to the SOLR schema, restart the servers and just post the new documents, or do I need to reindex everything?

3. Can I back up the index directory? So that in case of a disk crash I can restore this directory and bring Solr up. I realize that any documents indexed after this backup would be lost - I can however keep track of these outside and simply re-index documents 'newer' than that backup date. This question is really important to me in the context of using a Master Server with a replicated index. I would like to run this backup for the 'Master'.

The snapshot script can be used to take backups on commit.

4. In general, what happens when the Solr application is bounced? Is the index affected (anything maintained in memory)?

Regards, Guna

-- --Noble Paul
Re: Newbie Design Questions
Hi Grant, Thanks for the reply. My responses below.

The data is stored as XMLs. Each record/entity corresponds to an XML. The XML is of the form ... I have currently put it in a schema.xml and DIH handler as follows: schema.xml data-import.xml .. and so on. I don't need all the fields in the XML indexed or stored; I just include the ones I need in the schema.xml and data-import.xml.

Architecturally these XMLs are created, updated and stored in a separate system. Currently I am dumping the files in a directory and invoking the DIH. Actually we have a publishing channel that publishes each XML whenever it's updated or created. I'd really like to tap into this channel and directly post the XML to SOLR instead of saving it to a file and then invoking DIH. I'd also like to do it leveraging definitions like in the data-config XML, so that every time I can just post the original XML and the XPath configuration takes care of extracting the relevant fields.

I did take a look at Solr Cell in the link below. It seems to be only for 1.4, and currently 1.3 is the stable release.

Guna

On Jan 20, 2009, at 7:50 PM, Grant Ingersoll wrote: On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:

Hi All, We are considering SOLR for a large database of XMLs. I have some newbie questions - if there is a place I can go read about them do let me know and I will go read up :)

1. Currently we are able to pull the XMLs from a file system using FileDataSource. The DIH is convenient since I can map my XML fields using the XPathProcessor. This works for an initial load. However, after the initial load, we would like to 'post' changed XMLs to SOLR whenever the XML is updated in a separate system. I know we can post XMLs with 'add'; however, I was not sure how to do this and maintain the DIH mapping I use in data-config.xml. I don't want to save the file to disk and then call the DIH - I would prefer to directly post it. Do I need to use solrj for this?
You can likely use SolrJ, but then you probably need to parse the XML an extra time. You may also be able to use Solr Cell, which is the Tika integration, such that you can send the XML straight to Solr and have it deal with it. See http://wiki.apache.org/solr/ExtractingRequestHandler Solr Cell is a push technology, whereas DIH is a pull technology. I don't know how compatible this would be w/ DIH. Ideally, in the future, they will cooperate as much as possible, but we are not there yet.

As for your initial load, what if you ran a one-time XSLT processor over all the files and transformed them to SolrXML and then just posted them the normal way? Then, going forward, any new files could just be written out as SolrXML as well. If you can give some more info about your content, I think it would be helpful.

2. If my Solr schema.xml changes, do I HAVE to reindex all the old documents? Suppose in future we have newer XML documents that contain a new additional XML field. The old documents that are already indexed don't have this field and (so) I don't need to search on them with this field. However, the new ones need to be searchable on this new field. Can I just add this new field to the SOLR schema, restart the servers and just post the new documents, or do I need to reindex everything?

Yes, you should be able to add new fields w/o problems. Where you can run into problems is renaming, removing, etc.

3. Can I back up the index directory? So that in case of a disk crash I can restore this directory and bring Solr up. I realize that any documents indexed after this backup would be lost - I can however keep track of these outside and simply re-index documents 'newer' than that backup date. This question is really important to me in the context of using a Master Server with a replicated index. I would like to run this backup for the 'Master'.

Yes, just use the master/slave replication approach for doing backups.

4. In general what happens when the solr application is bounced?
Is the index affected (anything maintained in memory)? I would recommend doing a commit before bouncing and letting all indexing operations complete. Worst case, assuming you are using Solr 1.3 or later, is that you may lose what is in memory. -Grant -- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
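The SolrXML that Grant suggests generating via XSLT is just Solr's plain <add> document format. A hedged sketch with made-up field names — the real field names have to match the schema.xml in use:

```xml
<add>
  <doc>
    <field name="id">record-123</field>
    <field name="title">Example title pulled out of the source XML</field>
    <field name="body">Flattened text content for searching</field>
  </doc>
  <!-- more <doc> elements; then POST the file to /solr/update and commit -->
</add>
```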
Re: numFound problem
Ron Chan wrote: I'm using out-of-the-box Solr 1.3 that I had just downloaded, so I guess it is the StandardAnalyzer.

It seems WordDelimiterFilter worked for you. Go to the Admin console, click analysis, then give:

Field name: text
Field value (Index): SD/DDeck
verbose output: checked
highlight matched: checked
Field value (Query): SD DDeck
verbose output: checked

then click analyze.

regards, Koji

but shouldn't the returned docs equal numFound?

- Original Message - From: "Erick Erickson" To: solr-user@lucene.apache.org Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland, Portugal Subject: Re: numFound problem

It depends (tm). What analyzer are you using when indexing? I'd expect (though I haven't checked) that StandardAnalyzer would break SD/DDeck into two tokens, SD and DDeck, which corresponds nicely with what you're reporting. Other analyzers and/or filters are easy to specify. I'd recommend getting a copy of Luke and examining your index to see what's actually in it.

Best, Erick

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan wrote: I have a test search which I know should return 34 docs, and it does. However, numFound says 40. With debug enabled, I can see the 40 it has found. My search looks for "SD DDeck" in the description; 34 of them had "SD DDeck", with 6 of them having "SD/DDeck". Now, I can probably work around it if it had returned me the 40 docs, but the problem is it returns 34 docs but gives me a numFound of 40. Is this expected behavior?
RE: Performance "dead-zone" due to garbage collection
A ballpark calculation would be Collected Amount (from GC logging) / # of Requests. The GC logging can tell you how much it collected each time; no need to try and snapshot before-and-after heap sizes. However (big caveat here), this is a ballpark figure. The garbage collector is not guaranteed to collect everything, every time. It can stop collecting depending on how much time it spent. It may only collect from certain sections within memory (eden, survivor, tenured), etc. This may still be enough to make broad comparisons to see if you've decreased the overall garbage/request (via cache changes), but it will be quite a rough estimate.

-Todd

-Original Message- From: wojtekpia [mailto:wojte...@hotmail.com] Sent: Wednesday, January 21, 2009 3:08 PM To: solr-user@lucene.apache.org Subject: Re: Performance "dead-zone" due to garbage collection

(Thanks for the responses) My filterCache hit rate is ~60% (so I'll try making it bigger), and I am CPU bound. How do I measure the size of my per-request garbage? Is it (total heap size before collection - total heap size after collection) / # of requests to cause a collection? I'll try your suggestions and post back any useful results.

-- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21593661.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance "dead-zone" due to garbage collection
(Thanks for the responses) My filterCache hit rate is ~60% (so I'll try making it bigger), and I am CPU bound. How do I measure the size of my per-request garbage? Is it (total heap size before collection - total heap size after collection) / # of requests to cause a collection? I'll try your suggestions and post back any useful results. -- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21593661.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: XMLResponseWriter or PHPResponseWriter, which is faster?
After some tests with System.currentTimeMillis I have seen that the difference is almost unnoticeable ... but the PHP response was a little bit faster...

Marc Sturlese wrote: > > Hey there, I am using Solr as a back end and I don't mind how I get back > the results. Which is faster for Solr to create the response: using > XMLResponseWriter or PHPResponseWriter? > For my front end it is faster to process the response created by > PHPResponseWriter, but I would not like to gain speed parsing the > response only to lose it in the creation. > Thanks in advance > >

-- View this message in context: http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21593352.html Sent from the Solr - User mailing list archive at Nabble.com.
Suppressing logging for /admin/ping requests
Is there any way to suppress the logging of the /admin/ping requests? We have HAProxy configured to do health checks against this URI every couple of seconds and it is really cluttering our logs. I'd still like to see the logging from the other requestHandlers. Thanks! Todd
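One approach worth sketching: Solr 1.3 logs requests through java.util.logging, so a Filter on the relevant logger or handler can drop ping entries while leaving other request logging alone. The log message format below is an assumption — adjust the substring (and the logger name you attach the filter to) to match your own log output:

```java
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Drops any log record whose message mentions /admin/ping; everything
// else (e.g. /select request logging) passes through untouched.
public class PingLogFilter implements Filter {
    public boolean isLoggable(LogRecord record) {
        String msg = record.getMessage();
        return msg == null || !msg.contains("/admin/ping");
    }
}
```

Installation would be something like Logger.getLogger("org.apache.solr.core.SolrCore").setFilter(new PingLogFilter()) — that logger name is from memory, so verify it against the class names in your logs.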
Re: numFound problem
Oops, missed that. I'll have to defer to folks with more SOLR experience than I have; I've pretty much worked in Lucene.

Best, Erick

On Wed, Jan 21, 2009 at 3:57 PM, Ron Chan wrote: > I'm using out-of-the-box Solr 1.3 that I had just downloaded, so I guess it > is the StandardAnalyzer > > but shouldn't the returned docs equal numFound? > > > - Original Message - > From: "Erick Erickson" > To: solr-user@lucene.apache.org > Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland, > Portugal > Subject: Re: numFound problem > > It depends (tm). What analyzer are you using when indexing? > > I'd expect (though I haven't checked) that StandardAnalyzer > would break SD/DDeck into two tokens, SD and DDeck, which > corresponds nicely with what you're reporting. > > Other analyzers and/or filters are easy to specify. > > I'd recommend getting a copy of Luke and examining your > index to see what's actually in it. > > Best, > Erick > > On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan wrote: > > > I have a test search which I know should return 34 docs and it does > > > > however, numFound says 40 > > > > with debug enabled, I can see the 40 it has found > > > > my search looks for "SD DDeck" in the description > > > > 34 of them had "SD DDeck" with 6 of them having "SD/DDeck" > > > > now, I can probably work around it if it had returned me the 40 docs but the > > problem is it returns 34 docs but gives me a numFound of 40 > > > > is this expected behavior? > > > > > > >
Re: numFound problem
I'm using out-of-the-box Solr 1.3 that I had just downloaded, so I guess it is the StandardAnalyzer.

but shouldn't the returned docs equal numFound?

- Original Message - From: "Erick Erickson" To: solr-user@lucene.apache.org Sent: Wednesday, 21 January, 2009 20:49:56 GMT +00:00 GMT Britain, Ireland, Portugal Subject: Re: numFound problem

It depends (tm). What analyzer are you using when indexing? I'd expect (though I haven't checked) that StandardAnalyzer would break SD/DDeck into two tokens, SD and DDeck, which corresponds nicely with what you're reporting. Other analyzers and/or filters are easy to specify. I'd recommend getting a copy of Luke and examining your index to see what's actually in it.

Best, Erick

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan wrote: > I have a test search which I know should return 34 docs and it does > > however, numFound says 40 > > with debug enabled, I can see the 40 it has found > > my search looks for "SD DDeck" in the description > > 34 of them had "SD DDeck" with 6 of them having "SD/DDeck" > > now, I can probably work around it if it had returned me the 40 docs but the > problem is it returns 34 docs but gives me a numFound of 40 > > is this expected behavior? > > >
Re: numFound problem
It depends (tm). What analyzer are you using when indexing? I'd expect (though I haven't checked) that StandardAnalyzer would break SD/DDeck into two tokens, SD and DDeck, which corresponds nicely with what you're reporting. Other analyzers and/or filters are easy to specify. I'd recommend getting a copy of Luke and examining your index to see what's actually in it.

Best, Erick

On Wed, Jan 21, 2009 at 3:43 PM, Ron Chan wrote: > I have a test search which I know should return 34 docs and it does > > however, numFound says 40 > > with debug enabled, I can see the 40 it has found > > my search looks for "SD DDeck" in the description > > 34 of them had "SD DDeck" with 6 of them having "SD/DDeck" > > now, I can probably work around it if it had returned me the 40 docs but the > problem is it returns 34 docs but gives me a numFound of 40 > > is this expected behavior? > > >
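For context, the token splitting discussed in this thread typically comes from the example schema's "text" field type, whose WordDelimiterFilter breaks "SD/DDeck" into "SD" and "DDeck" at index time. A hedged, trimmed sketch of that analyzer chain — the exact attributes vary, so check your own schema.xml:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on non-alphanumerics and case changes: "SD/DDeck" -> "SD", "DDeck" -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```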
Re: storing complex types in a multiValued field
: > I guess most people store it as a simple string "key(separator)value". Is

or use dynamic fields to put the "key" into the field name...

: > > > multiValued="true" />

...could be...

...then index the value. if you omitNorms, the overhead of having many fields should be low - although i'm not 100% certain how it compares with having a single field and encoding the key/value in the field value.

-Hoss
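The field definitions in Hoss's reply were mangled in the archive; a hedged sketch of the dynamic-field variant he describes (the name pattern is illustrative):

```xml
<!-- one catch-all definition instead of declaring a field per key -->
<dynamicField name="attr_*" type="string" indexed="true" stored="true"
              omitNorms="true" multiValued="true"/>
```

With this in place, a key/value pair like color=red would be indexed as field attr_color with value red, rather than as a single "color|red" string.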
numFound problem
I have a test search which I know should return 34 docs, and it does. However, numFound says 40. With debug enabled, I can see the 40 it has found. My search looks for "SD DDeck" in the description; 34 of them had "SD DDeck", with 6 of them having "SD/DDeck". Now, I can probably work around it if it had returned me the 40 docs, but the problem is it returns 34 docs but gives me a numFound of 40. Is this expected behavior?
Re: Performance "dead-zone" due to garbage collection
Have you tried different sizes for the nursery? It should be several times larger than the per-request garbage. Also, check your cache sizes. Objects evicted from the cache are almost always tenured, so those will add to the time needed for a full GC. Guess who was tuning GC for a week or two in December ... wunder On 1/21/09 12:15 PM, "Feak, Todd" wrote: > From a high level view, there is a certain amount of garbage collection > that must occur. That garbage is generated per request, through a > variety of means (buffers, request, response, cache expulsion). The only > thing that JVM parameters can address is *when* that collection occurs. > > It can occur often in small chunks, or rarely in large chunks (or > anywhere in between). If you are CPU bound (which it sounds like you may > be), then you really have a decision to make. Do you want an overall > drop in performance, as more time is spent garbage collecting, OR do you > want spikes in garbage collection that are more rare, but have a > stronger impact. Realistically it becomes a question of one or the > other. You *must* pay the cost of garbage collection at some point in > time. > > It is possible that increasing cache size will decrease overall garbage > collection, as the churn caused by cache misses creates > additional garbage. Decreasing the churn could decrease garbage. BUT, > this really depends on your cache hit rates. If they are pretty high > (>90%) then it's probably not much of a factor. However, if you are in > the 50%-60% range, larger caches may help you in a number of ways. > > -Todd Feak > > -Original Message- > From: wojtekpia [mailto:wojte...@hotmail.com] > Sent: Wednesday, January 21, 2009 11:14 AM > To: solr-user@lucene.apache.org > Subject: Re: Performance "dead-zone" due to garbage collection > > > I'm using a recent version of Sun's JVM (6 update 7) and am using the > concurrent generational collector.
I've tried several other collectors, > none > seemed to help the situation. > > I've tried reducing my heap allocation. The search performance got worse > as > I reduced the heap. I didn't monitor the garbage collector in those > tests, > but I imagine that it would've gotten better. (As a side note, I do lots > of > faceting and sorting, I have 10M records in this index, with an > approximate > index file size of 10GB). > > This index is on a single machine, in a single Solr core. Would > splitting it > across multiple Solr cores on a single machine help? I'd like to find > the > limit of this machine before spreading the data to more machines. > > Thanks, > > Wojtek
RE: Performance "dead-zone" due to garbage collection
From a high level view, there is a certain amount of garbage collection that must occur. That garbage is generated per request, through a variety of means (buffers, request, response, cache expulsion). The only thing that JVM parameters can address is *when* that collection occurs. It can occur often in small chunks, or rarely in large chunks (or anywhere in between). If you are CPU bound (which it sounds like you may be), then you really have a decision to make. Do you want an overall drop in performance, as more time is spent garbage collecting, or do you want spikes in garbage collection that are more rare, but have a stronger impact? Realistically it becomes a question of one or the other. You *must* pay the cost of garbage collection at some point in time. It is possible that increasing cache size will decrease overall garbage collection, as the churn caused by cache misses creates additional garbage. Decreasing the churn could decrease garbage. BUT, this really depends on your cache hit rates. If they are pretty high (>90%) then it's probably not much of a factor. However, if you are in the 50%-60% range, larger caches may help you in a number of ways. -Todd Feak -Original Message- From: wojtekpia [mailto:wojte...@hotmail.com] Sent: Wednesday, January 21, 2009 11:14 AM To: solr-user@lucene.apache.org Subject: Re: Performance "dead-zone" due to garbage collection I'm using a recent version of Sun's JVM (6 update 7) and am using the concurrent generational collector. I've tried several other collectors, none seemed to help the situation. I've tried reducing my heap allocation. The search performance got worse as I reduced the heap. I didn't monitor the garbage collector in those tests, but I imagine that it would've gotten better. (As a side note, I do lots of faceting and sorting; I have 10M records in this index, with an approximate index file size of 10GB). This index is on a single machine, in a single Solr core.
Would splitting it across multiple Solr cores on a single machine help? I'd like to find the limit of this machine before spreading the data to more machines. Thanks, Wojtek -- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21590150.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem with WT parameter when upgrading from Solr1.2 to solr1.3
: Right, that's probably the crux of it - distributed search required : some extensions to response writers... things like handling : SolrDocument and SolrDocumentList. Grrr... that's right, I forgot that there wasn't any way to make SolrDocumentList implement DocList ... and I don't think this caveat got documented anywhere. I'm going to poke around and see if I can find some good places to point this out. -Hoss
RE: Performance "dead-zone" due to garbage collection
The large drop in old generation from 27GB->6GB indicates that things are getting into your old generation prematurely. They really don't need to get there at all, and should be collected sooner (more frequently). Look into increasing young generation sizes via JVM parameters. Also look into concurrent collection. You could even consider decreasing your JVM max memory. Obviously you aren't using it all; decreasing it will force the JVM to do more frequent (and therefore smaller) collections. Your average collection time may go up, but the individual performance drops will be smaller. Great details on memory tuning on Sun JDKs here: http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html There are other articles for 1.6 and 1.4 as well. -Todd -Original Message- From: wojtekpia [mailto:wojte...@hotmail.com] Sent: Wednesday, January 21, 2009 9:49 AM To: solr-user@lucene.apache.org Subject: Performance "dead-zone" due to garbage collection I'm intermittently experiencing severe performance drops due to Java garbage collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB physically available). Under heavy load, the performance drops approximately every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with the size of the old generation heap dropping from ~27GB to ~6GB. Is there a way to reduce the impact of garbage collection? A couple ideas we've come up with (but haven't tried yet) are: increasing the minimum heap size, more frequent (but hopefully less costly) garbage collection. Thanks, Wojtek -- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21588427.html Sent from the Solr - User mailing list archive at Nabble.com.
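Todd's suggestions (bigger young generation, concurrent collection, a smaller max heap, and measuring the effect of each change) translate into HotSpot flags along these lines. The sizes below are purely illustrative starting points, not recommendations for this specific workload, and would need tuning against the GC log:

```shell
# Illustrative HotSpot options for a Sun JDK 6 Solr instance.
# A smaller max heap forces more frequent but shorter collections;
# a larger young generation keeps per-request garbage out of the
# old generation. All sizes here are placeholders.
JAVA_OPTS="-Xms12g -Xmx12g"
JAVA_OPTS="$JAVA_OPTS -XX:NewSize=2g -XX:MaxNewSize=2g"
# Concurrent old-generation collector to shorten stop-the-world pauses:
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
# GC logging so the effect of each change can actually be measured:
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log"
export JAVA_OPTS
```

The key habit is changing one flag at a time and comparing pause times in gc.log, rather than adjusting several parameters at once.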
Re: Querying back with top few results in the same XMLWriter!
: I am using a ranking algorithm by modifying the XMLWriter to use a : formulation which takes the top 3 results and query with the 3 results and : now presents the result with as function of the results from these 3 : queries. Can anyone reply if I can take the top 3results and query with them : in the same reponsewriter? : Or is there any functionality provided by solr in either 1.2 or 1.3 : version. I'm not sure why you would do this in XMLWriter -- this is the type of logic that would make more sense in a RequestHandler or SearchComponent. In fact, this is very similar to what the MoreLikeThisComponent does -- except it sounds like you want to create a single query based on multiple documents (MLT creates a separate query for each document). Take a look at the SearchComponent API -- and use MLT as an example -- and I think you'll see a relatively easy way to accomplish your goal. -Hoss
Re: Performance "dead-zone" due to garbage collection
I would say that putting up more Solr instances, each one with its own data directory, could help if you can qualify your docs in such a way that you can put "A" type docs in index "A", "B" type docs in index "B", and so on. 2009/1/21 wojtekpia > > I'm using a recent version of Sun's JVM (6 update 7) and am using the > concurrent generational collector. I've tried several other collectors, > none > seemed to help the situation. > > I've tried reducing my heap allocation. The search performance got worse as > I reduced the heap. I didn't monitor the garbage collector in those tests, > but I imagine that it would've gotten better. (As a side note, I do lots of > faceting and sorting, I have 10M records in this index, with an approximate > index file size of 10GB). > > This index is on a single machine, in a single Solr core. Would splitting > it > across multiple Solr cores on a single machine help? I'd like to find the > limit of this machine before spreading the data to more machines. > > Thanks, > > Wojtek > -- > View this message in context: > http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21590150.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Alexander Ramos Jardim
Re: Sizing a Linux box for Solr?
One other useful piece of information would be how big you expect your indexes to be, which you should be able to estimate quite easily by indexing, say, 20,000 documents from the relevant databases. Of particular interest will be the delta between the size of the index at, say, 10,000 documents and 20,000, since size is related to the number of unique terms per field, and once you get past a certain number of terms, virtually every new term will already be in your index. Also, I think that the relevant metric is what the size is for *unstored* data, since storing the fields isn't particularly relevant to search response time (although it can *certainly* be relevant to *total* time if you assemble a lot of stored fields to return). If you're new to Lucene, the difference between stored and indexed is a bit confusing, so if the above is gibberish, you'd be well served by understanding the distinction before you go too far. Best, Erick On Wed, Jan 21, 2009 at 1:04 PM, Thomas Dowling wrote: > On 01/21/2009 12:25 PM, Matthew Runo wrote: > > At a certain level it will become better to have multiple smaller boxes > > rather than one huge one. I've found that even an old P4 with 2 gigs of > > ram has decent response time on our 150,000 item index with only a few > > users - but it quickly goes downhill if we get more than 5 or 6. How > > many documents are you going to be storing in your index? How much of > > them will be "stored" versus "indexed"? Will you be faceting on the > > results? > > Thanks for the tip on multiple boxes. We'll be hosting about 20 > databases total. A couple of them are in the 10- to 20-million record > range and a couple more are in the 5- to 10-million range. It's highly > structured data and I anticipate a lot of faceting and indexing almost > all the fields. > > > > > In general, I'd recommend a 64 bit processor with enough ram to store > > your index in ram - but that might not be possible with "millions" of > > records. 
Our 150,000 item index is about a gig and a half when optimized > > but yours will likely be different depending on how much you store. > > Faceting takes more memory than pure searching as well. > > > > This is very helpful. Thanks again. > > > -- > Thomas Dowling >
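Erick's delta-based estimate above is simple arithmetic: index 10,000 and then 20,000 documents, and use the marginal growth (which excludes the one-time cost of accumulating the term dictionary) to project the full index size. A sketch with made-up sample sizes:

```python
def project_index_size(size_10k_mb, size_20k_mb, total_docs):
    """Project index size from the marginal growth between two samples.

    The first 10k documents absorb most of the unique terms, so the
    10k -> 20k delta approximates the steady-state per-document cost.
    This is a rough planning estimate, not an exact model.
    """
    per_doc_mb = (size_20k_mb - size_10k_mb) / 10_000
    # size at 20k docs plus the marginal cost of the remaining documents
    return size_20k_mb + per_doc_mb * (total_docs - 20_000)

# Hypothetical measurements: 180 MB at 10k docs, 300 MB at 20k docs,
# projected out to a 10M-document database.
estimate_mb = project_index_size(180, 300, 10_000_000)
print(f"projected index size: {estimate_mb / 1024:.1f} GB")
```

The projection is for *unstored* (indexed-only) data, per the point above; stored fields add space roughly linearly and should be estimated separately if large fields are stored.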
Re: Performance "dead-zone" due to garbage collection
I'm using a recent version of Sun's JVM (6 update 7) and am using the concurrent generational collector. I've tried several other collectors, none seemed to help the situation. I've tried reducing my heap allocation. The search performance got worse as I reduced the heap. I didn't monitor the garbage collector in those tests, but I imagine that it would've gotten better. (As a side note, I do lots of faceting and sorting, I have 10M records in this index, with an approximate index file size of 10GB). This index is on a single machine, in a single Solr core. Would splitting it across multiple Solr cores on a single machine help? I'd like to find the limit of this machine before spreading the data to more machines. Thanks, Wojtek -- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21590150.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
>Hi Fergus, > >It seems a field it is expecting is missing from the XML. You mean there is some field in the document we are indexing that is missing? > >sourceColName="*fileAbsePath*"/> > >I guess "fileAbsePath" is a typo? Can you check if that is the cause? Well spotted. I had made a mess of sanitizing the config file I sent to you. I will in future make sure the stuff I am messing with matches what I send to the list. However there is no typo in the underlying file; at least not on that line:-) > > >On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie wrote: > >> Shalin >> >> Downloaded nightly for 21jan and tried DIH again. Its better but >> still broken. Dozens of embeded tags are stripped from documents >> but it now fails every few documents for no reason I can see. Manually >> removing embeded tags causes a given problem document to be indexed, >> only to have a it fail on one of the next few documents. I think the >> problem is still in stripHTML >> >> Here is the traceback. >> >> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start >> INFO: Server startup in 3377 ms >> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter >> readIndexerProperties >> INFO: Read dataimport.properties >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute >> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} >> status=0 QTime=13 >> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter >> doFullImport >> INFO: Starting Full Import >> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 >> deleteAll >> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit >> INFO: SolrDeletionPolicy.onInit: commits:num=2 >> >> >> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1] >> >> >> 
commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2] >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> INFO: last commit = 1232539612131 >> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder >> buildDocument >> SEVERE: Exception while processing: jc document : null >> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing >> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 >> Processing Document # 9 >>at >> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) >> at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) >>at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) >>at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) >>at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) >>at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) >>at >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) >> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException >>at >> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) >>... 
9 more >> Caused by: java.util.NoSuchElementException >>at >> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) >>... 10 more >> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter >> doFullImport >> SEVERE: Full Import failed >> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing >> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 >> Processing Document # 9 >>at >> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) >>
Re: Performance "dead-zone" due to garbage collection
How many boxes running your index? If it is just one, maybe distributing your index will get you a better performance during garbage collection. 2009/1/21 wojtekpia > > I'm intermittently experiencing severe performance drops due to Java > garbage > collection. I'm allocating a lot of RAM to my Java process (27GB of the > 32GB > physically available). Under heavy load, the performance drops > approximately > every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with > the size of the old generation heap dropping from ~27GB to ~6GB. > > Is there a way to reduce the impact of garbage collection? A couple ideas > we've come up with (but haven't tried yet) are: increasing the minimum heap > size, more frequent (but hopefully less costly) garbage collection. > > Thanks, > > Wojtek > > -- > View this message in context: > http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21588427.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Alexander Ramos Jardim
Re: Sizing a Linux box for Solr?
Definitely you will want to have more than one box for your index. You can take a look at distributed search and multicore at the wiki. 2009/1/21 Thomas Dowling > On 01/21/2009 12:25 PM, Matthew Runo wrote: > > At a certain level it will become better to have multiple smaller boxes > > rather than one huge one. I've found that even an old P4 with 2 gigs of > > ram has decent response time on our 150,000 item index with only a few > > users - but it quickly goes downhill if we get more than 5 or 6. How > > many documents are you going to be storing in your index? How much of > > them will be "stored" versus "indexed"? Will you be faceting on the > > results? > > Thanks for the tip on multiple boxes. We'll be hosting about 20 > databases total. A couple of them are in the 10- to 20-million record > range and a couple more are in the 5- to 10-million range. It's highly > structured data and I anticipate a lot of faceting and indexing almost > all the fields. > > > > > In general, I'd recommend a 64 bit processor with enough ram to store > > your index in ram - but that might not be possible with "millions" of > > records. Our 150,000 item index is about a gig and a half when optimized > > but yours will likely be different depending on how much you store. > > Faceting takes more memory than pure searching as well. > > > > This is very helpful. Thanks again. > > > -- > Thomas Dowling > -- Alexander Ramos Jardim
Incorrect Scoring
Can someone please make sense of why the following occurs in our system? The first item barely matches but scores higher than the second one, which matches all over the place. The second one is a MUCH better match but has a worse score. These are in the same query results. All I can see are the norms, but I don't know how to fix that.

Parsed Query Info

+((DisjunctionMaxQuery((realBrandName:brown | subCategory:brown^20.0 | productDescription:brown | width:brown | personality:brown^10.0 | brandName:brown | productType:brown^8.0 | productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown | productNameSearch:brown | heelHeight:brown | color:brown^10.0 | attrs:brown^5.0 | expandedGender:brown^0.5)~0.01) DisjunctionMaxQuery((realBrandName:shoe | subCategory:shoe^20.0 | productDescription:shoe | width:shoes | personality:shoe^10.0 | brandName:shoe | productType:shoe^8.0 | productId:shoes^10.0 | size:shoes^1.2 | category:shoe^10.0 | price:shoes | productNameSearch:shoe | heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0 | expandedGender:shoes^0.5)~0.01))~2) DisjunctionMaxQuery((realBrandName:"brown shoe"~1^10.0 | category:"brown shoe"~1^10.0 | productNameSearch:"brown shoe"~1 | productDescription:"brown shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 | personality:"brown shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 | productType:"brown shoe"~1^8.0)~0.01)

+(((realBrandName:brown | subCategory:brown^20.0 | productDescription:brown | width:brown | personality:brown^10.0 | brandName:brown | productType:brown^8.0 | productId:brown^10.0 | size:brown^1.2 | category:brown^10.0 | price:brown | productNameSearch:brown | heelHeight:brown | color:brown^10.0 | attrs:brown^5.0 | expandedGender:brown^0.5)~0.01 (realBrandName:shoe | subCategory:shoe^20.0 | productDescription:shoe | width:shoes | personality:shoe^10.0 | brandName:shoe | productType:shoe^8.0 | productId:shoes^10.0 | size:shoes^1.2 | category:shoe^10.0 | price:shoes | productNameSearch:shoe | heelHeight:shoes | color:shoe^10.0 | attrs:shoe^5.0 | expandedGender:shoes^0.5)~0.01)~2) (realBrandName:"brown shoe"~1^10.0 | category:"brown shoe"~1^10.0 | productNameSearch:"brown shoe"~1 | productDescription:"brown shoe"~1^2.0 | subCategory:"brown shoe"~1^20.0 | personality:"brown shoe"~1^2.0 | brandName:"brown shoe"~1^10.0 | productType:"brown shoe"~1^8.0)~0.01

DebugQuery Info

0.45851633 = (MATCH) sum of:
  0.45851633 = (MATCH) sum of:
    0.19769925 = (MATCH) max plus 0.01 times others of:
      0.19769925 = (MATCH) weight(color:brown^10.0 in 1407), product of:
        0.06819186 = queryWeight(color:brown^10.0), product of:
          10.0 = boost
          2.8991618 = idf(docFreq=19348, numDocs=129257)
          0.0023521234 = queryNorm
        2.8991618 = (MATCH) fieldWeight(color:brown in 1407), product of:
          1.0 = tf(termFreq(color:brown)=1)
          2.8991618 = idf(docFreq=19348, numDocs=129257)
          1.0 = fieldNorm(field=color, doc=1407)
    0.26081708 = (MATCH) max plus 0.01 times others of:
      0.26081708 = (MATCH) weight(subCategory:shoe^20.0 in 1407), product of:
        0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
          20.0 = boost
          2.9783995 = idf(docFreq=17874, numDocs=129257)
          0.0023521234 = queryNorm
        1.8614997 = (MATCH) fieldWeight(subCategory:shoe in 1407), product of:
          1.0 = tf(termFreq(subCategory:shoe)=1)
          2.9783995 = idf(docFreq=17874, numDocs=129257)
          0.625 = fieldNorm(field=subCategory, doc=1407)

0.4086538 = (MATCH) sum of:
  0.4086538 = (MATCH) sum of:
    0.19769925 = (MATCH) max plus 0.01 times others of:
      0.19769925 = (MATCH) weight(color:brown^10.0 in 75829), product of:
        0.06819186 = queryWeight(color:brown^10.0), product of:
          10.0 = boost
          2.8991618 = idf(docFreq=19348, numDocs=129257)
          0.0023521234 = queryNorm
        2.8991618 = (MATCH) fieldWeight(color:brown in 75829), product of:
          1.0 = tf(termFreq(color:brown)=1)
          2.8991618 = idf(docFreq=19348, numDocs=129257)
          1.0 = fieldNorm(field=color, doc=75829)
    0.21095455 = (MATCH) max plus 0.01 times others of:
      0.20865366 = (MATCH) weight(subCategory:shoe^20.0 in 75829), product of:
        0.14011127 = queryWeight(subCategory:shoe^20.0), product of:
          20.0 = boost
          2.9783995 = idf(docFreq=17874, numDocs=129257)
          0.0023521234 = queryNorm
        1.4891998 = (MATCH) fieldWeight(subCategory:shoe in 75829), product of:
          1.0 = tf(termFreq(subCategory:shoe)=1)
          2.9783995 = idf(docFreq=17874, numDocs=129257)
          0.5 = fieldNorm(field=subCategory, doc=75829)
      0.028179625 = (MATCH) weight(productType:shoe^8.0 in 75829), product of:
        0.029127462 = queryWeight(productType:shoe^8.0), product of:
          8.0 = boost
          1.5479344 = idf(docFreq=74728, numDocs=129257)
          0.0023521234 = queryNorm
        0.967459 = (MATCH) fieldWeight(productType:sho
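The explain output can be verified by hand: each term's weight is queryWeight × fieldWeight, where queryWeight = boost × idf × queryNorm and fieldWeight = tf × idf × fieldNorm. The sketch below recomputes the subCategory:shoe contribution for both documents using the numbers from the debug output; the only input that differs between them is the fieldNorm (0.625 vs 0.5), which is where Lucene's length normalization penalizes the document with the longer field:

```python
def term_weight(boost, idf, query_norm, tf, field_norm):
    """Lucene TF-IDF contribution of one term in one document
    (score explain formula: queryWeight * fieldWeight)."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight

# Constants taken directly from the debug output above.
IDF_SHOE = 2.9783995        # idf(docFreq=17874, numDocs=129257)
QUERY_NORM = 0.0023521234

# doc 1407: fieldNorm(subCategory) = 0.625
w_1407 = term_weight(20.0, IDF_SHOE, QUERY_NORM, 1.0, 0.625)
# doc 75829: fieldNorm(subCategory) = 0.5
w_75829 = term_weight(20.0, IDF_SHOE, QUERY_NORM, 1.0, 0.5)

print(w_1407, w_75829)  # ~0.26081708 and ~0.20865366, as in the explain
```

Since fieldNorm is the moving part, the usual remedies are omitNorms on fields where length shouldn't matter, or a custom Similarity with a flatter lengthNorm.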
Re: Performance "dead-zone" due to garbage collection
What JVM and garbage collector setting? We are using the IBM JVM with their concurrent generational collector. I would strongly recommend trying a similar collector on your JVM. Hint: how much memory is in use after a full GC? That is a good approximation to the working set. 27GB is a very, very large heap. Is that really being used or is it just filling up with garbage which makes the collections really long? We run with a 4GB heap and really only need that to handle indexing or starting new searchers. Searching only needs a 2GB heap for us. Our full GC pauses for under a half second. Way longer than I'd like, but that's Java (I still miss Python sometimes). wunder On 1/21/09 9:49 AM, "wojtekpia" wrote: > > I'm intermittently experiencing severe performance drops due to Java garbage > collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB > physically available). Under heavy load, the performance drops approximately > every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with > the size of the old generation heap dropping from ~27GB to ~6GB. > > Is there a way to reduce the impact of garbage collection? A couple ideas > we've come up with (but haven't tried yet) are: increasing the minimum heap > size, more frequent (but hopefully less costly) garbage collection. > > Thanks, > > Wojtek
Re: Word Delimiter struggles
On Mon, Jan 19, 2009 at 9:42 PM, David Shettler wrote: > Thank you Shalin, I'm in the process of implementing your suggestion, > and it works marvelously. Had to upgrade to solr 1.3, and had to hack > up acts_as_solr to function correctly. > > Is there a way to receive a search for a given field, and have solr > know to automatically check the two fields? I suppose not. If you use DisMax (defType=dismax) instead of the standard handler, the qf parameter can be used to specify all the fields you want to search for the given query. http://wiki.apache.org/solr/DisMaxRequestHandler -- Regards, Shalin Shekhar Mangar.
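With DisMax the client no longer needs to rewrite the query per field; it sets qf once and lets the handler search all listed fields. A sketch of building such a request (the host, core path, and field names are hypothetical):

```python
from urllib.parse import urlencode

# Field names and boosts here are illustrative; qf lists every field
# the user's single query string should be matched against.
params = {
    "q": "open source search",        # the raw user query
    "defType": "dismax",
    "qf": "title^2.0 description",    # search both fields, boost title
    "fl": "id,title,score",
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

Per-field boosts in qf (e.g. `title^2.0`) replace the manual field-by-field query expansion the standard handler would otherwise require.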
Re: Query Performance while updating the index
What exactly does Solr do when it receives a new Index? How does it keep serving while performing the updates? It seems that the part that causes the slowdown is this transition. Otis Gospodnetic wrote: > > This is an old and long thread, and I no longer recall what the specific > suggestions were. > My guess is this has to do with the OS cache of your index files. When > you make the large index update, that OS cache is useless (old files are > gone, new ones are in) and the OS cache has get re-warmed and this takes > time. > > Are you optimizing your index before the update? Do you *really* need to > do that? > How large is your update, what makes it big, and could you make it > smaller? > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: oleg_gnatovskiy >> To: solr-user@lucene.apache.org >> Sent: Tuesday, January 20, 2009 6:19:46 PM >> Subject: Re: Query Performance while updating teh index >> >> >> Hello again. It seems that we are still having these problems. Queries >> take >> as long as 20 minutes to get back to their average response time after a >> large index update, so it doesn't seem like the problem is the 12 second >> autowarm time. Are there any more suggestions for things we can try? >> Taking >> our servers out of teh loop for as long as 20 minutes is a bit of a >> hassle, >> and a risk. >> -- >> View this message in context: >> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21573927.html >> Sent from the Solr - User mailing list archive at Nabble.com. > > > -- View this message in context: http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p21588779.html Sent from the Solr - User mailing list archive at Nabble.com.
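To the question of how Solr keeps serving during an update: a commit (or replicated index) causes Solr to open a *new* searcher, warm it (autowarming caches and running any configured warming queries) while the *old* searcher continues serving, and only then swap them. If queries stay slow for many minutes after the swap, the remaining cost is typically re-warming the OS file cache, as Otis describes. Warming queries are configured in solrconfig.xml roughly like this — the query values are placeholders and should be replaced with queries representative of real traffic:

```xml
<!-- solrconfig.xml: run representative queries against the new
     searcher before it starts serving live traffic -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">popular category query</str>
      <str name="sort">price asc</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```

Warming queries that exercise the sorts and facets used in production populate the field caches before the swap, shrinking the post-update slow window.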
Re: problem with DIH and MySQL
I guess Noble meant the Solr log. On Tue, Jan 20, 2009 at 9:29 PM, Nick Friedrich < nick.friedr...@student.uni-magdeburg.de> wrote: > no, there are no exceptions > but I have to admit that I'm not sure what you mean by console > > > Zitat von Noble Paul നോബിള് नोब्ळ् : > > it got rolled back >> any exceptions on solr console? >> >> >> -- >> --Noble Paul >> >> > > > -- Regards, Shalin Shekhar Mangar.
Re: Sizing a Linux box for Solr?
On 01/21/2009 12:25 PM, Matthew Runo wrote: > At a certain level it will become better to have multiple smaller boxes > rather than one huge one. I've found that even an old P4 with 2 gigs of > ram has decent response time on our 150,000 item index with only a few > users - but it quickly goes downhill if we get more than 5 or 6. How > many documents are you going to be storing in your index? How much of > them will be "stored" versus "indexed"? Will you be faceting on the > results? Thanks for the tip on multiple boxes. We'll be hosting about 20 databases total. A couple of them are in the 10- to 20-million record range and a couple more are in the 5- to 10-million range. It's highly structured data and I anticipate a lot of faceting and indexing almost all the fields. > > In general, I'd recommend a 64 bit processor with enough ram to store > your index in ram - but that might not be possible with "millions" of > records. Our 150,000 item index is about a gig and a half when optimized > but yours will likely be different depending on how much you store. > Faceting takes more memory than pure searching as well. > This is very helpful. Thanks again. -- Thomas Dowling
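A back-of-the-envelope way to see why faceting on multi-million-document indexes is memory-hungry: field-cache-based faceting on a string field keeps roughly one ordinal per document plus the unique term values on the heap. The constants below are simplifying assumptions for illustration, not measurements, and real overhead varies by Lucene version and faceting method:

```python
def facet_field_heap_mb(num_docs, unique_terms, avg_term_bytes):
    """Crude heap estimate for faceting one string field.

    Assumes ~4 bytes per document for the ordinal array plus the term
    dictionary itself; actual per-version overhead will differ.
    """
    ord_array_bytes = num_docs * 4
    term_bytes = unique_terms * avg_term_bytes
    return (ord_array_bytes + term_bytes) / (1024 * 1024)

# 20M docs, 100k unique facet values of ~20 bytes each (all hypothetical)
mb = facet_field_heap_mb(20_000_000, 100_000, 20)
print(f"~{mb:.0f} MB per faceted field")
```

Multiplying by the number of simultaneously faceted fields gives a rough lower bound on the heap budget to plan for, on top of cache and indexing memory.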
Re: Performance Hit for Zero Record Dataimport
Created SOLR 974: https://issues.apache.org/jira/browse/SOLR-974 -- View this message in context: http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588634.html Sent from the Solr - User mailing list archive at Nabble.com.
Performance "dead-zone" due to garbage collection
I'm intermittently experiencing severe performance drops due to Java garbage collection. I'm allocating a lot of RAM to my Java process (27GB of the 32GB physically available). Under heavy load, the performance drops approximately every 10 minutes, and the drop lasts for 30-40 seconds. This coincides with the size of the old generation heap dropping from ~27GB to ~6GB. Is there a way to reduce the impact of garbage collection? A couple ideas we've come up with (but haven't tried yet) are: increasing the minimum heap size, more frequent (but hopefully less costly) garbage collection. Thanks, Wojtek -- View this message in context: http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21588427.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Hi Fergus, It seems a field it expects is missing from the XML. I guess "fileAbsePath" is a typo? Can you check if that is the cause? On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie wrote: > Shalin > > Downloaded nightly for 21jan and tried DIH again. It's better but > still broken. Dozens of embedded tags are stripped from documents > but it now fails every few documents for no reason I can see. Manually > removing embedded tags causes a given problem document to be indexed, > only to have it fail on one of the next few documents. I think the > problem is still in stripHTML > > Here is the traceback. > > Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start > INFO: Server startup in 3377 ms > Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter > readIndexerProperties > INFO: Read dataimport.properties > Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute > INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} > status=0 QTime=13 > Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter > doFullImport > INFO: Starting Full Import > Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 > deleteAll > INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX > Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit > INFO: SolrDeletionPolicy.onInit: commits:num=2 > > > commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1] > > > commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2] > Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy > updateCommits > INFO: last commit = 1232539612131 > Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder > buildDocument > SEVERE: Exception while processing: jc document : null > org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing > 
failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 > Processing Document # 9 >at > org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) >at > org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) >at > org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) > at > org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) >at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) >at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) >at > org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) >at > org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) >at > org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) >at > org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) >at > org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) > Caused by: java.lang.RuntimeException: java.util.NoSuchElementException >at > org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) >at > org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) >... 9 more > Caused by: java.util.NoSuchElementException >at > com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083) >at > org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) >at > org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174) >at > org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) >at > org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) >... 
10 more > Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter > doFullImport > SEVERE: Full Import failed > org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing > failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 > Processing Document # 9 >at > org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) >at > org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) >at > org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) > at > org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) >at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) >at > org.apache.solr.h
Re: DIH XPathEntityProcessor fails with docs containing DOCTYPE
On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie wrote: > > After looking at http://issues.apache.org/jira/browse/SOLR-964, > where > it seems this issue has been addressed, I had another go at indexing > documents > containing DOCTYPE. It failed as follows. > > That patch has not been committed to the trunk yet. I'll take it up. -- Regards, Shalin Shekhar Mangar.
Re: Performance Hit for Zero Record Dataimport
Yes please. Even though the fix is small, it is important enough to be mentioned in the release notes. On Wed, Jan 21, 2009 at 11:05 PM, wojtekpia wrote: > > Thanks Shalin, a short circuit would definitely solve it. Should I open a > JIRA issue? > > > Shalin Shekhar Mangar wrote: > > > > I guess Data Import Handler still calls commit even if there were no > > documents created. We can add a short circuit in the code to make sure > > that > > does not happen. > > > > -- > View this message in context: > http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588124.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Regards, Shalin Shekhar Mangar.
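The short circuit discussed here could be as simple as a guard that skips the commit when an import changed nothing. The sketch below is illustrative only — `ImportCommitGuard` and `shouldCommit` are hypothetical names, not the actual DataImportHandler internals:

```java
// Sketch of the proposed short circuit: skip the commit entirely when the
// import run added or deleted no documents. Class and method names are
// illustrative, not the real DataImportHandler API.
public class ImportCommitGuard {
    /** Decide whether a commit is worth issuing after an import run. */
    public static boolean shouldCommit(int docsAdded, int docsDeleted) {
        // Committing on an unchanged index still flushes caches and opens a
        // new searcher, which is the performance hit reported in this thread.
        return docsAdded > 0 || docsDeleted > 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldCommit(0, 0));   // zero-record import: skip commit
        System.out.println(shouldCommit(42, 0));  // normal import: commit
    }
}
```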
Re: Performance Hit for Zero Record Dataimport
Thanks Shalin, a short circuit would definitely solve it. Should I open a JIRA issue? Shalin Shekhar Mangar wrote: > > I guess Data Import Handler still calls commit even if there were no > documents created. We can add a short circuit in the code to make sure > that > does not happen. > -- View this message in context: http://www.nabble.com/Performance-Hit-for-Zero-Record-Dataimport-tp21572935p21588124.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sizing a Linux box for Solr?
At a certain level it will become better to have multiple smaller boxes rather than one huge one. I've found that even an old P4 with 2 gigs of RAM has decent response time on our 150,000 item index with only a few users - but it quickly goes downhill if we get more than 5 or 6. How many documents are you going to be storing in your index? How many of them will be "stored" versus "indexed"? Will you be faceting on the results? In general, I'd recommend a 64-bit processor with enough RAM to store your index in RAM - but that might not be possible with "millions" of records. Our 150,000 item index is about a gig and a half when optimized but yours will likely be different depending on how much you store. Faceting takes more memory than pure searching as well. I'm sure that we could work out some better suggestions with more information about your use case. http://www.nabble.com/Solr---User-f14480.html is a great place to go for searching the solr user list. -Matthew On Jan 21, 2009, at 8:55 AM, Thomas Dowling wrote: Is there a useful guide somewhere that suggests system configurations for machines that will support multiple large-ish Solr indexes? I'm working on a group of library databases (journal article citations + abstracts, mostly), and need to provide some sort of helpful information to our hardware people. Other than "lots", is there an answer for "We have X millions of records, of Y average size, with Z peak simultaneous users, so the memory needed for reasonable search performance is _"? Or is the limiting factor on search performance going to be something else? [Standard caveat: I did try checking the solr-user archives, but was hampered by the fact that there's no search function. The cobbler's children go barefoot.] -- Thomas Dowling Ohio Library and Information Network tdowl...@ohiolink.edu
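The data point above (150,000 docs at roughly 1.5 GB optimized) gives a crude linear rule of thumb. The sketch below is exactly that — a back-of-envelope scaling from one sample index, not a real sizing formula; actual bytes per document will vary with stored fields, term density, and faceting:

```java
public class IndexSizeEstimate {
    // Roughly 1.5 GB / 150,000 docs, from the index described in this thread.
    static final double GB_PER_DOC = 1.5 / 150_000;

    /** Very rough index size estimate in GB, assuming similar documents. */
    public static double estimateGb(long numDocs) {
        return numDocs * GB_PER_DOC;
    }

    public static void main(String[] args) {
        // Scale the sample index linearly to a "millions of records" scenario.
        System.out.printf("5M docs: ~%.0f GB%n", estimateGb(5_000_000));
    }
}
```

This only estimates on-disk index size; RAM for caches, faceting, and sort fields comes on top of it.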
Sizing a Linux box for Solr?
Is there a useful guide somewhere that suggests system configurations for machines that will support multiple large-ish Solr indexes? I'm working on a group of library databases (journal article citations + abstracts, mostly), and need to provide some sort of helpful information to our hardware people. Other than "lots", is there an answer for "We have X millions of records, of Y average size, with Z peak simultaneous users, so the memory needed for reasonable search performance is _"? Or is the limiting factor on search performance going to be something else? [Standard caveat: I did try checking the solr-user archives, but was hampered by the fact that there's no search function. The cobbler's children go barefoot.] -- Thomas Dowling Ohio Library and Information Network tdowl...@ohiolink.edu
Words that need protection from stemming, i.e., protwords.txt
Hi. Any good protwords.txt out there? In a fairly standard Solr analyzer chain, we use the English Porter analyzer like so: For most purposes the Porter stemmer does just fine, but occasionally words come along that really don't work out too well, e.g., "maine" is stemmed to "main" - clearly goofing up precision about "Maine" without doing much good for variants of "main". So - I have an entry for my protwords.txt. What else should go in there? Thanks for your ideas, Dave Woodward
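For reference, protwords.txt is a plain word list, one entry per line, with `#` starting a comment line. A starting point based on the "maine" example might look like this (whether entries should be lowercase depends on where the protection filter sits relative to the lowercase filter in your chain):

```
# protwords.txt -- words protected from stemming
# one word per line; lines starting with # are comments
maine
```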
Re: XMLResponseWriter or PHPResponseWriter, which is faster?
I have been doing some testing (with System.currentTimeMillis) and the difference is almost unappreciable, though PHPResponseWriter is a bit faster; I just would like to be sure I am right. Does anybody know for sure? Marc Sturlese wrote: > > Hey there, I am using Solr as a back end and I don't mind which format I get > the results back in. Which is faster for Solr to create the response, > XMLResponseWriter or PHPResponseWriter? > For my front end it is faster to process the response created by > PHPResponseWriter, but I would not like to gain speed parsing the > response only to lose it in the creation. > Thanks in advance > > -- View this message in context: http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21582667.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to schedule delta-import and auto commit
Hi Shalin, I have not faced any memory problems as of now. But I had previously asked a question regarding caching and memory (http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html)- -- So can I safely assume that we will not face any memory issue due to caching even if we do not send commit that frequently? (If we won't send commit, then a new searcher won't be initialized. So I can assume that the current searcher will correctly manage the cache without any memory issues.) Thanks, Manu - For which I got the following answer - No, you can't assume that. You have to set a good autoCommit value in your solrconfig.xml, so you don't run out of memory for not committing to Solr often, depending on your environment, memory share, doc size and update frequency. -- But my understanding is that the <autoCommit> tag works only if there is some update in the index. So I wanted to understand: if there are no updates, will caching create some problems with memory? Thanks, Manu Shalin Shekhar Mangar wrote: > > On Wed, Jan 21, 2009 at 4:31 PM, Manupriya > wrote: > >> >> 2. I had asked previously regarding caching and memory >> management( >> http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html >> ). >> So how do I schedule auto-commit for my Solr server. >> >> As per my understanding, the <autoCommit> tag in solrconfig.xml will call >> commit >> only if there has been an update. Right? So in case no document has been >> added/updated, how can I call auto commit? >> Note: My only purpose in calling commit without document change is to close >> the current Searcher and open a new Searcher. This is for better memory >> management with caching. > > > This confuses me. Why do you think Solr is mis-managing the memory? What > are > the problems you are encountering? > > -- > Regards, > Shalin Shekhar Mangar. 
> > -- View this message in context: http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21582357.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH XPathEntityProcessor fails with docs containing DOCTYPE
Hello, After looking looking at http://issues.apache.org/jira/browse/SOLR-964, where it seems this issue has been addressed, I had another go at indexing documents containing DOCTYPE. It failed as follows. This was using the nightly build from 21-jan 2009. The comments section within jira suggested my inital message had been replied to twice, I somehow missed them in my inbox! Regards Fergus. Jan 21, 2009 12:15:21 PM org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import Jan 21, 2009 12:15:21 PM org.apache.solr.core.SolrCore execute INFO: [jdocs] webapp=/solr path=/dataimport params={command=show-config} status=0 QTime=0 Jan 21, 2009 12:15:22 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument SEVERE: Exception while processing: jc document : null org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/j/dtd/jxml/data/news/f/f2008/frp70450.xmlrows processed :0 Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) at 
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:180) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1325) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869) at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664) at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527) at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:613) Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: (was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such file or directory) at [row,col {unknown-source}]: [3,81] at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) ... 27 more Caused by: com.ctc.wstx.exc.WstxParsingException: (was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such file or directory) at [row,col {unknown-source}]: [3,81] at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630) at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461) at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:475) at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358) at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351) at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988) at com.ctc
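The FileNotFoundException in the traceback above comes from the StAX parser trying to fetch the external DTD referenced by the document's DOCTYPE. As a rough standalone illustration of the kind of fix SOLR-964 is about (this is not the actual patch), a StAX reader can be told not to process the DTD, with an entity resolver as a fallback that supplies an empty external subset:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SkipDtdDemo {
    /** Return the first text node of the document, ignoring any DTD. */
    public static String firstText(String xml) throws Exception {
        XMLInputFactory f = XMLInputFactory.newInstance();
        // Do not read or validate the DTD, so a missing external DTD file
        // no longer aborts parsing.
        f.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        f.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Belt and braces: if the parser still asks for the external subset,
        // hand it an empty stream instead of touching the filesystem.
        f.setXMLResolver((publicID, systemID, baseURI, ns) ->
                new ByteArrayInputStream(new byte[0]));
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.CHARACTERS) return r.getText();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at a DTD that does not exist on disk.
        String xml = "<!DOCTYPE record SYSTEM \"no-such-file.dtd\"><record>ok</record>";
        System.out.println(firstText(xml));
    }
}
```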
XMLResponseWriter or PHPResponseWriter, which is faster?
Hey there, I am using Solr as a back end and I don't mind which format I get the results back in. Which is faster for Solr to create the response, XMLResponseWriter or PHPResponseWriter? For my front end it is faster to process the response created by PHPResponseWriter, but I would not like to gain speed parsing the response only to lose it in the creation. Thanks in advance -- View this message in context: http://www.nabble.com/XMLResponsWriter-or-PHPResponseWriter%2C-who-is-faster--tp21582204p21582204.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to schedule delta-import and auto commit
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya wrote: > > 2. I had asked previously regarding caching and memory > management( > http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html > ). > So how do I schedule auto-commit for my Solr server. > > As per my understanding, the <autoCommit> tag in solrconfig.xml will call > commit > only if there has been an update. Right? So in case no document has been > added/updated, how can I call auto commit? > Note: My only purpose in calling commit without document change is to close > the current Searcher and open a new Searcher. This is for better memory > management with caching. This confuses me. Why do you think Solr is mis-managing the memory? What are the problems you are encountering? -- Regards, Shalin Shekhar Mangar.
Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.
Shalin Downloaded nightly for 21jan and tried DIH again. Its better but still broken. Dozens of embeded tags are stripped from documents but it now fails every few documents for no reason I can see. Manually removing embeded tags causes a given problem document to be indexed, only to have a it fail on one of the next few documents. I think the problem is still in stripHTML Here is the traceback. Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start INFO: Server startup in 3377 ms Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=13 Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=2 commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1] commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2] Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1232539612131 Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument SEVERE: Exception while processing: jc document : null org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) Caused by: java.lang.RuntimeException: java.util.NoSuchElementException at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) ... 9 more Caused by: java.util.NoSuchElementException at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) ... 
10 more Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 Processing Document # 9 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java
Re: Error, when i update the rich text documents such as .doc, .ppt files.
Hi, Did you resolve the problem? I have the same problem. Thanks -- View this message in context: http://www.nabble.com/Error%2C-when-i-update-the-rich-text-documents-such-as-.doc%2C-.ppt-files.-tp20934026p21581483.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to schedule delta-import and auto commit
On Wed, Jan 21, 2009 at 4:31 PM, Manupriya wrote: > > Hi, > > Our Solr server is a standalone server and some web applications send HTTP > queries to search and get back the results. > > Now I have the following two requirements - > > 1. we want to schedule 'delta-import' at a specified time, so that we don't > have to explicitly send an HTTP request for delta-import. > http://wiki.apache.org/solr/DataImportHandler mentions 'Schedule full > imports and delta imports' but there is no detail. Even > http://www.ibm.com/developerworks/library/j-solr-update/index.html mentions > 'scheduler' but again there is no detail. There is no feature in Solr to schedule commands at specific intervals. You may have to do it externally. If you are using Linux you can set up a cron job to invoke curl at predetermined intervals > > 2. I had asked previously regarding caching and memory > management(http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html). > So how do I schedule auto-commit for my Solr server. > > As per my understanding, the <autoCommit> tag in solrconfig.xml will call commit > only if there has been an update. Right? So in case no document has been > added/updated, how can I call auto commit? > Note: My only purpose in calling commit without document change is to close > the current Searcher and open a new Searcher. This is for better memory > management with caching. > > Please let me know if there are any resources that I can refer to for these. > > Thanks, > Manu > > -- > View this message in context: > http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21580961.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
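The cron + curl approach suggested here can be a single crontab line — for example (host, port, handler path, and interval are illustrative; adjust to your deployment):

```
# Run a DIH delta-import every 15 minutes via cron + curl
*/15 * * * * curl -s "http://localhost:8983/solr/dataimport?command=delta-import"
```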
How to schedule delta-import and auto commit
Hi, Our Solr server is a standalone server and some web applications send HTTP queries to search and get back the results. Now I have the following two requirements - 1. we want to schedule 'delta-import' at a specified time, so that we don't have to explicitly send an HTTP request for delta-import. http://wiki.apache.org/solr/DataImportHandler mentions 'Schedule full imports and delta imports' but there is no detail. Even http://www.ibm.com/developerworks/library/j-solr-update/index.html mentions 'scheduler' but again there is no detail. 2. I had asked previously regarding caching and memory management(http://www.nabble.com/How-to-open-a-new-searcher-and-close-the-old-one-by-sending-HTTP-request-td21496803.html). So how do I schedule auto-commit for my Solr server. As per my understanding, the <autoCommit> tag in solrconfig.xml will call commit only if there has been an update. Right? So in case no document has been added/updated, how can I call auto commit? Note: My only purpose in calling commit without document change is to close the current Searcher and open a new Searcher. This is for better memory management with caching. Please let me know if there are any resources that I can refer to for these. Thanks, Manu -- View this message in context: http://www.nabble.com/How-to-schedule-delta-import-and-auto-commit-tp21580961p21580961.html Sent from the Solr - User mailing list archive at Nabble.com.
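For reference, the autoCommit section being asked about lives inside the update handler in solrconfig.xml; a typical shape looks like the following (the threshold values are examples only). Note that, as the question says, it only fires when there are pending updates:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after N pending docs or T milliseconds -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```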
Re: Problem in Date Unmarshalling from NamedListCodec.
I've solved the problem. It was a time zone problem. :) L.M. 2009/1/21 Luca Molteni : > Hello list, > > Using SolrJ with Solr 1.3 stable, NamedListCodec unmarshals in its readVal > method (line 161) the number > > 119914200 > > as a date (1 January 2008), > > while executing the same query with the Solr administration console > gives me a different date value: > > 2007-12-31T23:00:00Z > > It seems like there is a one hour difference between the two. > > At first, I thought about a local time zone (I'm in Milan, Italy), but > I've made some tries, and using the Date and Calendar constructors > with the right locale gives me the first of January. > > Could it be possible that the date got marshalled in the wrong way? > > Thank you very much. > > L.M. >
Re: SOLR Problem with special chars
Otis Gospodnetic schrieb: now it works : positionIncrementGap="100"> words="stopwords.txt"/> max="50" /> language="German" /> protected="protwords.txt" /> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> words="stopwords.txt"/> language="German" /> protected="protwords.txt" /> Greets, Ralf
Re: Solr Replication: disk space consumed on slave much higher than on master
On Wed, Jan 21, 2009 at 3:42 PM, Jaco wrote: > Thanks for the fast replies! > > It appears that I made a (probably classical) error... I didn't make the > change to solrconfig.xml to include the deletionPolicy setting when applying the > upgrade. I include this now, but the slave is not cleaning up. Will this be > done at some point automatically? Can I trigger this? Unfortunately, no. Lucene is supposed to clean up these old commit points automatically after each commit. Even if the deletionPolicy is not specified, the default is supposed to take effect. > > User access rights for the user are OK, this user is allowed to do anything > in the Solr data directory (Tomcat service is running from the SYSTEM account > (Windows)). > > Thanks, regards, > > Jaco. > > > 2009/1/21 Shalin Shekhar Mangar > >> Hi, >> >> There shouldn't be so many files on the slave. Since the empty index.x >> folders are not getting deleted, is it possible that the Solr process user does >> not have enough privileges to delete files/folders? >> >> Also, have you made any changes to the IndexDeletionPolicy configuration? >> >> On Wed, Jan 21, 2009 at 2:15 PM, Jaco wrote: >> >> > Hi, >> > >> > I'm running Solr nightly build of 20.12.2008, with patch as discussed on >> > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. >> > >> > On various systems running, I see that the disk space consumed on the >> slave >> > is much higher than on the master. One example: >> > - Master: 30 GB in 138 files >> > - Slave: 152 GB in 3,941 files >> > >> > Can anybody tell me what to do to prevent this from happening, and how to >> > clean up the slave? Also, there are quite some empty index.xxx >> > directories sitting in the slave's data dir. Can these be safely removed? >> > >> > Thanks a lot in advance, bye, >> > >> > Jaco. >> > >> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > -- --Noble Paul
Problem in Date Unmarshalling from NamedListCodec.
Hello list, Using SolrJ with Solr 1.3 stable, NamedListCodec unmarshals in its readVal method (line 161) the number 119914200 as a date (1 January 2008), while executing the same query with the Solr administration console gives me a different date value: 2007-12-31T23:00:00Z It seems like there is a one hour difference between the two. At first, I thought about a local time zone (I'm in Milan, Italy), but I've made some tries, and using the Date and Calendar constructors with the right locale gives me the first of January. Could it be possible that the date got marshalled in the wrong way? Thank you very much. L.M.
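The follow-up in this thread confirms it was a time zone issue: the same instant prints as 1 January 2008 in CET but as 2007-12-31T23:00:00Z in UTC, which is what the admin console shows. A small sketch of the effect (the epoch value below is an illustrative millisecond timestamp for that instant, not the exact number from the original mail):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateZoneDemo {
    /** Format an epoch-millis instant in the given time zone. */
    public static String format(long epochMillis, String zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone(zone));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // 2007-12-31T23:00:00Z expressed as milliseconds since the epoch.
        long instant = 1199142000000L;
        // Solr stores and prints dates in UTC ...
        System.out.println(format(instant, "UTC"));         // 2007-12-31T23:00:00
        // ... but a client in Milan (CET, UTC+1 in winter) sees the next day.
        System.out.println(format(instant, "Europe/Rome")); // 2008-01-01T00:00:00
    }
}
```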
Re: Solr Replication: disk space consumed on slave much higher than on master
Thanks for the fast replies! It appears that I made a (probably classical) error... I didn't make the change to solrconfig.xml to include the deletionPolicy setting when applying the upgrade. I include this now, but the slave is not cleaning up. Will this be done at some point automatically? Can I trigger this? User access rights for the user are OK, this user is allowed to do anything in the Solr data directory (Tomcat service is running from the SYSTEM account (Windows)). Thanks, regards, Jaco. 2009/1/21 Shalin Shekhar Mangar > Hi, > > There shouldn't be so many files on the slave. Since the empty index.x > folders are not getting deleted, is it possible that the Solr process user does > not have enough privileges to delete files/folders? > > Also, have you made any changes to the IndexDeletionPolicy configuration? > > On Wed, Jan 21, 2009 at 2:15 PM, Jaco wrote: > > > Hi, > > > > I'm running Solr nightly build of 20.12.2008, with patch as discussed on > > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. > > > > On various systems running, I see that the disk space consumed on the > slave > > is much higher than on the master. One example: > > - Master: 30 GB in 138 files > > - Slave: 152 GB in 3,941 files > > > > Can anybody tell me what to do to prevent this from happening, and how to > > clean up the slave? Also, there are quite some empty index.xxx > > directories sitting in the slave's data dir. Can these be safely removed? > > > > Thanks a lot in advance, bye, > > > > Jaco. > > > > > > -- > Regards, > Shalin Shekhar Mangar. >
Re: SOLR Problem with special chars
Otis Gospodnetic schrieb: Ralf, Can you paste the part of your schema.xml where you defined the relevant field? Otis Sure ! positionIncrementGap="100"> language="German" /> language="German" /> Greets
Re: Solr Replication: disk space consumed on slave much higher than on master
Hi, There shouldn't be so many files on the slave. Since the empty index.x folders are not getting deleted, is it possible that the Solr process user does not have enough privileges to delete files/folders? Also, have you made any changes to the IndexDeletionPolicy configuration? On Wed, Jan 21, 2009 at 2:15 PM, Jaco wrote: > Hi, > > I'm running Solr nightly build of 20.12.2008, with patch as discussed on > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. > > On various systems running, I see that the disk space consumed on the slave > is much higher than on the master. One example: > - Master: 30 GB in 138 files > - Slave: 152 GB in 3,941 files > > Can anybody tell me what to do to prevent this from happening, and how to > clean up the slave? Also, there are quite some empty index.xxx > directories sitting in the slave's data dir. Can these be safely removed? > > Thanks a lot in advance, bye, > > Jaco. > -- Regards, Shalin Shekhar Mangar.
Re: Solr Replication: disk space consumed on slave much higher than on master
the index.xxx directories are supposed to be deleted (automatically). you can safely delete them. But, I am wondering why the index files in the slave did not get deleted. By default the deletionPolicy is KeepOnlyLastCommit. On Wed, Jan 21, 2009 at 2:15 PM, Jaco wrote: > Hi, > > I'm running Solr nightly build of 20.12.2008, with patch as discussed on > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. > > On various systems running, I see that the disk space consumed on the slave > is much higher than on the master. One example: > - Master: 30 GB in 138 files > - Slave: 152 GB in 3,941 files > > Can anybody tell me what to do to prevent this from happening, and how to > clean up the slave? Also, there are quite some empty index.xxx > directories sitting in the slaves data dir. Can these be safely removed? > > Thanks a lot in advance, bye, > > Jaco. > -- --Noble Paul
Re: Solr Replication: disk space consumed on slave much higher than on master
Hello, > Hi, > I'm running Solr nightly build of 20.12.2008, with patch as discussed on > http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. > On various systems running, I see that the disk space consumed on the slave > is much higher than on the master. One example: > - Master: 30 GB in 138 files > - Slave: 152 GB in 3,941 files > Can anybody tell me what to do to prevent this from happening, and how to > clean up the slave? Also, there are quite some empty index.xxx > directories sitting in the slaves data dir. Can these be safely removed? > Thanks a lot in advance, bye, > Jaco. Slaves use much more disk space after some time because they keep snapshots of the index you pull from the master. Look at the snapcleaner script, you can use it to automatically clean data directory. I hope that helps. -- Regards, Rafał Kuć
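The snapcleaner script mentioned above can be run manually or from cron on the slave. The invocation below is an assumption based on the 1.x collection-distribution scripts (verify the option names against the script's own usage output before relying on them):

```
# Assumed usage of the 1.x snapcleaner script -- check its usage output first.
# Remove snapshots older than 1 day on the slave:
solr/bin/snapcleaner -D 1

# ...or keep only the 2 most recent snapshots:
solr/bin/snapcleaner -N 2
```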
Solr Replication: disk space consumed on slave much higher than on master
Hi, I'm running Solr nightly build of 20.12.2008, with patch as discussed on http://markmail.org/message/yq2ram4f3jblermd, using Solr replication. On various systems running, I see that the disk space consumed on the slave is much higher than on the master. One example: - Master: 30 GB in 138 files - Slave: 152 GB in 3,941 files Can anybody tell me what to do to prevent this from happening, and how to clean up the slave? Also, there are quite some empty index.xxx directories sitting in the slaves data dir. Can these be safely removed? Thanks a lot in advance, bye, Jaco.