Changing schema and reindexing documents
Hi, I have lots of documents in my Solr index. Now I have a requirement to change its schema and add a new field. What should I do so that all the documents keep working after the schema change? Thanks, Riz
Re: Changing schema and reindexing documents
On Wed, Oct 6, 2010 at 11:59 AM, M.Rizwan griz...@gmail.com wrote: Hi, I have lots of documents in my solr index. Now I have a requirement to change its schema and add a new field. What should I do, so that all the documents keep working after schema change? [...] You will need to reindex if the schema is changed. Regards, Gora
Re: Changing schema and reindexing documents
If you add a field to the schema file and restart Solr, the existing documents won't have that field; new documents that you index will. If this is OK, you are safe. In general, don't change the schema without reindexing. You can trip over the weirdest problems. On Wed, Oct 6, 2010 at 12:31 AM, Gora Mohanty g...@mimirtech.com wrote: [...] You will need to reindex if the schema is changed. Regards, Gora -- Lance Norskog goks...@gmail.com
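In this vintage of Solr (1.4.x), the change Lance describes is a one-line addition inside the `<fields>` block of schema.xml; the field name below is invented for illustration:

```xml
<!-- schema.xml: hypothetical new field; documents indexed before this
     change simply won't have it until they are reindexed -->
<field name="new_field" type="string" indexed="true" stored="true"/>
```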
Trouble with exception Document [Null] missing required field DocID
Hi all, I'm new to the Solr extract request handler. I want to index PDF documents, but when I submit a document to Solr using curl I get the following exception: Document [Null] missing required field DocID. My curl command is like:

curl "http://localhost:8983/solr1/update/extract?literal.DocID=123&fmap.content=Contents&commit=true" -F myfi...@d:/solr/apache-solr-1.4.0/docs/filename1.pdf

and here are my schema fields:

<fields>
  <field name="DocID" type="string" indexed="true" stored="true"/>
  <field name="Contents" type="text" indexed="true" stored="true"/>
  <dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
</fields>
<uniqueKey>DocID</uniqueKey>

Please help me if I'm missing something. Regards, Ahsan
Help needed on indexing Zope CMS content
Hi! We are planning to periodically index several MySQL database tables plus a Zope CMS document tree in Solr. Indexing the Zope DB seems to be tricky though. Has anyone here done this and could provide a URL or sample code to a solution? Something running as a python script would be great, but different approaches are welcome, too. I know this is only slightly related to Solr, but since Solr has spread so widely, I wanted to give it a try here. Thanks in advance! Marian
Re: Help needed on indexing Zope CMS content
On Wed, Oct 6, 2010 at 1:58 PM, Marian Steinbach mar...@sendung.de wrote: Hi! We are planning to periodically index several MySQL database tables plus a Zope CMS document tree in Solr. Indexing the Zope DB seems to be tricky though. [...] Been a while since I touched Zope, but there seems to be something called collective.solr available: * http://pypi.python.org/pypi/collective.solr * http://www.contentmanagementsoftware.info/plone/collective.solr Does this not meet your needs? Regards, Gora
Experience running Solr on ISCSI
Hi. Our hardware department is planning on moving some stuff to new machines (at our request). They are suggesting using virtualization (some Cisco solution) on those machines and having the 'disk' connected via iSCSI. Does anybody have experience running a Solr index on an iSCSI drive? We have already tried NFS, but that slows the indexing process down too much, about 12 times slower, so NFS is a no-go. I could have known that, as it is mentioned in a lot of places to avoid NFS, but I can't find info about iSCSI. Does anybody have experience running Solr in a virtualized environment? Is it resilient enough that it keeps working when the virtual machine is transferred to a different hardware node? Thanks
Re: Solr UIMA integration
Hi Tommaso, I will try the service call outside Solr/UIMA. The text I am using is (file name: Entity.xml):

<add>
  <doc>
    <field name="reference">Entity.xml</field>
    <field name="content">Senator Dick Durbin (D-IL) Chicago, March 3, 2007.</field>
    <field name="title">Entity Extraction</field>
  </doc>
</add>

and I am using curl to index it: curl http://localhost:8080/solr/update -F solr.bo...@entity.xml Thanks, Mahesh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-UIMA-integration-tp1528253p1642093.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr UIMA integration
Hi Mahesh, the issue here is that you're not sending a <field name="text">...</field> to Solr, which is the field UIMAUpdateRequestProcessor extracts the text to analyze from :) In fact, by default UIMAUpdateRequestProcessor extracts the text to analyze from that field and sends that value to a UIMA pipeline. Obviously you could choose to customize this behavior, making UIMAUpdateRequestProcessor read values from every field that is being indexed in the document, or from another field. However, this made me realize that in such situations the field value is the String "null" and not a null object, as I expected; so line 57 in UIMAUpdateRequestProcessor should be changed as follows to prevent such errors:

... if (textFieldValue != null && !"".equals(textFieldValue) && !"null".equals(textFieldValue)) { ...

Hope this helps, Tommaso 2010/10/6 maheshkumar maheshkuma...@gmail.com [...]
Re: Help needed on indexing Zope CMS content
Marian Steinbach-3 wrote: We are planning to periodically index several MySQL database tables plus a Zope CMS document tree in Solr. Indexing the Zope DB seems to be tricky though. Has anyone here done this and could provide a URL or sample code to a solution? Something running as a python script would be great, but different approaches are welcome, too. We implemented SolrIndex for a customer and released it as a ZCatalog plugin. You can find it here: http://pypi.python.org/pypi/alm.solrindex/ It has no dependencies on Plone and is pretty much drop-in: just add it as an index in your site's catalog. Cheers, Calvin - -- Six Feet Up, Inc. | Sponsor of Plone Conference 2010 (Oct. 25th-31st) Direct Line: +1 (317) 861-5948 x602 Email: cal...@sixfeetup.com Try Plone 4 Today at: http://plone4demo.com How am I doing? Please contact my manager Gabrielle Hendryx-Parker at gabrie...@sixfeetup.com with any feedback. -- View this message in context: http://lucene.472066.n3.nabble.com/Help-needed-on-indexing-Zope-CMS-content-tp1641160p1642806.html Sent from the Solr - User mailing list archive at Nabble.com.
phrase query with autosuggest (SOLR-1316)
It seemed like SOLR-1316 was a little too long to continue the conversation in. Is there support for quotes indicating a phrase query? For example, my autosuggest query for "mike sha" ought to return "mike shaffer", "mike sharp", etc. Instead I get suggestions for "mike" and for "sha", resulting in a collated result of "mike r meyer shaw". Cheers, Mike
Re: having problem about Solr Date Field.
On Wed, Oct 6, 2010 at 9:17 PM, Kouta Osabe kota0919was...@gmail.com wrote: Hi, Gora. Thanks for your advice. I tried to write the following code based on it.

Case 1: the pub_date column (MySQL) is 2010-09-27 00:00:00. I wrote:

SolrJDto info = new SolrJDto();
TimeZone tz2 = TimeZone.getTimeZone("UTC+9");
Calendar cal = Calendar.getInstance(tz2);
// publishDate represents the publish_date column in the Solr schema; the type is pdate.
info.publishDate = rs.getDate("publish_date", cal);

Then I got 2010-09-27T00:00:00Z in the Solr admin. This result is what I expected.

Case 2: the reg_date column (MySQL) is 2010-09-27 11:22:33. I wrote:

TimeZone tz2 = TimeZone.getTimeZone("UTC+9");
Calendar cal = Calendar.getInstance(tz2);
info.publishDate = rs.getDate("reg_date", cal);

Then I got 2010-09-27T02:22:33Z in the Solr admin. This result is not what I expected. [...] It seems like MySQL is doing a UTC conversion for one column and not for the other. I can think of two possible reasons for this: * If they are from different MySQL servers, it is possible that the timezone is set differently for the two servers. Please see http://dev.mysql.com/doc/refman/5.1/en/time-zone-support.html for how to set the timezone for MySQL. (It is also possible for the client connection to set a connection-specific timezone, but I do not think that is what is happening here.) * The types of the columns are different, e.g., one could be a DATETIME and the other a TIMESTAMP. The MySQL timezone link above also explains how these are handled. Without going through the above, could you not just set the timezone for reg_date to UTC to get the result that you expect? Regards, Gora.
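The shift Kouta observes in Case 2 is exactly a UTC+9 → UTC conversion. As a sanity check, here is the arithmetic in plain Python datetime (illustrative only; this is not the JDBC code path):

```python
from datetime import datetime, timedelta, timezone

# reg_date as stored in MySQL, interpreted as local time in UTC+9
jst = timezone(timedelta(hours=9))
reg_date = datetime(2010, 9, 27, 11, 22, 33, tzinfo=jst)

# what Solr ends up showing after conversion to UTC
utc_value = reg_date.astimezone(timezone.utc)
print(utc_value.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2010-09-27T02:22:33Z
```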
RE: phrase query with autosuggest (SOLR-1316)
My simple but effective solution to that problem was to replace the whitespace in the items you index for autosuggest with some special character; then your wildcarding will work with the whole phrase, as you desire. Index: mike_shaffer Query: mike_sha* -Original Message- From: mike anderson [mailto:saidthero...@gmail.com] Sent: Wednesday, October 06, 2010 7:33 AM To: solr-user@lucene.apache.org Subject: phrase query with autosuggest (SOLR-1316) [...]
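Robert's whitespace-replacement trick can be sketched outside Solr; the helper names below are made up for illustration:

```python
def to_suggest_token(phrase: str) -> str:
    """Replace whitespace so the whole phrase indexes as one token."""
    return "_".join(phrase.split())

def matches(indexed: str, user_input: str) -> bool:
    """Simulate the wildcard query: prefix match on the joined token."""
    return indexed.startswith(to_suggest_token(user_input))

names = ["mike shaffer", "mike sharp", "mike r meyer shaw"]
tokens = [to_suggest_token(n) for n in names]
print([t for t in tokens if matches(t, "mike sha")])  # ['mike_shaffer', 'mike_sharp']
```

Because the entire phrase is one token, the prefix query can no longer match a word in the middle of a different phrase, which is what produced the unwanted "mike r meyer shaw" collation.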
Re: Strategy for re-indexing
Hi, I don't *think* there is any DIH request queuing going on - each is triggered by the DIH request. You need to queue them yourself if your app/data is such that running multiple imports/deltas causes problems with either hardware or data. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Allistair Crossley a...@roxxor.co.uk To: solr-user@lucene.apache.org Sent: Wed, October 6, 2010 10:49:49 AM Subject: Strategy for re-indexing Hi, I was interested in gaining some insight into how you guys schedule updates for your Solr index (I have a single index). Right now during development I have added deltaQuery specifications to the data import entities to control the number of rows being queried on re-indexes. However, in terms of *when* to reindex, we have a lot going on in the system - there are 4 sub-systems: custom application data, a CMS, a forum and a blog. It's all being indexed, and at any given time there will be users and administrators updating various parts of the sub-systems. For the time being during development I have been issuing reindexes to the data import handler on each CRUD operation on any given sub-system. This has been working fine, to be honest. It does need to be as immediate as possible - a scheduled update won't work for us; even every 10 minutes is probably not fast enough. So I wonder what others do. Is anyone else in a similar situation? And what happens if 4 users generate 4 different requests to the data import handler to update different types of data? The DIH will already be running, let's say for request 1, then request 2 comes in - is it rejected? Or is it queued? I need it to be queued and serviced, because the request 1 re-index may have already run its queries but missed the data added by the user for request 2. The same goes for requests 3 and 4. Thanks for your consideration, Allistair
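Client-side queuing, as Otis suggests, can be as simple as one worker thread draining a queue so imports never overlap; `trigger_import` below is a hypothetical stand-in for the actual HTTP call to the DIH endpoint:

```python
import queue
import threading

def trigger_import(source: str, log: list) -> None:
    # stand-in for an HTTP request like /dataimport?command=delta-import
    log.append(source)

requests: queue.Queue = queue.Queue()
processed: list = []

def worker() -> None:
    while True:
        source = requests.get()
        if source is None:          # sentinel: shut down
            break
        trigger_import(source, processed)  # strictly one import at a time
        requests.task_done()

t = threading.Thread(target=worker)
t.start()
for s in ["cms", "forum", "blog", "app"]:   # four users, four requests
    requests.put(s)
requests.put(None)
t.join()
print(processed)  # ['cms', 'forum', 'blog', 'app']
```

A request that arrives while an import runs simply waits in the queue instead of being rejected, which is the behavior Allistair asked for.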
Re: Experience running Solr on ISCSI
Thijs, The only thing I could find is this: http://search-lucene.com/m/VDjIlUc2Ci2/iscsi?subj=Lucene+on+NFS+iSCSI I don't have experience with transferring Solr/Lucene indexes to different hardware nodes without stopping and persisting things before the transfer. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Thijs vonk.th...@gmail.com To: solr-user@lucene.apache.org Sent: Wed, October 6, 2010 7:23:33 AM Subject: Experience running Solr on ISCSI [...]
StatsComponent and multi-valued fields
Running 1.4.1. I'm able to execute stats queries against multi-valued fields, but when given a facet, the StatsComponent only considers documents that have a facet value as the last value in the field. As an example, imagine you are running stats on fooCount and you want to facet on bar, which is multi-valued. Two documents:

1) fooCount = 100, bar = A, B, C
2) fooCount = 5, bar = C, B, A

stats.field=fooCount&stats=true&stats.facet=bar

I would expect to see stats for A, B, and C, all with sums of 105. But what I'm seeing is stats for C and A, with sums of 100 and 5 respectively. Is this expected behavior? Something I'm possibly doing wrong? Is this just not advisable? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/StatsComponent-and-multi-valued-fields-tp1644918p1644918.html Sent from the Solr - User mailing list archive at Nabble.com.
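The expected versus observed numbers can be reproduced with a toy roll-up (illustrative only; this is not Solr's StatsComponent code):

```python
docs = [
    {"fooCount": 100, "bar": ["A", "B", "C"]},
    {"fooCount": 5,   "bar": ["C", "B", "A"]},
]

# expected: every value of the multi-valued facet field counts the doc
expected = {}
for d in docs:
    for v in d["bar"]:
        expected[v] = expected.get(v, 0) + d["fooCount"]

# observed (per the report): only the last value in the field is credited
observed = {}
for d in docs:
    last = d["bar"][-1]
    observed[last] = observed.get(last, 0) + d["fooCount"]

print(expected)  # {'A': 105, 'B': 105, 'C': 105}
print(observed)  # {'C': 100, 'A': 5}
```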
script transformer vs. custom transformer
This might be a dumb question, but should I expect that a custom transformer written in Java will perform better than a JavaScript script transformer? Or does the JavaScript get compiled to bytecode, such that there really should not be much difference between the two? Of course, the bigger performance issue I'm dealing with is getting the data out of the SQL database as quickly as possible, but I'm curious about the performance implications of a script transform vs. the same transform done in Java. Thanks, Tim
Re: phrase query with autosuggest (SOLR-1316)
If you use Chantal's suggestion from an earlier thread, involving facets and tokenized fields, but not the tokens handling, I think it will work. (But that solution allows only one auto-suggest value per document.) There are a bunch of ways people have figured out to do auto-suggest without putting it in an entirely separate Solr core. They all have their strengths and weaknesses, including a weakness of being kind of confusing to implement sometimes. I don't think anyone has come up with a general-purpose, works-for-everything, isn't-confusing solution yet. Robert Petersen wrote: My simple but effective solution to that problem was to replace the white spaces in the items you index for autosuggest with some special character, then your wildcarding will work with the whole phrase as you desire. Index: mike_shaffer Query: mike_sha* [...]
Re: multi level faceting
Hi, there is a solution without the patch. It should be explained here: http://www.lucidimagination.com/blog/2010/08/11/stumped-with-solr-chris-hostetter-of-lucene-pmc-at-lucene-revolution/ If not, I will do so on 9.10.2010 ;-) Regards, Peter. I have a similar problem with a project I'm working on now. I am holding out for either SOLR-64 or SOLR-792 being a bit more mature before I need the functionality, but if not, I was thinking I could do multi-level faceting by indexing the data as a String like this:

id: 1  SHOE: Sneakers|Men|Size 7
id: 2  SHOE: Sneakers|Men|Size 8
id: 3  SHOE: Sneakers|Women|Size 6

etc., and then in the UI, show just up to the first delimiter (you'll have to sum the counts in the UI too). Once the user clicks on Sneakers, you would then add fq=SHOE:Sneakers|* to the query and then show the values up to the 2nd delimiter, etc. Alternatively, if you didn't want to use a wildcard query, you could index each level separately, like this:

id: 1  SHOE1: Sneakers  SHOE2: Sneakers|Men  SHOE3: Sneakers|Men|Size 7

Then after the user clicks on the 1st level, fq on SHOE1 and show SHOE2, etc. This wouldn't work so well if you had more than a few levels in your hierarchy. I haven't actually tried this, and like I said, I'm hoping I can just use a patch (really I hope 3.x gets released GA with the functionality, but I won't hold my breath...), but I do think this would work in a pinch if need be. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov] Sent: Tuesday, October 05, 2010 8:22 AM To: solr-user@lucene.apache.org Subject: RE: multi level faceting Just to clarify, the effect I was looking for was this:
Sneakers
  Men (22)
  Women (43)

AFTER a user filters by one of those, they would be presented with a NEW facet field, such as:

Sneakers
  Men
    Size 7 (10)
    Size 8 (11)
    Size 9 (23)

Vincent Vu Nguyen -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Monday, October 04, 2010 11:44 PM To: solr-user@lucene.apache.org Subject: Re: multi level faceting Hi, I *think* this is not what Vincent was after. If I read the suggestions correctly, you are saying to use fq=x&fq=y -- multiple fqs. But I think Vincent is wondering how to end up with something that will let him create a UI with multi-level facets (with a single request), e.g.

Footwear (100)
  Sneakers (20)
    Men (1)
    Women (19)
  Dancing shoes (10)
    Men (0)
    Women (10)
...

If this is what Vincent was after, I'd love to hear suggestions myself. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Jason Brown jason.br...@sjp.co.uk To: solr-user@lucene.apache.org Sent: Mon, October 4, 2010 11:34:56 AM Subject: RE: multi level faceting Yes, by adding fq back into the main query you will get results increasingly filtered each time. You may run into an issue if you are displaying facet counts, as the facet part of the query will also obey the increasingly filtered fq, and so not display counts for other categories anymore from the chosen facet (depends if you need to display counts from a facet once the first value from the facet has been chosen, if you get my drift). Local params are a way to deal with this by not subjecting the facet count to the same fq restriction (but allowing the search results to obey it). -Original Message- From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov] Sent: Mon 04/10/2010 16:34 To: solr-user@lucene.apache.org Subject: RE: multi level faceting Ok. Thanks for the quick response.
Vincent Vu Nguyen Division of Science Quality and Translation Office of the Associate Director for Science Centers for Disease Control and Prevention (CDC) 404-498-6154 Century Bldg 2400 Atlanta, GA 30329 -Original Message- From: Allistair Crossley [mailto:a...@roxxor.co.uk] Sent: Monday, October 04, 2010 9:40 AM To: solr-user@lucene.apache.org Subject: Re: multi level faceting I think that is just sending 2 fq facet queries through. In Solr PHP I would do that with, e.g.

$params['facet'] = true;
$params['facet.fields'] = array('Size');
$params['fq'] = array('sex' => array('Men', 'Women'));

but yes, I think you'd have to send through what the current facet query is and add it to your next drill-down. On Oct 4, 2010, at 9:36 AM, Nguyen, Vincent (CDC/OD/OADS) (CTR) wrote: Hi, I was wondering if there's a way to display facet options based on previous facet values. For example, I've seen many shopping sites where a user can
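James's pipe-delimiter scheme from earlier in this thread can be prototyped outside Solr; the count roll-up per level looks roughly like this (a sketch, with invented data):

```python
from collections import Counter

values = [
    "Sneakers|Men|Size 7",
    "Sneakers|Men|Size 8",
    "Sneakers|Women|Size 6",
]

def facet_at_level(vals, selected_prefix=""):
    """Return counts for the next hierarchy level under the current selection."""
    depth = selected_prefix.count("|") + 1 if selected_prefix else 1
    matching = [v for v in vals if v.startswith(selected_prefix)]
    return Counter("|".join(v.split("|")[:depth]) for v in matching)

print(facet_at_level(values))               # Counter({'Sneakers': 3})
print(facet_at_level(values, "Sneakers|"))  # next level under Sneakers
```

In Solr terms, `selected_prefix` corresponds to the `fq=SHOE:Sneakers|*` filter, and the truncation-plus-summing corresponds to the per-level collapsing James says must happen in the UI.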
RE: Experience with large merge factors
Hi Mike,

> Do you use multiple threads for indexing? Large RAM buffer size is also good, but I think perf peaks out maybe around 512 MB (at least based on past tests)?

We are using Solr; I'm not sure if Solr uses multiple threads for indexing. We have 30 producers, each sending documents to 1 of 12 Solr shards on a round-robin basis, so each shard will get multiple requests.

> Believe it or not, merging is typically compute bound. It's costly to decode and re-encode all the vInts.

Sounds like we need to do some monitoring during merging to see what the CPU use is, and also the I/O wait during large merges.

> A larger merge factor is good because it means the postings are copied fewer times, but it's bad because you could risk running out of descriptors, and, if the OS doesn't have enough RAM, you'll start to thin out the readahead that the OS can do (which makes the merge less efficient, since the disk heads are seeking more).

Is there a way to estimate the amount of RAM for the readahead? Once we start the re-indexing we will be running 12 shards on a 16-processor box with 144 GB of memory.

> Do you do any deleting?

Deletes would happen as a byproduct of updating a record. This shouldn't happen too frequently during re-indexing, but we update records when a document gets re-scanned and re-OCR'd. This would probably amount to a few thousand.

> Do you use stored fields and/or term vectors? If so, try to make your docs uniform if possible, i.e. add the same fields in the same order. This enables Lucene to use bulk byte-copy merging under the hood.

We use 4 or 5 stored fields. They are very small compared to our huge OCR field. Since we construct our Solr documents programmatically, I'm fairly certain that they are always in the same order. I'll have to look at the code when I get back to make sure. We aren't using term vectors now, but we plan to add them, as well as a number of fields based on MARC (cataloging) metadata, in the future.

Tom
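For what it's worth, the producer-to-shard round-robin Tom describes is easy to sketch (shard URLs invented for illustration):

```python
from itertools import cycle

# 12 shards; each producer pulls the next shard for every batch it sends
shards = [f"http://host:8983/solr/shard{i}" for i in range(12)]
assignment = cycle(shards)

first_batches = [next(assignment) for _ in range(24)]
# after a full cycle of 12 batches we are back at the same shard
print(first_batches[0] == first_batches[12])  # True
```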
Re: Experience with large merge factors
Hi Tom,

> > Do you use multiple threads for indexing? Large RAM buffer size is also good, but I think perf peaks out maybe around 512 MB (at least based on past tests)?
> We are using Solr; I'm not sure if Solr uses multiple threads for indexing. We have 30 producers, each sending documents to 1 of 12 Solr shards on a round-robin basis, so each shard will get multiple requests.

Solr itself doesn't use multiple threads for indexing, but you can easily do that on the client side. SolrJ's StreamingUpdateSolrServer is the simplest thing to use for this.

Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Burton-West, Tom tburt...@umich.edu To: solr-user@lucene.apache.org Sent: Wed, October 6, 2010 9:57:12 PM Subject: RE: Experience with large merge factors [...]