automatic delta imports?
Hello all

I'm seeing the following in my web server log file:

[2011-12-19 08:57:00.016] [customersIndex] webapp=/solr path=/dataimport params={command=delta-import&commit=true&optimize=true} status=0 QTime=3
[2011-12-19 08:57:00.018] Starting Delta Import
[2011-12-19 08:57:00.018] Read dataimport.properties
[2011-12-19 08:57:00.019] Starting delta collection.
[2011-12-19 08:57:00.019] Running ModifiedRowKey() for Entity: CUSTOMERS
[2011-12-19 08:57:00.019] [ordersIndex] webapp=/solr path=/dataimport params={command=delta-import&commit=true&optimize=true} status=0 QTime=1
[2011-12-19 08:57:00.023] Starting Delta Import
[2011-12-19 08:57:00.023] Creating a connection for entity CUSTOMERS with URL: jdbc:oracle:thin:@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = somehost)(PORT = 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = someservice)))
[2011-12-19 08:57:00.023] Read dataimport.properties
[2011-12-19 08:57:00.024] Starting delta collection.
[2011-12-19 08:57:00.024] Running ModifiedRowKey() for Entity: item2
[2011-12-19 08:57:00.025] Creating a connection for entity item2 with URL: jdbc:oracle:thin:@(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = somehost)(PORT = 1521))(CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME = someservice)))
[2011-12-19 08:57:01.005] Time taken for getConnection(): 982
[2011-12-19 08:57:01.010] Time taken for getConnection(): 985
[2011-12-19 08:57:01.216] Completed ModifiedRowKey for Entity: CUSTOMERS rows obtained : 0
[2011-12-19 08:57:01.217] Completed DeletedRowKey for Entity: CUSTOMERS rows obtained : 0
[2011-12-19 08:57:01.217] Completed parentDeltaQuery for Entity: CUSTOMERS
[2011-12-19 08:57:01.217] Delta Import completed successfully
[2011-12-19 08:57:01.217] {} 0 3
[2011-12-19 08:57:01.217] Time taken = 0:0:1.199
[2011-12-19 08:57:02.226] Completed ModifiedRowKey for Entity: item2 rows obtained : 0
[2011-12-19 08:57:02.227] Completed DeletedRowKey for Entity: item2 rows obtained : 0
[2011-12-19 08:57:02.227] Completed parentDeltaQuery for Entity: item2
[2011-12-19 08:57:02.227] Delta Import completed successfully
[2011-12-19 08:57:02.227] {} 0 1
[2011-12-19 08:57:02.227] Time taken = 0:0:2.204

It sure looks like an automatic delta import is running on my 2 indexes. But I thought the docs said Solr didn't provide for automatic delta imports? Is this a new feature in 3.2? I've got multiple 3.2 instances running and this is the only one whose logs show this message. Have I turned something on accidentally? If so, what config files contain these settings? I want to turn this on for the other Solr instances.

Mark
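[Editor's note: Solr itself does not schedule imports; the log entries above are ordinary HTTP hits on the /dataimport handler, so something external (typically a cron job or a wrapper app) is issuing them on that one instance. A minimal sketch of how such a trigger builds the exact request the log shows; the host and core name here are assumptions, not taken from the log:]

```python
from urllib.parse import urlencode

def delta_import_url(base="http://localhost:8983/solr/customersIndex"):
    # Build the same request the log shows hitting /dataimport.
    # base (host, port, core name) is a hypothetical example value.
    params = {"command": "delta-import", "commit": "true", "optimize": "true"}
    return base + "/dataimport?" + urlencode(params)

print(delta_import_url())
```

A cron entry fetching this URL every minute would produce exactly the once-a-minute log pattern shown above.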
case insensitive searches
Hello all

According to the docs, I need to use solr.LowerCaseTokenizerFactory. Does anyone have any experience with it? Can anyone comment on pitfalls or things to beware of? Does anyone know of any examples I can look at?

Thanks
Mark
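[Editor's note: one pitfall worth knowing up front is that LowerCaseTokenizerFactory splits the input at every non-letter character and lowercases what remains, so digits and punctuation are dropped from tokens entirely. A rough ASCII-only sketch of that behavior (the real tokenizer uses Unicode letter classes):]

```python
import re

def lowercase_tokenize(text):
    # Approximates solr.LowerCaseTokenizerFactory: divide the text at
    # non-letter characters, then lowercase each resulting token.
    # ASCII-only sketch; the real tokenizer recognizes Unicode letters.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(lowercase_tokenize("Case-Insensitive SEARCHES"))
print(lowercase_tokenize("ABC123def"))  # note: the digits vanish
```

If you need case-insensitive matching but want to keep digits, the usual alternative is a regular tokenizer followed by LowerCaseFilterFactory in the analyzer chain.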
Re: solr equivalent of select distinct
Erick

Thanks very much for the reply. I typed this late Friday after work and tried to simplify the problem description, and I got something wrong. Hopefully this restatement is better:

My PK is FLD1, FLD2 and FLD3 concatenated together. In some cases FLD1 and FLD2 can be the same, the ONLY differing field being FLD3. Here's an example:

PK   FLD1  FLD2  FLD3  FLD4  FLD5
AB0  A     B     0     x     y
AB1  A     B     1     x     y
CD0  C     D     0     a     b
CD1  C     D     1     e     f

I want to write a query using only the terms FLD1 and FLD2 and ONLY get back:

A  B  x  y
C  D  a  b
C  D  e  f

Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one occurrence of those records. Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH occurrences of those records.

I'm hoping I can use wildcards to get FLD4 and FLD5. If not, I can use fl=. I'm using edismax. We are also creating the query string on the fly. I suspect using SolrJ and plugging the values into a bean would be easier - or do I have that wrong?

I hope the tables of example data display properly.

Mark

On Sun, Sep 11, 2011 at 12:06 PM, Erick Erickson erickerick...@gmail.com wrote:

This smells like an XY problem; can you back up and give a higher-level reason *why* you want this behavior? Because given your problem description, this seems like you are getting correct behavior no matter how you define the problem. You're essentially saying that you have two records with identical beginnings of your PK; why is it incorrect to give you both records? But, anyway, if you're searching on FLD1 and FLD2, then by definition you're going to get both records back or the search would be failing!

Best
Erick

On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec mark.juszc...@gmail.com wrote:

Hello everyone

Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3 ... FLD100. PK is my solr primary key and I'm creating it by concatenating FLD1+FLD2+FLD3, and I'm guaranteed that combination will be unique.

Let's say 2 of these records have FLD1 = A and FLD2 = B.
I am unsure about the remaining fields.

Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both records. I only want 1. Research says I should use faceting. But this:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
fl=FLD1,FLD2
facet=true
facet_field=FLD1
facet_field=FLD2

gives me 2 records. In fact, it gives me the same results as:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
fl=FLD1,FLD2

I'm wrong somewhere, but I'm unsure where. Is faceting the right way to go or should I be using grouping? Curiously, when I use grouping like this:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
indent=true
fl=FLD1,FLD2
group=true
group.field=FLD1
group.field=FLD2

I get 2 records as well. Has anyone dealt with mimicking select distinct in Solr? Any advice would be very appreciated.

Mark
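[Editor's note: neither faceting nor grouping on FLD1/FLD2 will collapse rows that differ only in FLD3 while keeping rows whose FLD4/FLD5 differ. What the restated question asks for is a distinct over (FLD1, FLD2, FLD4, FLD5), which, absent server-side support, can be done client-side over the returned docs. A minimal sketch, assuming each result doc is a dict of field values:]

```python
def distinct(docs, fields=("FLD1", "FLD2", "FLD4", "FLD5")):
    # Keep the first doc for each unique combination of the listed fields,
    # mimicking SQL's SELECT DISTINCT over those columns.
    seen, out = set(), []
    for doc in docs:
        key = tuple(doc[f] for f in fields)
        if key not in seen:
            seen.add(key)
            out.append(doc)
    return out

docs = [
    {"PK": "AB0", "FLD1": "A", "FLD2": "B", "FLD4": "x", "FLD5": "y"},
    {"PK": "AB1", "FLD1": "A", "FLD2": "B", "FLD4": "x", "FLD5": "y"},
    {"PK": "CD0", "FLD1": "C", "FLD2": "D", "FLD4": "a", "FLD5": "b"},
    {"PK": "CD1", "FLD1": "C", "FLD2": "D", "FLD4": "e", "FLD5": "f"},
]
print([d["PK"] for d in distinct(docs)])  # AB1 collapses into AB0; both CD rows survive
```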
Re: searching for terms containing embedded spaces
Erick

My field contains "a b" (without the quotes). We are trying to assemble the query as a String by appending the various values. I think that is a large part of the problem, and our lives would be easier if we let the Solr API do this work.

We've experimented with our query assembler producing field:a+b. We've also tried making it create field:a\ b. The first case just does not work and I'm unsure why. The second case ends up URL-encoding the \ and I'm unsure if that will cause it to be used in the query or not.

Mark

On Sun, Sep 11, 2011 at 12:10 PM, Erick Erickson erickerick...@gmail.com wrote:

Try escaping it for a start. But why do you want to? If it's a phrase query, enclose it in double quotes. You really have to provide more details, because there are too many possibilities to answer. For instance:

If you're entering field:a b then 'b' will be searched against your default text field and you should enter field:(a b) or field:a field:b.
If you've tokenized the field, you shouldn't care.
If you're using KeywordAnalyzer, escaping should work.
Etc.

Best
Erick

On Fri, Sep 9, 2011 at 8:11 PM, Mark juszczec mark.juszc...@gmail.com wrote:

Hi folks

I've got a field that contains 2 words separated by a single blank. What's the trick to creating a search string that contains the single blank?

Mark
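[Editor's note: this thread is exactly what SolrJ's ClientUtils.escapeQueryChars is for - it backslash-escapes Lucene special characters and whitespace so a term like "a b" survives as a single term instead of being split at the space. A rough Python port of that idea (the exact character set is an approximation of the SolrJ method, not a verbatim copy):]

```python
def escape_query_chars(s):
    # Backslash-escape Lucene query-syntax characters and whitespace,
    # in the spirit of SolrJ's ClientUtils.escapeQueryChars, so that
    # "a b" becomes the single term a\ b rather than two clauses.
    special = set('\\+-!():^[]"{}~*?|&;/')
    return "".join("\\" + c if c in special or c.isspace() else c for c in s)

print(escape_query_chars("a b"))   # prints: a\ b
print(escape_query_chars("AT&T"))  # prints: AT\&T
```

The other route, as Erick notes, is simply a phrase query: field:"a b". Either way, letting the client library do the escaping beats hand-concatenating strings.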
Re: searching for terms containing embedded spaces
But as Erick says, it's not clear that's really what you want (to search on a single term with a space in it). If it's a normal text field, each word will be indexed separately, so you really want a phrase query or a boolean query: field:"a b" or field:(a b)

I am looking for a text string with a single, embedded space. For the purposes of this example, it is "a b" and it's stored in the index in a field called field. Am I incorrect in assuming the query field:"a b" will match the string "a" followed by a single embedded space followed by "b"?

I'm also wondering if this is already handled by the Solr/SolrJ API and if we are making our lives more difficult by assembling the query strings ourselves.

Mark

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference
Re: searching for terms containing embedded spaces
That's what I thought. The problem is, it's not and I am unsure what is wrong.

On Sun, Sep 11, 2011 at 1:35 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Sun, Sep 11, 2011 at 1:15 PM, Mark juszczec mark.juszc...@gmail.com wrote:

I am looking for a text string with a single, embedded space. For the purposes of this example, it is "a b" and it's stored in the index in a field called field. Am I incorrect in assuming the query field:"a b" will match the string "a" followed by a single embedded space followed by "b"?

Yes, that should work regardless of how the field is indexed (as a big single token, or as a normal text field that doesn't preserve spaces).

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference
Re: searching for terms containing embedded spaces
The field's properties are:

<field name="CUSTOMER_TYPE_NM" type="string" indexed="true" stored="true" required="true" default="CUSTOMER_TYPE_NM_MISSING"/>

There have been no changes since I last completely rebuilt the index. Is re-indexing done when an index is completely rebuilt with a dataimport=full? How about if we've done dataimport=delta?

If it helps, this is what I get when I print out the ModifiableSolrParams object I'm sending to the query method:

q=+*%3A*++AND+CUSTOMER_TYPE_NM%3ANetwork+Advertiser+AND+ACTIVE_IND%3A1&defType=edismax&rows=500&sort=ACCOUNT_CUSTOMER_ID+asc&start=0

Mark

On Sun, Sep 11, 2011 at 2:05 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Sun, Sep 11, 2011 at 1:39 PM, Mark juszczec mark.juszc...@gmail.com wrote:

That's what I thought. The problem is, it's not and I am unsure what is wrong.

What is the fieldType definition for that field? Did you change it without re-indexing?

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference
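[Editor's note: decoding the printed q parameter makes the bug in this thread visible. The space between "Network" and "Advertiser" is an ordinary, unescaped space, so the query parser treats Advertiser as a separate clause against the default field; it never becomes one term matching the string field. A sketch of the decoding:]

```python
from urllib.parse import unquote_plus

# The q value exactly as printed by ModifiableSolrParams above.
raw = "+*%3A*++AND+CUSTOMER_TYPE_NM%3ANetwork+Advertiser+AND+ACTIVE_IND%3A1"
decoded = unquote_plus(raw)
print(decoded)  # note the bare space inside "Network Advertiser"
```

To match the whole stored value on a string field, the space must survive parsing: either CUSTOMER_TYPE_NM:"Network Advertiser" (phrase) or CUSTOMER_TYPE_NM:Network\ Advertiser (escaped).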
solr equivalent of select distinct
Hello everyone

Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3 ... FLD100. PK is my solr primary key and I'm creating it by concatenating FLD1+FLD2+FLD3, and I'm guaranteed that combination will be unique.

Let's say 2 of these records have FLD1 = A and FLD2 = B. I am unsure about the remaining fields.

Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both records. I only want 1. Research says I should use faceting. But this:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
fl=FLD1,FLD2
facet=true
facet_field=FLD1
facet_field=FLD2

gives me 2 records. In fact, it gives me the same results as:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
fl=FLD1,FLD2

I'm wrong somewhere, but I'm unsure where. Is faceting the right way to go or should I be using grouping? Curiously, when I use grouping like this:

q=FLD1:A and FLD2:B
rows=500
defType=edismax
indent=true
fl=FLD1,FLD2
group=true
group.field=FLD1
group.field=FLD2

I get 2 records as well. Has anyone dealt with mimicking select distinct in Solr? Any advice would be very appreciated.

Mark
searching for terms containing embedded spaces
Hi folks

I've got a field that contains 2 words separated by a single blank. What's the trick to creating a search string that contains the single blank?

Mark
edismax, inconsistencies with implicit/explicit AND when used with explicit OR
Hello all

We've just switched from the default parser to the edismax parser, and a user has noticed some inconsistencies when using implicit/explicit ANDs, ORs and grouping search terms in parentheses. First, the default query operator is AND; I switched it from OR today.

The query:

customersJoin/select?indent=on&version=3.3&q=CUSTOMER_NM:*IBM*%20CUSTOMER_NM:*Software*%20OR%20CUSTOMER_NM:*something*&fq=&start=0&rows=10&fl=*%2Cscore&defType=edismax&wt=&explainOther=&hl.fl=

returns 1053 results. Some have only IBM in CUSTOMER_NM, some have only Software in the name, some have both. However, when I explicitly specify an AND between CUSTOMER_NM:*IBM* and CUSTOMER_NM:*Software*:

customersJoin/select?indent=on&version=3.3&q=CUSTOMER_NM:*IBM*%20AND%20CUSTOMER_NM:*Software*%20OR%20CUSTOMER_NM:*something*&fq=&start=0&rows=10&fl=*%2Cscore&defType=edismax&wt=&explainOther=&hl.fl=

I only get 3 results, and all of them contain both IBM and Software. I found this reference to inconsistencies with edismax, but I'm not sure it explains this situation 100%:

http://lucene.472066.n3.nabble.com/edismax-inconsistency-AND-OR-td2131795.html

Have I found a bug or am I doing something terribly wrong?

Mark
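[Editor's note: the standard first step in diagnosing this kind of discrepancy is to run both forms with debugQuery=on and compare the parsedquery sections of the two responses, which shows exactly where the implicit default operator and the explicit AND diverge. A sketch that builds the two debug requests; the host and core path are hypothetical:]

```python
from urllib.parse import urlencode

def debug_url(q, base="http://localhost:8983/solr/customersJoin/select"):
    # debugQuery=on makes the response include edismax's parsedquery;
    # diffing the parses of the two q strings pinpoints the inconsistency.
    params = {"q": q, "defType": "edismax", "debugQuery": "on", "rows": "0"}
    return base + "?" + urlencode(params)

implicit = debug_url("CUSTOMER_NM:*IBM* CUSTOMER_NM:*Software* OR CUSTOMER_NM:*something*")
explicit = debug_url("CUSTOMER_NM:*IBM* AND CUSTOMER_NM:*Software* OR CUSTOMER_NM:*something*")
print(implicit)
print(explicit)
```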
edismax configuration
Hello all

Can someone direct me to a link with config info in order to allow use of the edismax QueryHandler?

Mark
Re: edismax configuration
Got it. Thank you. I thought this was going to be much more difficult than it actually was.

Mark

On Mon, Aug 8, 2011 at 4:50 PM, Markus Jelsma markus.jel...@openindex.io wrote:

http://wiki.apache.org/solr/CommonQueryParameters#defType

Hello all

Can someone direct me to a link with config info in order to allow use of the edismax QueryHandler?

Mark
deleting index directory/files
Hello all

I'm using multiple cores. There's a directory named after the core, and it contains a subdir named data, which contains a subdir named index, which contains a bunch of files that hold the data for my index.

Let's say I want to completely rebuild the index from scratch. Can I delete the dir named index? I know the next thing I'd have to do is a full data import, and that's ok. I want to blow away any traces of the core's previous existence.

Mark
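[Editor's note: an alternative to deleting files on disk is to empty the core through the update handler with a delete-by-query of *:* followed by a commit, then run the full import. A sketch of the request (host and core name are hypothetical; Solr 3.x accepts the body via the stream.body parameter):]

```python
from urllib.parse import urlencode

def delete_all_url(base="http://localhost:8983/solr/mycore/update"):
    # Delete-by-query for every document, committing in the same request,
    # so the core is emptied without touching the index directory on disk.
    body = "<delete><query>*:*</query></delete>"
    return base + "?" + urlencode({"stream.body": body, "commit": "true"})

print(delete_all_url())
```

Deleting the index directory also works, but only while the core is stopped; this route works on a live core.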
field with repeated data in index
Hello all

I created an index consisting of orders and the names of the salesmen who are responsible for the orders. As you can imagine, the same name can be associated with many different orders. No problem. Until I try to do a faceted search on the salesman name field. Right now, I have the data indexed as follows:

<field name="PRIMARY_AC" type="string" indexed="false" stored="true" required="true" default="PRIMARY_AC unavailable"/>

My faceted search gives me the following response:

response={responseHeader={status=0,QTime=358,params={facet=on,indent=true,q=*:*,facet.field=PRIMARY_AC,wt=javabin,rows=0,version=2}},response={numFound=954178,start=0,docs=[]},facet_counts={facet_queries={},facet_fields={PRIMARY_AC={}},facet_dates={},facet_ranges={}}}

Which just isn't right. I KNOW there's data in there, but am confused as to how to properly identify it to Solr. Any suggestions?

Mark
Re: field with repeated data in index
James

Wow. That was fast. Thanks! But I thought you couldn't index a field that has duplicate values?

Mark

On Thu, Jul 28, 2011 at 4:53 PM, Dyer, James james.d...@ingrambook.com wrote:

You need to index the field you want to facet on.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 28, 2011 3:50 PM
To: solr-user@lucene.apache.org
Subject: field with repeated data in index

Hello all

I created an index consisting of orders and the names of the salesmen who are responsible for the orders. As you can imagine, the same name can be associated with many different orders. No problem. Until I try to do a faceted search on the salesman name field. Right now, I have the data indexed as follows:

<field name="PRIMARY_AC" type="string" indexed="false" stored="true" required="true" default="PRIMARY_AC unavailable"/>

My faceted search gives me the following response:

response={responseHeader={status=0,QTime=358,params={facet=on,indent=true,q=*:*,facet.field=PRIMARY_AC,wt=javabin,rows=0,version=2}},response={numFound=954178,start=0,docs=[]},facet_counts={facet_queries={},facet_fields={PRIMARY_AC={}},facet_dates={},facet_ranges={}}}

Which just isn't right. I KNOW there's data in there, but am confused as to how to properly identify it to Solr. Any suggestions?

Mark
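[Editor's note: James's point in config form. Faceting reads indexed terms (and duplicate values across documents are exactly what facet counts are for), so the fix is indexed="true" on the field; a sketch of the corrected definition, which requires a re-index afterwards:]

```xml
<!-- indexed="true" is the fix: facet.field counts the indexed terms,
     so an indexed="false" field always facets empty -->
<field name="PRIMARY_AC" type="string" indexed="true" stored="true"
       required="true" default="PRIMARY_AC unavailable"/>
```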
updating existing data in index vs inserting new data in index
Hello all

I'm using Solr 3.2 and am confused about updating existing data in an index. According to the DataImportHandler wiki:

delta-import: For incremental imports and change detection run the command `http://host:port/solr/dataimport?command=delta-import`. It supports the same clean, commit, optimize and debug parameters as the full-import command.

I know delta-import will find new data in the database and insert it into the index. My problem is how it handles updates, where I've got a record that exists in the index and the database, the database record is changed, and I want to incorporate those changes into the existing record in the index. IOW I don't want to insert it again.

I've tried this and wound up with 2 records with the same key in the index. The first contains the original db values found when the index was created; the 2nd contains the db values after the record was changed.

I've also found this:

http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html

The subject is 'Delta-import with solrj client':

  Greetings. I have a solrj client for fetching data from database. I am using delta-import for fetching data. If a column is changed in database using timestamp with delta-import I get the latest column indexed, but there are duplicate values in the index similar to the column but the data is older. This works with cleaning the index, but I want to update the index without cleaning it. Is there a way to just update the index with the updated column without having duplicate values? Appreciate any feedback. Hando

There are 2 responses:

  Short answer is no, there isn't a way. Solr doesn't have the concept of 'Update' to an indexed document. You need to add the full document (all 'columns') each time any one field changes. If doing that in your DataImportHandler logic is difficult you may need to write a separate Update Service that does:

  1) Read UniqueID, UpdatedColumn(s) from database
  2) Using UniqueID retrieve document from Solr
  3) Add/Update field(s) with updated column(s)
  4) Add document back to Solr

  Although, if you use DIH to do a full import, using the same query in your Delta-Import to get the whole document shouldn't be that difficult.

and

  Hi, Make sure you use a proper ID field, which does *not* change even if the content in the database changes. In this way, when your delta-import fetches changed rows to index, they will update the existing rows in your index.

I have an ID field that doesn't change. It is the primary key field from the database table I am trying to index and I have verified it is unique. So, does Solr allow updates (not inserts) of existing records? Is anyone able to do this?

Mark
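[Editor's note: the four-step "Update Service" quoted above is a plain read-modify-write loop. A minimal sketch with a dict standing in for both Solr and the database change feed (all names here are hypothetical); the key point is step 4 re-adds the *full* document, since Solr replaces rather than patches:]

```python
# In-memory stand-ins: one indexed document, and one changed DB row.
solr_index = {42: {"id": 42, "status": "new", "amount": 10}}
db_changes = [(42, {"status": "shipped"})]      # (UniqueID, updated columns)

def update_service(index, changes):
    for uid, cols in changes:                   # 1) read id + changed columns
        doc = dict(index[uid])                  # 2) retrieve current document
        doc.update(cols)                        # 3) apply the updated fields
        index[uid] = doc                        # 4) re-add the FULL document
    return index

update_service(solr_index, db_changes)
print(solr_index[42])
```

Against a real Solr you would replace steps 2 and 4 with a query by unique key and an add/commit; the untouched fields (amount, here) must ride along in the re-added document or they are lost.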
Re: updating existing data in index vs inserting new data in index
Bob

Thanks very much for the reply! I am using a unique integer called order_id as the Solr index key. My query, deltaQuery and deltaImportQuery are below:

<entity name="item1" pk="ORDER_ID"
        query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt,
               orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
               orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct,
               orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id
               from orders"
        deltaImportQuery="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt,
               orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
               orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct,
               orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id
               from orders where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
        deltaQuery="select orders.order_id from orders where orders.change_dt > to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
</entity>

The test I am running is two part:

1. After I do a full import of the index, I insert a brand new record (with a never-existed-before order_id) in the database. The delta import picks this up just fine.

2. After the full import, I modify a record with an order_id that already shows up in the index. I have verified there is only one record with this order_id in both the index and the db before I do the delta update.

I guess the question is, am I screwing myself up by defining my own Solr index key? I want to, ultimately, be able to search on ORDER_ID in the Solr index. However, the docs say (I think) a field does not have to be the Solr primary key in order to be searchable. Would I be better off letting Solr manage the keys?

Mark

On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote:

What are you using as the unique id in your Solr index? It sounds like you may have one value as your Solr index unique id, which bears no resemblance to a unique[1] id derived from your data... Or - another way to put it - what is it that makes these two records in your Solr index 'the same', and what are the unique id's for those two entries in the Solr index? How are those id's related to your original data?

[1] not only unique, but immutable. I.E. if you update a row in your database, the unique id derived from that row has to be the same as it would have been before the update. Otherwise, there's nothing for Solr to recognize as a duplicate entry, and do a 'delete' and 'insert' instead of just an 'insert'.

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

-----Original Message-----
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 07, 2011 9:15 AM
To: solr-user@lucene.apache.org
Subject: updating existing data in index vs inserting new data in index

Hello all

I'm using Solr 3.2 and am confused about updating existing data in an index. According to the DataImportHandler wiki:

delta-import: For incremental imports and change detection run the command `http://host:port/solr/dataimport?command=delta-import`. It supports the same clean, commit, optimize and debug parameters as the full-import command.

I know delta-import will find new data in the database and insert it into the index. My problem is how it handles updates, where I've got a record that exists in the index and the database, the database record is changed, and I want to incorporate those changes into the existing record in the index. IOW I don't want to insert it again.

I've tried this and wound up with 2 records with the same key in the index. The first contains the original db values found when the index was created; the 2nd contains the db values after the record was changed.

I've also found this:

http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html

The subject is 'Delta-import with solrj client':

Greetings. I have a solrj client for fetching data from database. I am using delta-import for fetching data. If a column is changed in database using timestamp with delta-import I get the latest column indexed, but there are duplicate values in the index similar to the column but the data is older. This works with cleaning the index, but I want to update the index without cleaning it. Is there a way to just update the index with the updated column without having duplicate values? Appreciate any feedback. Hando

There are 2 responses:

Short answer is no, there isn't a way. Solr doesn't have the concept of 'Update' to an indexed document. You need to add the full document (all 'columns') each time any one field changes. If doing
Re: updating existing data in index vs inserting new data in index
Bob

No, I don't. Let me look into that and post my results.

Mark

On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote:

Hi, Mark. I haven't used DIH myself - so I'll need to leave comments on your set up to others who have done so. Another question - after your initial index create (and after each delta), do you run a 'commit'? Do you run an 'optimize'? (Without the optimize, 'deleted' records still show up in query results...)

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

-----Original Message-----
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 07, 2011 10:04 AM
To: solr-user@lucene.apache.org
Subject: Re: updating existing data in index vs inserting new data in index

Bob

Thanks very much for the reply! I am using a unique integer called order_id as the Solr index key. My query, deltaQuery and deltaImportQuery are below:

<entity name="item1" pk="ORDER_ID"
        query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt,
               orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
               orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct,
               orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id
               from orders"
        deltaImportQuery="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt,
               orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
               orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct,
               orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id
               from orders where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
        deltaQuery="select orders.order_id from orders where orders.change_dt > to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
</entity>

The test I am running is two part:

1. After I do a full import of the index, I insert a brand new record (with a never-existed-before order_id) in the database. The delta import picks this up just fine.

2. After the full import, I modify a record with an order_id that already shows up in the index. I have verified there is only one record with this order_id in both the index and the db before I do the delta update.

I guess the question is, am I screwing myself up by defining my own Solr index key? I want to, ultimately, be able to search on ORDER_ID in the Solr index. However, the docs say (I think) a field does not have to be the Solr primary key in order to be searchable. Would I be better off letting Solr manage the keys?

Mark

On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote:

What are you using as the unique id in your Solr index? It sounds like you may have one value as your Solr index unique id, which bears no resemblance to a unique[1] id derived from your data... Or - another way to put it - what is it that makes these two records in your Solr index 'the same', and what are the unique id's for those two entries in the Solr index? How are those id's related to your original data?

[1] not only unique, but immutable. I.E. if you update a row in your database, the unique id derived from that row has to be the same as it would have been before the update. Otherwise, there's nothing for Solr to recognize as a duplicate entry, and do a 'delete' and 'insert' instead of just an 'insert'.

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

-----Original Message-----
From: Mark juszczec [mailto:mark.juszc...@gmail.com]
Sent: Thursday, July 07, 2011 9:15 AM
To: solr-user@lucene.apache.org
Subject: updating existing data in index vs inserting new data in index

Hello all

I'm using Solr 3.2 and am confused about updating existing data in an index. According to the DataImportHandler wiki:

delta-import: For incremental imports and change detection run the command `http://host:port/solr/dataimport?command=delta-import`. It supports the same clean, commit, optimize and debug parameters as the full-import command.

I know delta-import will find new data in the database and insert it into the index. My problem is how it handles updates, where I've got a record that exists in the index and the database, the database record is changed, and I want to incorporate those changes into the existing record in the index. IOW I don't want to insert it again.

I've tried this and wound up with 2 records with the same key in the index. The first contains the original db values found when the index was created
Re: updating existing data in index vs inserting new data in index
Ok. That's really good to know, because optimization of that kind will be important. What of commit? Does it somehow remove the previous version of an updated record?

On Thu, Jul 7, 2011 at 10:49 AM, Michael Kuhlmann s...@kuli.org wrote:

On 07.07.2011 16:14, Bob Sandiford wrote:
[...] (Without the optimize, 'deleted' records still show up in query results...)

No, that's not true. The terms remain in the index, but the document won't show up any more. Optimize is only for performance (and disk space) optimization, as the name suggests.

-Kuli
Re: updating existing data in index vs inserting new data in index
Erick I used to, but now I find I must have commented it out in a fit of rage ;-) This could be the whole problem. I have verified via admin schema browser that the field is ORDER_ID and will double check I refer to it in upper case in the appropriate places in the Solr config scheme. Curiously, the admin schema browser display for ORDER_ID says hasDeletions: false - which seems the opposite of what I want. I want to be able to delete duplicates. Or am I interpreting this field wrong? In order to check for duplicates, I am going to using the admin browser to enter the following in the Make A Query box: TABLE_ID:1 AND ORDER_ID:674659 When I click search and view the results, 2 records are displayed. One has the original values, one has the changed values. I haven't examined the xml (via view source) too closely and the next time I run I will look for something indicating one of the records is inactive. When you say change your schema do you mean via a delta import or by modifying the config files or both? FWIW, I am deleting the index on the file system, doing a full import, modifying the data in the database and then doing a delta import. I am not restarting Solr at all in this process. I understand Solr does not perform key management. You described exactly what I meant. Sorry for any confusion. Mark On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson erickerick...@gmail.comwrote: Let me re-state a few things to see if I've got it right: your schema.xml file has an entry like uniqueKeyorder_id/uniqueKey, right? given this definition, any document added with an order_id that already exists in the Solr index will be replaced. i.e. you should have one and only one document with a given order_id. case matters. Check via the admin page (schema browser) to see if you have two fields, order_id an ORDER_ID. How are you checking that your docs are duplicates? If you do a search on order_id, you should get back one and only one document (assuming the definition above). 
A document that's deleted will just be marked as deleted, the data won't be purged from the index. It won't show in search results, but it will show if you use lower-level ways to access the data. Whenever you change your schema, it's best to clean the index, restart the server and re-index from scratch. Solr won't retroactively remove duplicate uniqueKey entries. On the stats admin/stats page you should see maxDocs and numDocs. The difference between these should be the number of deleted documents. Solr doesn't manage unique keys. All that happens is Solr will replace any pre-existing documents where *you've* defined the uniqueKey when a new doc is added... Hope this helps Erick On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec mark.juszc...@gmail.com wrote: Bob No, I don't. Let me look into that and post my results. Mark On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote: Hi, Mark. I haven't used DIH myself - so I'll need to leave comments on your set up to others who have done so. Another question - after your initial index create (and after each delta), do you run a 'commit'? Do you run an 'optimize'? (Without the optimize, 'deleted' records still show up in query results...) Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com -Original Message- From: Mark juszczec [mailto:mark.juszc...@gmail.com] Sent: Thursday, July 07, 2011 10:04 AM To: solr-user@lucene.apache.org Subject: Re: updating existing data in index vs inserting new data in index Bob Thanks very much for the reply! I am using a unique integer called order_id as the Solr index key. 
My query, deltaQuery and deltaImportQuery are below:

<entity name="item1" pk="ORDER_ID"
        query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt, orders.cancel_dt, orders.account_manager_id, orders.of_header_id, orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id from orders"
        deltaImportQuery="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind, orders.order_dt, orders.cancel_dt, orders.account_manager_id, orders.of_header_id, orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id from orders where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
        deltaQuery="select orders.order_id from orders where orders.change_dt > to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"/>

The test I am running is two part: 1. After I do a full import of the index, I
Re: updating existing data in index vs inserting new data in index
First, thanks for all the help. I think the problem was a combination of not having a unique key defined AND not including the commit=true parameter in the delta update. Once I did those things, the delta import left me with a single (updated) copy of the record including the changes in the source database. Do I have write access to the Wiki so I can explicitly state commit=true NEEDS to be specified? Mark

On Thu, Jul 7, 2011 at 12:39 PM, Erick Erickson erickerick...@gmail.com wrote: I'd restart Solr after changing the schema.xml. The delta import does NOT require a restart or anything else like that. The fact that two records are displayed is not what I'd expect, but Solr absolutely handles the replace via uniqueKey. So I suspect that you're not actually doing what you expect. A little-known aid for debugging DIH is solr/admin/dataimport.jsp; that might give you some joy. But, to summarize: this should work fine for DIH as far as Solr is concerned, assuming that uniqueKey is properly defined. In your query above that returns two documents, can you paste the entire response with fl=* attached? I'm guessing that the data in your index isn't what you're expecting... Also, you might want to get a copy of Luke and examine your index; there's a wealth of information. Best Erick

On Thu, Jul 7, 2011 at 11:12 AM, Mark juszczec mark.juszc...@gmail.com wrote: Erick I used to, but now I find I must have commented it out in a fit of rage ;-) This could be the whole problem. I have verified via the admin schema browser that the field is ORDER_ID and will double-check that I refer to it in upper case in the appropriate places in the Solr config files. Curiously, the admin schema browser display for ORDER_ID says hasDeletions: false - which seems the opposite of what I want. I want to be able to delete duplicates. Or am I interpreting this field wrong?
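For anyone finding this thread in the archives: the commit is requested as a parameter on the import command itself, as in the request below. Host, port and core path are placeholders; adjust to your deployment. This mirrors the delta-import requests visible in the server log at the top of the thread:

```
http://localhost:8983/solr/dataimport?command=delta-import&commit=true&optimize=true
```

Without commit=true the delta-imported documents may not become visible to searches, which is consistent with Mark seeing both the old and new copies until he added it.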
primary key made of multiple fields from multiple source tables
Hello all I'm using Solr 3.2 and am trying to index a document whose primary key is built from multiple columns selected from an Oracle DB. I'm getting the following error:

java.lang.IllegalArgumentException: deltaQuery has no column to resolve to declared primary key pk='ordersorderline_id'
    at org.apache.solr.handler.dataimport.DocBuilder.findMatchingPkColumn(DocBuilder.java:840) ~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:891) ~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:284) ~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:178) ~[apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:374) [apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:413) [apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392) [apache-solr-dataimporthandler-3.2.0.jar:3.2.0 1129474 - rmuir - 2011-05-30 23:09:08]

The deltaQuery is:

select orders.order_id || orders.order_booked_ind || order_line.order_line_id as ordersorderline_id,
       orders.order_id, orders.order_booked_ind, order_line.order_line_id,
       orders.order_dt, orders.cancel_dt, orders.account_manager_id, orders.of_header_id,
       orders.order_status_lov_id, orders.order_type_id, orders.approved_discount_pct,
       orders.campaign_nm, orders.approved_by_cd, orders.advertiser_id, orders.agency_id,
       order_line.accounting_comments_desc
from orders, order_line
where order_line.order_id =
orders.order_id and order_line.order_booked_ind = orders.order_booked_ind

I've just seen in the Solr Wiki Task List at http://wiki.apache.org/solr/TaskList?highlight=%28composite%29 that a Big Idea for The Future is: support for *composite* keys ... either with some explicit change to the uniqueKey declaration or perhaps just copyField with some hidden magic that concats the resulting terms into a single key Term. Does this prohibit my creating the key with the select as above? Mark
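Composite keys are not supported natively, but a commonly used workaround (a sketch, not this thread's resolution; the field name, delimiter and column choices below are illustrative assumptions) is to build a single synthetic key column in SQL and declare that one field as the uniqueKey:

```xml
<!-- schema.xml: one string field serves as the synthetic composite key
     (names are illustrative). -->
<field name="composite_id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>composite_id</uniqueKey>
```

On the SQL side the key would be assembled with a delimiter between the parts, e.g. `orders.order_id || '_' || order_line.order_line_id as composite_id`; the delimiter avoids ambiguous concatenations (order_id 12 + line 3 vs order_id 1 + line 23 would otherwise both yield "123"). The error above suggests the alias in the deltaQuery must resolve to the declared pk, so the same aliased expression would need to appear in the deltaQuery as well.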