RE: Missing Records
Another round of tests this morning. Ten rounds of imports, all done on the non-leader node, gave these document counts:

902294 900089 899267 898127 901945 901055 899638 899392 899880 901812

The expected number of records is 903,990. I am getting this error:

org.apache.solr.common.SolrException: Bad Request
request: http://192.168.20.51:8983/solr/Inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.51%3A7574%2Fsolr%2FInventory_shard1_replica2%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

And I am getting a warning with the identical request URL and stack trace. These are both from the admin logging section. I have retained the log files if it would help.

AJ

-----Original Message-----
From: AJ Lemke [mailto:aj.le...@securitylabs.com]
Sent: Monday, November 3, 2014 5:31 PM
To: solr-user@lucene.apache.org
Subject: RE: Missing Records
[...]
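The run-to-run shortfall in the counts above can be summarized with a short script. This is just a sketch for illustration; the ten counts and the 903,990 expected figure are copied from the test output above:

```python
# Document counts from ten import runs against the non-leader node.
counts = [902294, 900089, 899267, 898127, 901945, 901055,
          899638, 899392, 899880, 901812]
EXPECTED = 903990  # documents the DIH reports as processed

# How many documents went missing in each run.
shortfalls = [EXPECTED - c for c in counts]
print("missing per run:", shortfalls)
print("min/max missing:", min(shortfalls), max(shortfalls))
```

Every run falls short by somewhere between roughly 1,700 and 5,900 documents, and the amount varies run to run, which points at intermittent drops rather than a fixed set of bad records.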
RE: Missing Records
So I jumped back on this. I have not been using the optimize option on this new set of tests.

If I run the full index on the leader, I seem to get all of the items in the database minus 3 that have a missing field:

Indexing completed. Added/Updated: 903,990 documents. Deleted 0 documents. (Duration: 25m 11s)
Requests: 1 (0/s), Fetched: 903,993 (598/s), Skipped: 0, Processed: 903,990

Last Modified: 2 minutes ago
Num Docs: 903990
Max Doc: 903990
Heap Memory Usage: 2625744
Deleted Docs: 0
Version: 3249
Segment Count: 7
Optimized:
Current:

If I run it on the other node I get:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. (Duration: 27m 08s)
Requests: 1 (0/s), Fetched: 903,993 (555/s), Skipped: 0, Processed: 903,993 (555/s)

Last Modified: about a minute ago
Num Docs: 897791
Max Doc: 897791
Heap Memory Usage: 2621072
Deleted Docs: 0
Version: 3285
Segment Count: 7
Optimized:
Current:

Any ideas? If there is any more info that is needed, let me know.

AJ

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, October 31, 2014 1:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records
[...]
RE: Missing Records
I started this collection using this command:

http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4

So 1 shard and a replicationFactor of 2.

AJ

-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com]
Sent: Thursday, October 30, 2014 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

I am curious: how many shards do you have, and what's the replication factor you are using?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:
[...]
RE: Missing Records
Hi Erick:

All of the records are coming out of an auto-numbered field, so the IDs will all be unique.

Here is the test I ran this morning:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. (Duration: 28m)
Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 (538/s)

Started: 33 minutes ago
Last Modified: 4 minutes ago
Num Docs: 903829
Max Doc: 903829
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1517
Segment Count: 16
Optimized: checked
Current: checked

If there were duplicates, only one of the duplicates should be removed, and I should still be able to search for the ID and find one, correct? As it is right now I am missing records that should be in the collection.

I also noticed this:

org.apache.solr.common.SolrException: Bad Request
request: http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

AJ

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, October 30, 2014 7:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

First question: Is there any possibility that some of the docs have duplicate IDs (uniqueKeys)? If so, then some of the docs will be replaced, which will lower your returns. One way to figure this out is to go to the admin screen: if numDocs < maxDoc, then documents have been replaced. Also, if numDocs is smaller than 903,993, then you probably have some docs being replaced. One warning, however: even if docs are deleted, this could still be the case, because when segments are merged the deleted docs are purged.

Best,
Erick

On Thu, Oct 30, 2014 at 3:12 PM, S.L simpleliving...@gmail.com wrote:
[...]
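Erick's duplicate-uniqueKey question in the exchange above can be checked on the client side before indexing. A minimal sketch, assuming the documents are available as dicts and the uniqueKey field is named `id` (both hypothetical names for illustration):

```python
from collections import Counter

def find_duplicate_ids(docs, key="id"):
    """Return uniqueKey values that occur more than once in a batch.

    Duplicates are silently replaced by Solr on add, which would make
    numDocs smaller than the number of documents sent.
    """
    counts = Counter(d[key] for d in docs)
    return {k: n for k, n in counts.items() if n > 1}

# Tiny illustrative batch: id 2 appears twice.
batch = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
print(find_duplicate_ids(batch))  # {2: 2}
```

An empty result here, as AJ reports for his auto-numbered IDs, rules out replacement as the cause of the missing documents.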
Re: Missing Records
OK, that is puzzling.

bq: If there were duplicates only one of the duplicates should be removed and I still should be able to search for the ID and find one correct?

Correct. Your bad request error is puzzling; you may be on to something there. What it looks like is that somehow some of the documents you're sending to Solr aren't getting indexed: either they're being dropped by the network, or perhaps they have invalid fields or field formats (i.e. a date in the wrong format, whatever) or some such.

When you complete the run, what are the maxDoc and numDocs numbers on one of the nodes? What else do you see in the logs? They're pretty big after that many adds, but maybe you can grep for ERROR and see something interesting, like stack traces. Or even "org.apache.solr". This latter will give you some false hits, but at least it's better than paging through a huge log file.

Personally, in this kind of situation I sometimes use SolrJ to do my indexing rather than DIH; I find it easier to debug, so that's another possibility. In the worst case with SolrJ, you can send the docs one at a time.

Best,
Erick

On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke aj.le...@securitylabs.com wrote:
[...]
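Erick's grep-for-ERROR suggestion can be approximated in a few lines when grep isn't handy. A sketch for illustration; the sample lines mirror the Solr log entries quoted in this thread, and the needle strings are the ones Erick suggests:

```python
def find_error_lines(lines, needles=("ERROR", "org.apache.solr.common.SolrException")):
    """Return (1-based line number, line) pairs matching any needle."""
    return [(i, line) for i, line in enumerate(lines, 1)
            if any(n in line for n in needles)]

log = [
    "INFO  - 2014-10-31 10:47:11.000; normal operation",
    "ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; error",
    "org.apache.solr.common.SolrException: Bad Request",
]
for lineno, line in find_error_lines(log):
    print(lineno, line)
```

The shell equivalent is simply `grep -n -e ERROR -e org.apache.solr.common.SolrException solr.log`.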
RE: Missing Records
I have run some more tests, so the numbers have changed a bit.

Index results, done on Node 1:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. (Duration: 31m 47s)
Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993

Node 1:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Node 2:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Search results are the same as the doc numbers above.

Logs only have one instance of an error:

ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; error
org.apache.solr.common.SolrException: Bad Request
request: http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Some info that may be of help: this is on my local VM, using Jetty with the embedded ZooKeeper.

Commands to start the cloud:

java -DzkRun -jar start.jar
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar
sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir ~/development/configs/inventory/ -confname config_inventory
sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory -confname config_inventory
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4"
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=inventory"

AJ

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, October 31, 2014 9:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records
[...]
Re: Missing Records
Sorry to say this, but I don't think the numDocs/maxDoc numbers are telling you anything, because it looks like you've optimized, which purges any data associated with deleted docs, including the internal IDs that the numDocs/maxDoc figures are based on. So if there were deletions, we can't see any evidence of them. Sigh.

On Fri, Oct 31, 2014 at 9:56 AM, AJ Lemke aj.le...@securitylabs.com wrote:
[...]
Missing Records
Hi All,

We have a SOLR cloud instance that has been humming along nicely for months. Last week we started experiencing missing records.

Admin DIH Example: Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)

A *:* search claims that there are only 903,902; this is the first full index. Subsequent full indexes give the following counts for the *:* search: 903,805, 903,665, 826,357. All the while the admin returns Fetched: 903,993 (x/s), Skipped: 0, Processed: 903,993 (x/s) every time. ---records per second is variable

I found an item that should be in the index but is not found in a search. Here are the referenced lines of the log file.

DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE add{,id=750041421} {{params(debug=false&optimize=true&indent=true&commit=true&clean=true&wt=json&command=full-import&entity=ads&verbose=false),defaults(config=data-config.xml)}}
DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.SolrCmdDistributor; sending update to http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 add{,id=750041421} params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F
--- there are 746 lines of log between entries ---
DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire; [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine 
[0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162 Long Track [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis Bad boy will pull you through the deepest snow!With the 162 track and 1000cc of power you can fly up any hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission [0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified Auto, Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified Auto, Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit highmark[\n] What could be the issue and how does one fix this issue? Thanks so much and if more information is needed I have preserved the log files. AJ
Re: Missing Records
I am curious, how many shards do you have and what's the replication factor you are using?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:
Hi All, We have a SOLR cloud instance that has been humming along nicely for months. Last week we started experiencing missing records. [quoted log excerpts trimmed; see the original message above] What could be the issue and how does one fix this issue? Thanks so much and if more information is needed I have preserved the log files. AJ
Re: Missing Records
First question: Is there any possibility that some of the docs have duplicate IDs (uniqueKeys)? If so, then some of the docs will be replaced, which will lower your returns. One way of figuring this out is to go to the admin screen: if numDocs < maxDoc, then documents have been replaced. Also, if numDocs is smaller than 903,993, then you probably have some docs being replaced. One warning, however: even if docs were deleted, this could still be the case, because when segments are merged the deleted docs are purged.

Best,
Erick

On Thu, Oct 30, 2014 at 3:12 PM, S.L simpleliving...@gmail.com wrote:
I am curious, how many shards do you have and what's the replication factor you are using?
On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:
Hi All, We have a SOLR cloud instance that has been humming along nicely for months. Last week we started experiencing missing records. [quoted log excerpts trimmed; see the original message above] What could be the issue and how does one fix this issue? Thanks so much and if more information is needed I have preserved the log files. AJ
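Erick's numDocs/maxDoc heuristic can be illustrated with a toy model of the bookkeeping (this is not Solr code, just the idea): re-adding a document with an existing uniqueKey marks the old copy deleted, so maxDoc keeps counting it until a merge purges the deleted slot, while numDocs only ever counts live docs.

```python
class ToyIndex:
    """Toy model of Lucene's numDocs vs maxDoc bookkeeping (illustration only)."""

    def __init__(self):
        self.live = {}     # uniqueKey -> doc (live documents)
        self.max_doc = 0   # every add consumes an internal slot

    def add(self, key, doc):
        self.live[key] = doc   # replaces any earlier doc with the same key
        self.max_doc += 1      # the old copy lingers as a deleted slot

    def merge(self):
        self.max_doc = len(self.live)  # merging purges deleted slots

    @property
    def num_docs(self):
        return len(self.live)

idx = ToyIndex()
for key in [1, 2, 3, 2]:   # id 2 is sent twice, i.e. a duplicate uniqueKey
    idx.add(key, {"id": key})

dup_visible = idx.num_docs < idx.max_doc    # True: 3 live docs, 4 slots
idx.merge()
dup_hidden = idx.num_docs == idx.max_doc    # True: the evidence is gone
```

This is also Erick's later warning in the thread: after an optimize (a forced merge), numDocs == maxDoc even though a replacement happened, so the numbers no longer tell you anything about deletions.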
Re: Delta Import occasionally missing records.
The SolrEntityProcessor would be a top-level entity. You would do a query like this: sort=timestamp desc&rows=1&fl=timestamp. This gives you one data item: the timestamp of the last item added to the index. With this, the JDBC sub-entity would create a query that chooses all rows with a timestamp >= this latest timestamp. It will not be easy to put this together, but it is possible :) Good luck!

Lance

On Mon, Jan 24, 2011 at 2:04 AM, btucker btuc...@mintel.com wrote:
Thank you for your response. In what way is 'timestamp' not perfect? I've looked into the SolrEntityProcessor and added a timestamp field to our index. However I'm struggling to work out a query to get the max value of the timestamp field, and does the SolrEntityProcessor entity appear before the root entity or does it wrap around the root entity?
On 22 January 2011 07:24, Lance Norskog-2 [via Lucene] ml-node+2307215-627680969-326...@n3.nabble.com wrote:
The timestamp thing is not perfect. You can instead do a search against Solr and find the latest timestamp in the index. SOLR-1499 allows you to search against Solr in the DataImportHandler.
On Fri, Jan 21, 2011 at 2:27 AM, btucker [hidden email] wrote:
Hello, we've just started using solr to provide search functionality for our application, with the DataImportHandler performing a delta-import every 1 minute fired by crontab, which works great; however it does occasionally miss records that are added to the database while the delta-import is running.
Our data-config.xml has the following queries in its root entity:

query="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 ORDER BY id DESC"
preImportDeleteQuery="SELECT item_id AS Id FROM gnpd_production.item_deletions"
deletedPkQuery="SELECT item_id AS id FROM gnpd_production.item_deletions WHERE deletion_date >= SUBDATE('${dataimporter.last_index_time}', INTERVAL 5 MINUTE)"
deltaImportQuery="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 AND id=${dataimporter.delta.id} ORDER BY id DESC"
deltaQuery="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 AND sys_time_stamp >= SUBDATE('${dataimporter.last_index_time}', INTERVAL 1 MINUTE) ORDER BY id DESC"

I think the problem I'm having comes from the way Solr stores the last_index_time in conf/dataimport.properties, as stated on the wiki: "When delta-import command is executed, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties." Which to me seems to indicate that any records with a time-stamp between when the dataimport starts and ends will be missed, as the last_index_time is set to when it completes the import. This doesn't seem quite right to me. I would have expected the last_index_time to refer to when the dataimport was last STARTED so that there were no gaps in the timestamps covered. I changed the deltaQuery of our config to include the SUBDATE by INTERVAL 1 MINUTE statement to alleviate this problem, but it only covers times when the delta-import takes less than a minute. Any ideas as to how this can be overcome, other than increasing the INTERVAL to something larger?
Regards,
Barry Tucker

--
View this message in context: http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2300877.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Lance Norskog
[hidden email]
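Lance's two-step approach — ask Solr itself for the newest indexed timestamp, then have the JDBC sub-entity fetch every row at or after it — can be sketched in plain code. The functions below are hypothetical stand-ins for the SolrEntityProcessor query and the JDBC sub-entity query; the row data is made up:

```python
from datetime import datetime, timedelta

def latest_indexed_timestamp(index_rows):
    """Stand-in for the SolrEntityProcessor step:
    q=*:*&sort=timestamp desc&rows=1&fl=timestamp."""
    return max(row["timestamp"] for row in index_rows)

def rows_to_import(db_rows, since):
    """Stand-in for the JDBC sub-entity:
    SELECT ... FROM Item WHERE sys_time_stamp >= :since."""
    return [row for row in db_rows if row["timestamp"] >= since]

t0 = datetime(2011, 1, 21, 12, 0, 0)
index = [{"id": 1, "timestamp": t0},
         {"id": 2, "timestamp": t0 + timedelta(minutes=5)}]
db = [{"id": 2, "timestamp": t0 + timedelta(minutes=5)},   # boundary row, already indexed
      {"id": 3, "timestamp": t0 + timedelta(minutes=9)}]   # new row, must be picked up

since = latest_indexed_timestamp(index)
delta = rows_to_import(db, since)   # picks up both id 2 and id 3
```

Using >= means the boundary row (id 2) is re-imported, which is harmless: the uniqueKey makes the re-add a replace, and nothing added after the last indexed document can slip through the window.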
Re: Delta Import occasionally missing records.
Thank you for your response. In what way is 'timestamp' not perfect?

I've looked into the SolrEntityProcessor and added a timestamp field to our index. However I'm struggling to work out a query to get the max value of the timestamp field, and does the SolrEntityProcessor entity appear before the root entity or does it wrap around the root entity?

On 22 January 2011 07:24, Lance Norskog-2 [via Lucene] ml-node+2307215-627680969-326...@n3.nabble.com wrote:
The timestamp thing is not perfect. You can instead do a search against Solr and find the latest timestamp in the index. SOLR-1499 allows you to search against Solr in the DataImportHandler.
On Fri, Jan 21, 2011 at 2:27 AM, btucker [hidden email] wrote:
[quoted original message trimmed; see below]
--
View this message in context: http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2318572.html
Sent from the Solr - User mailing list archive at Nabble.com.
Delta Import occasionally missing records.
Hello,

We've just started using solr to provide search functionality for our application, with the DataImportHandler performing a delta-import every 1 minute fired by crontab, which works great; however it does occasionally miss records that are added to the database while the delta-import is running.

Our data-config.xml has the following queries in its root entity:

query="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 ORDER BY id DESC"
preImportDeleteQuery="SELECT item_id AS Id FROM gnpd_production.item_deletions"
deletedPkQuery="SELECT item_id AS id FROM gnpd_production.item_deletions WHERE deletion_date >= SUBDATE('${dataimporter.last_index_time}', INTERVAL 5 MINUTE)"
deltaImportQuery="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 AND id=${dataimporter.delta.id} ORDER BY id DESC"
deltaQuery="SELECT id, date_published, date_created, publish_flag FROM Item WHERE id > 0 AND record_type_id=0 AND sys_time_stamp >= SUBDATE('${dataimporter.last_index_time}', INTERVAL 1 MINUTE) ORDER BY id DESC"

I think the problem I'm having comes from the way Solr stores the last_index_time in conf/dataimport.properties, as stated on the wiki: "When delta-import command is executed, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties." Which to me seems to indicate that any records with a time-stamp between when the dataimport starts and ends will be missed, as the last_index_time is set to when it completes the import. This doesn't seem quite right to me. I would have expected the last_index_time to refer to when the dataimport was last STARTED so that there were no gaps in the timestamps covered.
I changed the deltaQuery of our config to include the SUBDATE by INTERVAL 1 MINUTE statement to alleviate this problem, but it only covers times when the delta-import takes less than a minute. Any ideas as to how this can be overcome, other than increasing the INTERVAL to something larger?

Regards,
Barry Tucker

--
View this message in context: http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2300877.html
Sent from the Solr - User mailing list archive at Nabble.com.
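The gap Barry describes can be shown with a small simulation. Under his reading of the wiki — that the stored last_index_time is the *completion* time of the previous import — any row committed to the database while that import was running falls between the two delta windows. The times and rows below are made up for illustration:

```python
from datetime import datetime, timedelta

def delta_rows(db_rows, last_index_time):
    """Rows the next delta-import will pick up (no SUBDATE fudge applied)."""
    return [r for r in db_rows if r["ts"] >= last_index_time]

start = datetime(2011, 1, 21, 10, 0)     # a delta-import starts...
finish = start + timedelta(minutes=2)    # ...and finishes two minutes later

# A row inserted into the database while that import was running:
db = [{"id": 42, "ts": start + timedelta(minutes=1)}]

# If the stored last_index_time is the finish time, the next delta misses it:
missed = delta_rows(db, finish)          # empty

# Recording the start time instead would catch the row:
caught = delta_rows(db, start)
```

The SUBDATE('...', INTERVAL 1 MINUTE) workaround in the config just shifts last_index_time back by one minute, which closes the gap only when the import itself takes under a minute — exactly the limitation Barry notes.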
Re: Delta Import occasionally missing records.
The timestamp thing is not perfect. You can instead do a search against Solr and find the latest timestamp in the index. SOLR-1499 allows you to search against Solr in the DataImportHandler.

On Fri, Jan 21, 2011 at 2:27 AM, btucker btuc...@mintel.com wrote:
[quoted original message trimmed; see above]

--
Lance Norskog
goks...@gmail.com