RE: Missing Records

2014-11-04 Thread AJ Lemke
Another round of tests this morning.

Ten rounds of imports all done on the non-leader node:
902294
900089
899267
898127
901945
901055
899638
899392
899880
901812

The expected number of records is 903,990.
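For reference, the shortfall of each run against the expected 903,990 can be computed directly; a quick sketch using the counts above:

```python
expected = 903990

# Doc counts returned by the ten import runs listed above.
runs = [902294, 900089, 899267, 898127, 901945,
        901055, 899638, 899392, 899880, 901812]

# Shortfall per run: every run comes up short, by a varying amount.
shortfalls = [expected - n for n in runs]
print(shortfalls)
print("min/max missing:", min(shortfalls), max(shortfalls))
```

The variable shortfall (roughly 1,700 to 5,900 docs) is what points away from a fixed set of bad documents and toward dropped updates.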

I am getting this error:

org.apache.solr.common.SolrException: Bad Request

request: 
http://192.168.20.51:8983/solr/Inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.51%3A7574%2Fsolr%2FInventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

And I am getting this warning:
org.apache.solr.common.SolrException: Bad Request

request: 
http://192.168.20.51:8983/solr/Inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.51%3A7574%2Fsolr%2FInventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

These are both from the admin logging section.
I have retained the log files if they would help.


AJ

RE: Missing Records

2014-11-03 Thread AJ Lemke
So I jumped back on this.

I have not been using the optimize option on this new set of tests.
If I run the full index on the leader I seem to get all of the items in the 
database minus 3 that have a missing field.

Indexing completed. Added/Updated: 903,990 documents. Deleted 0 documents. 
(Duration: 25m 11s)
Requests: 1 (0/s), Fetched: 903,993 (598/s), Skipped: 0, Processed: 903,990

Last Modified: 2 minutes ago
Num Docs: 903990
Max Doc: 903990
Heap Memory Usage: 2625744
Deleted Docs: 0
Version: 3249
Segment Count: 7
Optimized:
Current:

If I run it on the other node I get:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
(Duration: 27m 08s)
Requests: 1 (0/s), Fetched: 903,993 (555/s), Skipped: 0, Processed: 903,993 
(555/s)

Last Modified: about a minute ago
Num Docs: 897791
Max Doc: 897791
Heap Memory Usage: 2621072
Deleted Docs: 0
Version: 3285
Segment Count: 7
Optimized:
Current:

Any ideas?

If there is any more info that is needed let me know.

AJ


RE: Missing Records

2014-10-31 Thread AJ Lemke
I started this collection using this command:

http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4

So 1 shard and replicationFactor of 2

AJ

-Original Message-
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Thursday, October 30, 2014 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

I am curious, how many shards do you have and what's the replication factor you 
are using?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:

 Hi All,

 We have a SOLR cloud instance that has been humming along nicely for 
 months.
 Last week we started experiencing missing records.

 Admin DIH Example:
 Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s). A *:* 
 search claims that there are only 903,902; this is the first full index.
 Subsequent full indexes give the following counts for the *:* search
 903,805
 903,665
 826,357

 All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
 Processed: 903,993 (x/s) every time (records per second is variable).


 I found an item that should be in the index but is not found in a search.

 Here are the referenced lines of the log file.

 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE 
 add{,id=750041421} 
 {{params(debug=false&optimize=true&indent=true&commit=true&clean=true&wt=json&command=full-import&entity=ads&verbose=false),defaults(config=data-config.xml)}}
 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.SolrCmdDistributor; sending update to 
 http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 
 add{,id=750041421} 
 params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

 --- there are 746 lines of log between entries ---

 DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;   
 [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],di
 strib.from?[0x17] 
 http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ
 [0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%
 51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'sk
 i-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48
 929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2Di
 visionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*Phot
 oCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
 City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162
 Long Track
 [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]
 1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0x
 e0]+Description?VThis Bad boy will pull you through the deepest 
 snow!With the 162 track and 1000cc of power you can fly up any 
 hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
 [0xe0]*ModelFacet7Ski-Doo|Summit 
 Highmark[0xe0]/DealerNameFacet9Certified
 Auto,
 Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerI
 D$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto,
 Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorCol
 or+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].F
 uelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certif
 ied Auto, Inc.|Sioux 
 City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit
 Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105
 [0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
 What could be the issue, and how does one fix it?

 Thanks so much and if more information is needed I have preserved the 
 log files.

 AJ



RE: Missing Records

2014-10-31 Thread AJ Lemke
Hi Erick:

All of the records are coming out of an auto-numbered field, so the IDs will 
all be unique.

Here is the test I ran this morning:

Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
(Duration: 28m)
Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 
(538/s)
Started: 33 minutes ago

Last Modified: 4 minutes ago
Num Docs: 903829
Max Doc: 903829
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1517
Segment Count: 16
Optimized: checked
Current: checked

If there were duplicates only one of the duplicates should be removed and I 
still should be able to search for the ID and find one correct?
As it is right now I am missing records that should be in the collection.

I also noticed this:

org.apache.solr.common.SolrException: Bad Request



request: 
http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

AJ


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, October 30, 2014 7:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Missing Records

First question: Is there any possibility that some of the docs have duplicate 
IDs (uniqueKeys)? If so, then some of the docs will be replaced, which will 
lower your returns.
One way to figure this out is to go to the admin screen: if numDocs < 
maxDoc, then documents have been replaced.

Also, if numDocs is smaller than 903,993 then you probably have some docs being 
replaced. One warning, however: even if docs were replaced, you might not see 
it here, because when segments are merged the deleted docs are purged.
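The numDocs/maxDoc check Erick describes can be scripted against a core's Luke handler (`/solr/<core>/admin/luke?numTerms=0&wt=json`); the sketch below uses a canned response in place of a live node, with invented figures:

```python
import json

# Canned stand-in for the JSON a live node would return from
# GET /solr/inventory/admin/luke?numTerms=0&wt=json (figures invented).
response = json.loads(
    '{"index": {"numDocs": 824216, "maxDoc": 903993, "deletedDocs": 79777}}'
)

index = response["index"]
num_docs, max_doc = index["numDocs"], index["maxDoc"]

if num_docs < max_doc:
    # A gap means some adds replaced existing uniqueKeys, or docs were deleted.
    print("replaced/deleted docs:", max_doc - num_docs)
else:
    print("no evidence of replacements (or a merge/optimize purged it)")
```

Note the caveat from the thread still applies: after an optimize the gap collapses to zero, so the check is only informative before segments are merged.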

Best,
Erick


Re: Missing Records

2014-10-31 Thread Erick Erickson
OK, that is puzzling.

bq: If there were duplicates only one of the duplicates should be
removed and I still should be able to search for the ID and find one
correct?

Correct.

Your bad request error is puzzling; you may be on to something there.
It looks like some of the documents you're sending to Solr aren't
getting indexed: either they're being dropped by the network, or they
have invalid fields or field formats (i.e. a date in the wrong format,
whatever) or some such. When you complete the run, what are the maxDoc
and numDocs numbers on one of the nodes?

What else do you see in the logs? They're pretty big after that many
adds, but maybe you can grep for "ERROR" and see something interesting
like stack traces. Or even "org.apache.solr". This latter will give
you some false hits, but at least it's better than paging through a
huge log file...
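The grep pass described above can be sketched as follows; the log lines here are invented samples, and a real solr.log would be read from disk instead:

```python
import re

# Sample fragment standing in for a real solr.log (lines invented).
log = """\
INFO  - 2014-10-31 10:47:11.001; org.apache.solr.core.SolrCore; [inventory] path=/update
ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; error
org.apache.solr.common.SolrException: Bad Request
INFO  - 2014-10-31 10:47:13.002; org.apache.solr.core.SolrCore; [inventory] commit
"""

lines = log.splitlines()
# Print each ERROR line plus one line of trailing context (the exception head).
for i, line in enumerate(lines):
    if line.startswith("ERROR"):
        print("\n".join(lines[i:i + 2]))

# Tally exception classes, to spot patterns across a large log.
counts = {}
for m in re.finditer(r"org\.apache\.solr\.common\.\w*Exception", log):
    counts[m.group()] = counts.get(m.group(), 0) + 1
print(counts)
```

The same idea from the shell is simply `grep -A2 "^ERROR" solr.log`.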

Personally, in this kind of situation I sometimes use SolrJ to do my
indexing rather than DIH; I find it easier to debug, so that's another
possibility. In the worst case with SolrJ, you can send the docs one
at a time...
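A sketch of that one-at-a-time fallback, with the actual Solr call stubbed out so the isolation logic is visible; the field names and failure mode below are invented for illustration (in a real run, `send` would wrap a SolrJ add or an HTTP POST to /update):

```python
def find_bad_docs(docs, send):
    """Send docs one at a time; collect those the sender rejects.

    `send` posts a single doc to Solr; it is injected here so the
    isolation logic can be shown without a live node.
    """
    bad = []
    for doc in docs:
        try:
            send(doc)
        except Exception as e:
            bad.append((doc.get("ID"), str(e)))
    return bad

# Stub sender: rejects docs missing a required field, mimicking a 400.
def fake_send(doc):
    if "Year" not in doc:
        raise ValueError("Bad Request: missing required field Year")

docs = [
    {"ID": "750041421", "Year": "2005"},
    {"ID": "750041422"},               # missing field -> rejected
    {"ID": "750041423", "Year": "2007"},
]
print(find_bad_docs(docs, fake_send))
```

Run this way, the exact documents drawing the "Bad Request" fall out of the loop instead of being silently swallowed by a batched update.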

Best,
Erick


RE: Missing Records

2014-10-31 Thread AJ Lemke
I have run some more tests so the numbers have changed a bit.

Index Results done on Node 1:
Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
(Duration: 31m 47s)
Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993

Node 1:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Node 2:
Last Modified: 44 minutes ago
Num Docs: 824216
Max Doc: 824216
Heap Memory Usage: -1
Deleted Docs: 0
Version: 1051
Segment Count: 1
Optimized: checked
Current: checked

Search results are the same as the doc numbers above.

Logs only have one instance of an error:

ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; 
error
org.apache.solr.common.SolrException: Bad Request



request: 
http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Some info that may be of help:
This is on my local VM using Jetty with the embedded ZooKeeper.
Commands to start the cloud:

java -DzkRun -jar start.jar
java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar

sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir ~/development/configs/inventory/ -confname config_ inventory
sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory -confname config_ inventory

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4"
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=inventory"

AJ



Re: Missing Records

2014-10-31 Thread Erick Erickson
Sorry to say this, but I don't think the numDocs/maxDoc numbers
are telling you anything, because it looks like you've optimized,
which purges any data associated with deleted docs, including
the internal IDs behind the numDocs/maxDoc figures. So if there
were deletions, we can't see any evidence of same.


Sigh.


On Fri, Oct 31, 2014 at 9:56 AM, AJ Lemke aj.le...@securitylabs.com wrote:
 I have run some more tests so the numbers have changed a bit.

 Index Results done on Node 1:
 Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. 
 (Duration: 31m 47s)
 Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993

 Node 1:
 Last Modified: 44 minutes ago
 Num Docs: 824216
 Max Doc: 824216
 Heap Memory Usage: -1
 Deleted Docs: 0
 Version: 1051
 Segment Count: 1
 Optimized: checked
 Current: checked

 Node 2:
 Last Modified: 44 minutes ago
 Num Docs: 824216
 Max Doc: 824216
 Heap Memory Usage: -1
 Deleted Docs: 0
 Version: 1051
 Segment Count: 1
 Optimized: checked
 Current: checked

 Search results are the same as the doc numbers above.

 Logs only have one instance of an error:

 ERROR - 2014-10-31 10:47:12.867; 
 org.apache.solr.update.StreamingSolrServers$1; error
 org.apache.solr.common.SolrException: Bad Request



 request: 
 http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2Fwt=javabinversion=2
 at 
 org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 Some info that may be of help
 This is on my local vm using jetty with the embedded zookeeper.
 Commands to start cloud:

 java -DzkRun -jar start.jar
 java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar

 sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir 
 ~/development/configs/inventory/ -confname config_ inventory
 sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory 
 -confname config_ inventory

 curl 
 http://localhost:8983/solr/admin/collections?action=CREATEname=inventorynumShards=1replicationFactor=2maxShardsPerNode=4;
 curl http://localhost:8983/solr/admin/collections?action=RELOADname= 
 inventory 

 AJ


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, October 31, 2014 9:49 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Missing Records

 OK, that is puzzling.

 bq: If there were duplicates only one of the duplicates should be removed and 
 I still should be able to search for the ID and find one correct?

 Correct.

Your Bad Request error is puzzling; you may be on to something there.
What it looks like is that somehow some of the documents you're sending to 
Solr aren't getting indexed: either they're being dropped by the network, or 
they have invalid fields or field formats (i.e. a date in the wrong format, 
or some such). When you complete the run, what are the maxDoc and 
numDocs numbers on one of the nodes?

What else do you see in the logs? They're pretty big after that many adds, 
but maybe you can grep for ERROR and see something interesting like stack 
traces. Or even grep for org.apache.solr. The latter will give you some false 
hits, but at least it's better than paging through a huge log file.

Personally, in this kind of situation I sometimes use SolrJ to do my indexing 
rather than DIH; I find it easier to debug, so that's another possibility. In 
the worst case with SolrJ, you can send the docs one at a time.
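The one-at-a-time idea can be sketched generically: wrap the per-document send in a try/except so a Bad Request on one document surfaces instead of disappearing inside a batch. This is a hypothetical Python sketch; the stubbed `fake_send` stands in for the real SolrJ `server.add(doc)` or HTTP call, and the names are illustrative, not Solr API.

```python
def index_one_at_a_time(docs, send):
    """Send docs individually, collecting IDs the server rejects.

    `send` stands in for the real indexing call (e.g. SolrJ's
    server.add(doc)); it should raise on a Bad Request.
    """
    failed = []
    for doc in docs:
        try:
            send(doc)
        except Exception as err:
            # Record the rejected document instead of losing it in a batch.
            failed.append((doc["id"], str(err)))
    return failed


# Stubbed sender: rejects docs with a malformed date field, the way a
# schema mismatch would.
def fake_send(doc):
    if not doc.get("date", "").endswith("Z"):
        raise ValueError("Bad Request: invalid date format")


docs = [
    {"id": "1", "date": "2014-10-30T15:10:51Z"},
    {"id": "2", "date": "10/30/2014"},  # wrong format -> rejected
    {"id": "3", "date": "2014-10-31T09:00:00Z"},
]
failures = index_one_at_a_time(docs, fake_send)
print(failures)  # [('2', 'Bad Request: invalid date format')]
```

Run against the real index, the `failed` list tells you exactly which IDs never made it in, and why.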

 Best,
 Erick

 On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke aj.le...@securitylabs.com wrote:
 Hi Erick:

All of the records are coming out of an auto-numbered field, so the IDs will 
all be unique.

 Here is the the test I ran this morning:

 Indexing completed. Added/Updated: 903,993 documents. Deleted 0
 documents. (Duration: 28m)
 Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed:
 903,993 (538/s)
 Started: 33 minutes ago

 Last Modified:4 minutes ago
 Num Docs:903829
 Max Doc:903829
 Heap Memory Usage:-1
 Deleted Docs:0
 Version:1517
 Segment Count:16
 Optimized: checked
 Current: checked

If there were duplicates, only one of the duplicates should be removed, and I 
should still be able to search for the ID and find one, correct?
As it is right now I am missing records that should be in the collection.

 I also noticed this:

 org.apache.solr.common.SolrException: Bad Request



request: http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2

Missing Records

2014-10-30 Thread AJ Lemke
Hi All,

We have a SOLR cloud instance that has been humming along nicely for months.
Last week we started experiencing missing records.

Admin DIH Example:
Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)
A *:* search claims that there are only 903,902; this is the first full index.
Subsequent full indexes give the following counts for the *:* search
903,805
903,665
826,357

All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, Processed: 
903,993 (x/s) every time. ---records per second is variable


I found an item that should be in the index but is not found in a search.

Here are the referenced lines of the log file.

DEBUG - 2014-10-30 15:10:51.160; 
org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE 
add{,id=750041421} 
{{params(debug=false&optimize=true&indent=true&commit=true&clean=true&wt=json&command=full-import&entity=ads&verbose=false),defaults(config=data-config.xml)}}
DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.SolrCmdDistributor; 
sending update to http://192.168.20.57:7574/solr/inventory_shard1_replica2/ 
retry:0 add{,id=750041421} 
params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

--- there are 746 lines of log between entries ---

DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;   
[0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 
Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux 
City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162 
Long Track 
[0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis
 Bad boy will pull you through the deepest snow!With the 162 track and 1000cc 
of power you can fly up any 
hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission 
[0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified Auto, 
Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto, 
Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 
SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified
 Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit 
Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
What could be the issue and how does one fix this issue?

Thanks so much and if more information is needed I have preserved the log files.

AJ


Re: Missing Records

2014-10-30 Thread S.L
I am curious , how many shards do you have and whats the replication factor
you are using ?



Re: Missing Records

2014-10-30 Thread Erick Erickson
First question: Is there any possibility that some of the docs
have duplicate IDs (uniqueKeys)? If so, then some of
the docs will be replaced, which will lower your returns.
One way to figure this out is to go to the admin screen: if
numDocs < maxDoc, then documents have been replaced.

Also, if numDocs is smaller than 903,993 then you probably have
some docs being replaced. One warning, however: even if numDocs
equals maxDoc, docs could still have been replaced, because the
deleted docs are purged when segments are merged.
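The numDocs/maxDoc distinction can be illustrated with a toy model of an index, assuming no segment merges: a re-add with an existing uniqueKey marks the old copy deleted, so numDocs drops below maxDoc. This is a simplified sketch, not Lucene's actual implementation.

```python
def toy_index(adds):
    """Return (numDocs, maxDoc) after adding docs by uniqueKey.

    A re-add of an existing key replaces the old doc: the old copy still
    occupies a slot (counted by maxDoc) but is marked deleted (excluded
    from numDocs). No merges here, so deletions are never purged.
    """
    live = {}       # uniqueKey -> slot of the live copy
    max_doc = 0     # every add consumes a slot, even replaced ones
    for key in adds:
        live[key] = max_doc
        max_doc += 1
    return len(live), max_doc


# Five adds with one duplicate ID: numDocs ends up one short of maxDoc.
num_docs, max_doc = toy_index(["a", "b", "c", "b", "d"])
print(num_docs, max_doc)  # 4 5
```

In AJ's numbers above, numDocs equals maxDoc (903,829 each), which is why duplicates alone don't explain the shortfall.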

Best,
Erick




Re: Delta Import occasionally missing records.

2011-01-26 Thread Lance Norskog
The SolrEntityProcessor would be a top-level entity. You would do a
query like this: sort=timestamp desc&rows=1&fl=timestamp. This gives
you one data item: the timestamp of the last item added to the index.

With this, the JDBC sub-entity would create a query that chooses all
rows with a timestamp >= this latest timestamp. It will not be easy to
put this together, but it is possible :)
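Lance's two-level setup could look roughly like the data-config sketch below: a SolrEntityProcessor root entity fetches the newest indexed timestamp, and a JDBC sub-entity selects rows at or after it. This is an unverified sketch; the entity and field names (last, ads, sys_time_stamp) and the sort handling are illustrative assumptions, so check the attributes against the DataImportHandler wiki before use.

```xml
<document>
  <!-- Root entity: ask Solr itself for the newest timestamp in the index. -->
  <entity name="last" processor="SolrEntityProcessor"
          url="http://localhost:8983/solr/inventory"
          query="*:*" fl="timestamp" rows="1">
    <!-- hypothetical: how the newest-first ordering is expressed needs
         checking against the SolrEntityProcessor docs -->

    <!-- Sub-entity: JDBC query bounded by the timestamp fetched above. -->
    <entity name="ads" dataSource="db"
            query="SELECT id, date_published, date_created, publish_flag
                   FROM Item
                   WHERE sys_time_stamp &gt;= '${last.timestamp}'"/>
  </entity>
</document>
```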

Good luck!

Lance

Re: Delta Import occasionally missing records.

2011-01-24 Thread btucker

Thank you for your response.

In what way is 'timestamp' not perfect?

I've looked into the SolrEntityProcessor and added a timestamp field to our
index.
However, I'm struggling to work out a query to get the max value of the
timestamp field. Also, does the SolrEntityProcessor entity appear before the
root entity, or does it wrap around the root entity?



Mintel International Group Ltd | 18-19 Long Lane | London EC1A 9PL UK
Registered in England: Number 1475918. | VAT Number: GB 232 9342 72

Contact details for our other offices can be found at 
http://www.mintel.com/office-locations.

This email and any attachments may include content that is confidential, 
privileged, or otherwise protected under applicable law. Unauthorised 
disclosure, copying, distribution, or use of the contents is prohibited 
and may be unlawful. If you have received this email in error, including 
without appropriate authorisation, then please reply to the sender about 
the error and delete this email and any attachments.

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2318572.html
Sent from the Solr - User mailing list archive at Nabble.com.


Delta Import occasionally missing records.

2011-01-21 Thread btucker

Hello

We've just started using Solr to provide search functionality for our
application, with the DataImportHandler performing a delta-import every 1
minute, fired by crontab. This works great; however, it does occasionally miss
records that are added to the database while the delta-import is running.

Our data-config.xml has the following queries in its root entity:

query="SELECT id, date_published, date_created, publish_flag FROM Item
       WHERE id > 0
       AND record_type_id=0
       ORDER BY id DESC"
preImportDeleteQuery="SELECT item_id AS Id FROM
       gnpd_production.item_deletions"
deletedPkQuery="SELECT item_id AS id FROM gnpd_production.item_deletions
       WHERE deletion_date >=
       SUBDATE('${dataimporter.last_index_time}', INTERVAL 5 MINUTE)"
deltaImportQuery="SELECT id, date_published, date_created, publish_flag FROM
       Item WHERE id > 0
       AND record_type_id=0
       AND id=${dataimporter.delta.id}
       ORDER BY id DESC"
deltaQuery="SELECT id, date_published, date_created, publish_flag FROM Item
       WHERE id > 0
       AND record_type_id=0
       AND sys_time_stamp >=
       SUBDATE('${dataimporter.last_index_time}', INTERVAL 1 MINUTE)
       ORDER BY id DESC"

I think the problem I'm having comes from the way Solr stores the
last_index_time in conf/dataimport.properties, as stated on the wiki:

"When delta-import command is executed, it reads the start time stored in
conf/dataimport.properties. It uses that timestamp to run delta queries and
after completion, updates the timestamp in conf/dataimport.properties."

Which to me seems to indicate that any records with a time-stamp between
when the dataimport starts and ends will be missed as the last_index_time is
set to when it completes the import.

This doesn't seem quite right to me. I would have expected the
last_index_time to refer to when the dataimport was last STARTED so that
there was no gaps in the timestamp covered.
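The gap described above can be shown with a toy timeline, assuming the delta query selects rows with sys_time_stamp >= last_index_time: stamping last_index_time with the import's completion time skips rows committed mid-import, while stamping it with the start time does not. Minutes are plain integers here for simplicity.

```python
def missed_rows(row_times, last_index_time):
    """Rows stamped strictly before last_index_time are never selected
    by a `sys_time_stamp >= last_index_time` delta query again."""
    return [t for t in row_times if t < last_index_time]


# Import runs from minute 10 to minute 12; a row lands at minute 11.
import_start, import_end = 10, 12
rows_added_during_import = [11]

# last_index_time = completion time (the behaviour described on the wiki):
print(missed_rows(rows_added_during_import, import_end))    # [11] -> missed
# last_index_time = start time (what one would expect):
print(missed_rows(rows_added_during_import, import_start))  # [] -> covered
```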

I changed the deltaQuery in our config to include the SUBDATE by INTERVAL 1
MINUTE statement to alleviate this problem, but it only covers cases where
the delta-import takes less than a minute.

Any ideas as to how this can be overcome, other than increasing the
INTERVAL to something larger?

Regards

Barry Tucker
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-Import-occasionally-missing-records-tp2300877p2300877.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Delta Import occasionally missing records.

2011-01-21 Thread Lance Norskog
The timestamp thing is not perfect. You can instead do a search
against Solr and find the latest timestamp in the index. SOLR-1499
allows you to search against Solr in the DataImportHandler.





-- 
Lance Norskog
goks...@gmail.com