Hi there,
I've been working on this issue for a while and I really don't know what the
root cause is. Any insight would be great!
I have 14 million records in a MySQL DB. I grab 100,000 records from the DB at
a time and then use ConcurrentUpdateSolrServer (with queue size = 50, thread
count = 4, and the internally managed Solr client) to write the documents to
the Solr index.
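For context, the client setup is essentially this (a simplified sketch against the SolrJ 4.x API, not my exact code; the URL and field names are placeholders):

```java
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexSketch {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Queue size 50, 4 runner threads, internally managed HTTP client
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/ltdl3testperf", 50, 4);

        // In the real build this loop runs over 100,000-row batches from MySQL;
        // here a single placeholder document stands in for one record.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "trpy0136");               // placeholder id
        doc.addField("ocr_text", "...large OCR text...");  // read from the file system
        server.add(doc);             // non-blocking: queued, flushed by runner threads

        server.blockUntilFinished(); // wait for the queue to drain
        server.commit();
        server.shutdown();
    }
}
```

The add() calls return immediately and the runner threads stream the queued documents to Solr, which is why a slow or stalled request on the server side only surfaces later as an error in the Solr log rather than at the add() call site.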
If I build metadata only (i.e., only from DB to Solr), then the index build
takes 4 hrs with no errors.
But if I build metadata + OCR text (the OCR text is stored on the file system
and can be very large), then the index build takes 15 to 16 hrs and I often
get a few "early EOF" errors on the Solr server.
From solr.log:
INFO - 2014-06-13 06:28:27.113;
org.apache.solr.update.processor.LogUpdateProcessor; [ltdl3testperf]
webapp=/solr path=/update params={wt=javabin&version=2} {add=[trpy0136
(1470801743195406336), nfhc0136 (1470801743199600640), sfhc0136
(1470801743205892096), kghc0136 (1470801743218475008), zfhc0136
(1470801743220572160), jghc0136 (1470801743237349376), rghc0136
(1470801743268806656), ffhc0136 (1470801743270903808), pghc0136
(1470801743285583872), sghc0136 (1470801743286632448), ... (14165 adds)]} 0
260102
ERROR - 2014-06-13 06:28:27.114; org.apache.solr.common.SolrException;
java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early
EOF
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
…
We tried increasing the Solr server from 4 to 6 CPUs. We moved the Solr server
to a faster disk. I reduced the queue size for ConcurrentUpdateSolrServer from
100 to 50. But we still cannot consistently get through a full index build
without any EOF errors.
In my past three builds (I build them overnight):
1. The first one succeeded
2. The second one had one early EOF error and dropped 3 records out of 14
million
3. The third one had many early EOFs and dropped around 200,000 records
One cluster of the errors occurred at around 6:28am. I looked at the cpu and
file I/O stats around that time, and didn't see anything out of the ordinary.
> sar
            CPU    %user   %nice  %system  %iowait  %steal  %idle
06:00:01 AM all    42.13   0.00   1.54     2.13     0.00    54.20
06:10:01 AM all    43.30   0.00   1.68     2.77     0.00    52.24
06:20:01 AM all    47.73   0.00   1.83     2.43     0.00    48.01
06:30:01 AM all    47.71   0.00   1.76     3.15     0.00    47.38
06:40:01 AM all    47.01   0.00   1.68     2.55     0.00    48.76
> sar -d
06:00:01 AM DEV       tps      rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await   svctm  %util
06:20:01 AM dev8-0    1.84     2.35      370.95    203.01    0.05      27.60   9.58   1.76
06:20:01 AM dev8-16   83.05    464.90    44384.81  540.05    13.25     160.17  2.53   21.03
06:20:01 AM dev8-32   0.00     0.00      0.00      0.00      0.00      0.00    0.00   0.00
06:20:01 AM dev253-0  1.41     1.71      10.90     8.95      0.01      10.16   3.03   0.43
06:20:01 AM dev253-1  45.09    0.64      360.06    8.00      2.46      54.66   0.30   1.37
06:20:01 AM dev253-2  5513.98  464.90    44092.00  8.08      1623.60   295.54  0.04   21.04
06:30:01 AM dev8-0    2.52     100.62    83.64     72.99     0.03      10.42   6.59   1.66
06:30:01 AM dev8-16   52.56    1502.75   18736.64  385.06    5.67      107.95  2.17   11.42
06:30:01 AM dev8-32   42.55    0.01      38923.71  914.83    15.33     360.27  3.84   16.35
06:30:01 AM dev253-0  3.03     98.24     13.55     36.93     0.03      9.44    2.99   0.90
06:30:01 AM dev253-1  9.06     2.38      70.09     8.00      0.26      29.19   0.84   0.77
06:30:01 AM dev253-2  7216.35  1502.76   57660.35  8.20      2599.49   360.22  0.04   26.58
Does anyone have any suggestions on where I can dig for the root cause?
Thanks!
Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: [email protected]