Re: Indexing 20M documents from MySQL with DIH

Li Thu, 21 Apr 2011 17:30:53 -0700

Can you post your data-config.xml? You probably didn't set batchSize on the dataSource.
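
For context on why batchSize matters: as far as I know, DIH's JdbcDataSource turns batchSize="-1" into a JDBC fetch size of Integer.MIN_VALUE, which MySQL Connector/J reads as "stream rows one at a time" instead of buffering the whole result in memory. Below is a minimal Java sketch of the same idea in plain JDBC. The class and method names are made up for illustration, and this is not a claim about what your config currently does.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: stream rows from MySQL instead of buffering the full result set. */
final class StreamingSelect {

    /**
     * Maps a DIH-style batchSize to a JDBC fetch size. A batchSize of -1
     * becomes Integer.MIN_VALUE, the MySQL Connector/J streaming signal.
     * Any other value is passed through unchanged.
     */
    static int toFetchSize(int batchSize) {
        return batchSize == -1 ? Integer.MIN_VALUE : batchSize;
    }

    /** Connector/J only streams on a forward-only, read-only statement. */
    static Statement openStreamingStatement(Connection conn, int batchSize)
            throws SQLException {
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(toFetchSize(batchSize));
        return stmt;
    }

    /** Walks the result one row at a time and returns the row count. */
    static long countRows(Connection conn, String sql) throws SQLException {
        long rows = 0;
        try (Statement stmt = openStreamingStatement(conn, -1);
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rows++; // a real indexer would build a Solr document here
            }
        }
        return rows;
    }
}
```

Note that Integer.MIN_VALUE is a MySQL-specific convention; other JDBC drivers may reject a negative fetch size.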

Sent from my iPhone


On Apr 21, 2011, at 5:09 PM, Scott Bigelow <eph...@gmail.com> wrote:

> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
> 
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>        ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
> 
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
> 
> 
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
> 
> Thank you for your insight.
> 
> 
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
> <hossman_luc...@fucit.org> wrote:
>> 
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>> 
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>> 
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>> 
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>> 
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>> 
>> 
>> -Hoss
>>
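
On the custom-indexer idea above: one way a brute-force full rebuild could avoid a single multi-hour SELECT is to page through the table by primary key, so each query is short-lived. This is only a sketch under my own assumptions (a numeric, ordered id column; hypothetical class and method names), not something anyone in this thread has described doing. The fetchAfter function stands in for a query like "SELECT ... WHERE id > ? ORDER BY id LIMIT ?".

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

/** Sketch: full rebuild in primary-key-ordered chunks (keyset pagination). */
final class ChunkedRebuild {

    /** Minimal row stand-in; a real row would carry every indexed field. */
    static final class Row {
        final long id;
        final String body;

        Row(long id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /**
     * Repeatedly asks fetchAfter for up to chunkSize rows with id greater
     * than the last id seen, hands each chunk to indexChunk, and stops at
     * the first empty chunk. Returns the total number of rows processed.
     */
    static long rebuild(BiFunction<Long, Integer, List<Row>> fetchAfter,
                        int chunkSize,
                        Consumer<List<Row>> indexChunk) {
        long lastId = Long.MIN_VALUE; // start before any real id
        long total = 0;
        while (true) {
            List<Row> chunk = fetchAfter.apply(lastId, chunkSize);
            if (chunk.isEmpty()) {
                return total;
            }
            indexChunk.accept(chunk); // e.g. send one batch of documents to Solr
            total += chunk.size();
            lastId = chunk.get(chunk.size() - 1).id; // resume after this row
        }
    }
}
```

Because each chunk is its own query, a dropped connection costs one chunk rather than the whole import, and the last id gives a natural restart point.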
