{{{A custom indexer, so that's a fairly common practice? So when you are
dealing with these large indexes, do you try not to fully rebuild them
when you can? It's not a nightly thing, but something to do in case of
a disaster? Is there a difference in the performance of an index that
was built all at once vs. one that has had delta inserts and updates
applied over a period of months?}}}

Is it a common practice? Like all of this, "it depends". It's certainly
easier to let DIH do the work. Sometimes DIH doesn't have all the
capabilities necessary. Or, as Chris said, you may already have a system
built up, and it's easier to just grab the output from that and send it
to Solr, perhaps with SolrJ, rather than use DIH. Some people
are just more comfortable with their own code...
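For the "own code" route, a minimal sketch of a SolrJ indexer reading straight from JDBC might look like the following. This assumes the SolrJ client class of this era (CommonsHttpSolrServer; later releases renamed it) and the MySQL Connector/J driver; the URLs, table, and field names are illustrative, not from the original thread:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CustomIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");

        // Stream rows instead of buffering the whole result set in memory;
        // this fetchSize value is MySQL's documented streaming hint.
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM docs");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            doc.addField("body", rs.getString("body"));
            batch.add(doc);
            if (batch.size() == 1000) { // send in chunks, not one doc at a time
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();
    }
}
```

The batching and the streaming fetch size are the two things DIH handles for you; in custom code you have to remember them yourself, especially at the 20M-row scale discussed below.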

"Do you try not to fully rebuild". It depends on how painful a full rebuild
is. Some people just like the simplicity of starting over every day/week/month.
But you *have* to be able to rebuild your index in case of disaster, and
a periodic full rebuild certainly keeps that process up to date.

"Is there a difference...delta inserts...updates...applied over months". Not
if you do an optimize. When a document is deleted (or updated), it's only
marked as deleted. The associated data is still in the index. Optimize will
reclaim that space and compact the segments, perhaps down to one.
But there's no real operational difference between a newly-rebuilt index
and one that's been optimized. If you don't delete/update, there's not
much reason to optimize either....
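If you do go the incremental route and want to reclaim that deleted-document space, the optimize is a single call. A sketch, again assuming SolrJ against a local Solr (the HTTP equivalent is posting an optimize message to the update handler):

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class Optimizer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // waitFlush, waitSearcher, maxSegments:
        // merge the index down to a single segment before returning
        solr.optimize(true, true, 1);
    }
}
```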

I'll leave the DIH to others......

Best
Erick

On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <eph...@gmail.com> wrote:
> Thanks for the e-mail. I probably should have provided more details,
> but I was more interested in making sure I was approaching the problem
> correctly (using DIH, with one big SELECT statement for millions of
> rows) instead of solving this specific problem. Here's a partial
> stacktrace from this specific problem:
>
> ...
> Caused by: java.io.EOFException: Can not read response from server.
> Expected to read 4 bytes, read 0 bytes before connection was
> unexpectedly lost.
>        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>        ... 22 more
> Apr 21, 2011 3:53:28 AM
> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
> SEVERE: getNext() failed for query 'REDACTED'
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet successfully received from the server was 128
> milliseconds ago.  The last packet sent successfully to the server was
> 25,273,484 milliseconds ago.
> ...
>
>
> A custom indexer, so that's a fairly common practice? So when you are
> dealing with these large indexes, do you try not to fully rebuild them
> when you can? It's not a nightly thing, but something to do in case of
> a disaster? Is there a difference in the performance of an index that
> was built all at once vs. one that has had delta inserts and updates
> applied over a period of months?
>
> Thank you for your insight.
>
>
> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
> <hossman_luc...@fucit.org> wrote:
>>
>> : For a new project, I need to index about 20M records (30 fields) and I
>> : have been running into issues with MySQL disconnects, right around
>> : 15M. I've tried several remedies I've found on blogs, changing
>>
>> if you can provide some concrete error/log messages and the details of how
>> you are configuring your datasource that might help folks provide better
>> suggestions -- you've said you run into a problem but you haven't provided
>> any details for people to go on in giving you feedback.
>>
>> : resolved the issue. It got me wondering: Is this the way everyone does
>> : it? What about 100M records up to 1B; are those all pulled using DIH
>> : and a single query?
>>
>> I've only recently started using DIH, and while it definitely has a lot
>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>> indexing with Solr -- but that doesn't mean it's perfect for all
>> situations.
>>
>> Writing custom indexer code can certainly make sense in a lot of cases --
>> particularly where you already have a data publishing system that you want
>> to tie into directly -- the trick is to ensure you have a decent strategy
>> for rebuilding the entire index should the need arise (but this is really
>> only an issue if your primary indexing solution is incremental -- many use
>> cases can be satisfied just fine with a brute force "full rebuild
>> periodically" implementation).
>>
>>
>> -Hoss
>>
>
