{quote}
...
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago.
...
{quote}

It could probably be because of autocommit / segment merging. You could try to disable autocommit / increase mergeFactor.
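A minimal sketch of what those two changes look like in a 3.x-era solrconfig.xml (the values are illustrative, not tuned recommendations):

{code:xml}
<!-- Disable automatic commits during the bulk import: with autoCommit
     commented out, nothing is committed until the import finishes. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!--
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>

<!-- Raise mergeFactor (default 10) so a large bulk import pauses for
     segment merges less often. -->
<indexDefaults>
  <mergeFactor>30</mergeFactor>
</indexDefaults>
{code}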
{quote}
I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged on the primary key; does Solr offer similar functionality? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection.
{quote}

I was thinking about some hackish solution to paginate results:

{code:xml}
<entity name="pages"
        query="SELECT id FROM generate_series( (SELECT count(*) FROM source_table) / 1000 )" ...>
  <entity name="records"
          query="SELECT * FROM source_table LIMIT 1000 OFFSET ${pages.id}*1000">
  </entity>
</entity>
{code}

Or something along those lines (you'd need to calculate the offset in the pages query). But unfortunately MySQL does not provide a generate_series function (it's a Postgres function, and there are similar solutions for Oracle and MSSQL).
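For what it's worth, the missing page sequence can be emulated in MySQL with a user variable. A rough sketch under the same 1000-rows-per-page assumption (source_table is the illustrative name from above; the LIMIT has to be at least count(*)/1000, e.g. 20000 pages for 20M rows):

{code:xml}
<!-- Hackish generate_series substitute: cross-join a user-variable
     initializer against any table with at least as many rows as pages.
     @page starts at -1 so the first page id is 0. -->
<entity name="pages"
        query="SELECT @page := @page + 1 AS id
               FROM source_table, (SELECT @page := -1) seq
               LIMIT 20000">
  <entity name="records"
          query="SELECT * FROM source_table LIMIT 1000 OFFSET ${pages.id}*1000"/>
</entity>
{code}

Note that large OFFSETs get progressively slower in MySQL; ranged queries on the primary key (WHERE id BETWEEN ... AND ..., which is what Sphinx does) scale much better for deep pagination than LIMIT/OFFSET.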
On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow <eph...@gmail.com> wrote:
> Thank you everyone for your help. I ended up getting the index to work
> using the exact same config file on a (substantially) larger instance.
>
> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> {{{A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?}}}
>>
>> Is it a common practice? Like all of this, "it depends". It's certainly
>> easier to let DIH do the work. Sometimes DIH doesn't have all the
>> capabilities necessary. Or, as Chris said, in the case where you already
>> have a system built up, it's easier to just grab the output from
>> that and send it to Solr, perhaps with SolrJ, and not use DIH. Some people
>> are just more comfortable with their own code...
>>
>> "Do you try not to fully rebuild?" It depends on how painful a full rebuild
>> is. Some people just like the simplicity of starting over every
>> day/week/month.
>> But you *have* to be able to rebuild your index in case of disaster, and
>> a periodic full rebuild certainly keeps that process up to date.
>>
>> "Is there a difference...delta inserts...updates...applied over months?" Not
>> if you do an optimize. When a document is deleted (or updated), it's only
>> marked as deleted. The associated data is still in the index. Optimize will
>> reclaim that space and compact the segments, perhaps down to one.
>> But there's no real operational difference between a newly rebuilt index
>> and one that's been optimized. If you don't delete/update, there's not
>> much reason to optimize either...
>>
>> I'll leave the DIH to others...
>>
>> Best
>> Erick
>>
>> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <eph...@gmail.com> wrote:
>>> Thanks for the e-mail. I probably should have provided more details,
>>> but I was more interested in making sure I was approaching the problem
>>> correctly (using DIH, with one big SELECT statement for millions of
>>> rows) instead of solving this specific problem. Here's a partial
>>> stacktrace from this specific problem:
>>>
>>> ...
>>> Caused by: java.io.EOFException: Can not read response from server.
>>> Expected to read 4 bytes, read 0 bytes before connection was
>>> unexpectedly lost.
>>>         at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>>         at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>>         ... 22 more
>>> Apr 21, 2011 3:53:28 AM
>>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>>> SEVERE: getNext() failed for query 'REDACTED'
>>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>> Communications link failure
>>>
>>> The last packet successfully received from the server was 128
>>> milliseconds ago. The last packet sent successfully to the server was
>>> 25,273,484 milliseconds ago.
>>> ...
>>>
>>> A custom indexer, so that's a fairly common practice? So when you are
>>> dealing with these large indexes, do you try not to fully rebuild them
>>> when you can? It's not a nightly thing, but something to do in case of
>>> a disaster? Is there a difference in the performance of an index that
>>> was built all at once vs. one that has had delta inserts and updates
>>> applied over a period of months?
>>>
>>> Thank you for your insight.
>>>
>>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>>> <hossman_luc...@fucit.org> wrote:
>>>>
>>>> : For a new project, I need to index about 20M records (30 fields) and I
>>>> : have been running into issues with MySQL disconnects, right around
>>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>>
>>>> If you can provide some concrete error/log messages and the details of how
>>>> you are configuring your datasource, that might help folks provide better
>>>> suggestions -- you've said you run into a problem, but you haven't provided
>>>> any details for people to go on in giving you feedback.
>>>>
>>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>>> : and a single query?
>>>>
>>>> I've only recently started using DIH, and while it definitely has a lot
>>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>>> situations.
>>>>
>>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>>> particularly where you already have a data publishing system that you want
>>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>>> for rebuilding the entire index should the need arise. (This is really
>>>> only an issue if your primary indexing solution is incremental -- many use
>>>> cases can be satisfied just fine with a brute-force "full rebuild
>>>> periodically" implementation.)
>>>>
>>>> -Hoss
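For anyone who lands here with the same stacktrace: the two knobs usually suggested for DIH against a large MySQL table are to stream rows instead of buffering the whole result set, and to raise the write timeout MySQL applies to streaming connections. A sketch only; the URL, credentials, and timeout value are illustrative:

{code:xml}
<!-- batchSize="-1" makes JdbcDataSource pass fetchSize=Integer.MIN_VALUE,
     which MySQL Connector/J treats as "stream one row at a time" instead
     of buffering all rows in memory. netTimeoutForStreamingResults is a
     Connector/J URL property (in seconds) that raises net_write_timeout
     for this connection, so long merge/commit pauses on the Solr side
     don't get the stream killed. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?netTimeoutForStreamingResults=7200"
            batchSize="-1"
            user="solr"
            password="secret"/>
{code}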