{quote}
...
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
        ... 22 more
Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago.
...
{quote}

It could probably be because of autocommit / segment merging. You could try to disable autocommit / increase mergeFactor.
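A minimal sketch of what those two changes look like in a 3.x-era solrconfig.xml (the values are illustrative, not tuned recommendations):

{code:xml}
<!-- Disable automatic commits during the bulk import: with autoCommit
     commented out, nothing is committed until the import finishes. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!--
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>

<!-- Raise mergeFactor (default 10) so a large bulk import pauses for
     segment merges less often. -->
<indexDefaults>
  <mergeFactor>30</mergeFactor>
</indexDefaults>
{code}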
{quote}
I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged on the primary key; does Solr offer similar functionality? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection.
{quote}

I was thinking about some hackish solution to paginate results:

{code:xml}
<entity name="pages"
        query="SELECT id FROM generate_series( (SELECT count(*) FROM source_table) / 1000 )" ...>
  <entity name="records"
          query="SELECT * FROM source_table LIMIT 1000 OFFSET ${pages.id}*1000">
  </entity>
</entity>
{code}

Or something along those lines (you'd need to calculate the offset in the pages query). But unfortunately MySQL does not provide a generate_series function (it's a Postgres function, and there are similar solutions for Oracle and MSSQL).
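For what it's worth, the missing page sequence can be emulated in MySQL with a user variable. A rough sketch under the same 1000-rows-per-page assumption (source_table is the illustrative name from above; the LIMIT has to be at least count(*)/1000, e.g. 20000 pages for 20M rows):

{code:xml}
<!-- Hackish generate_series substitute: cross-join a user-variable
     initializer against any table with at least as many rows as pages.
     @page starts at -1 so the first page id is 0. -->
<entity name="pages"
        query="SELECT @page := @page + 1 AS id
               FROM source_table, (SELECT @page := -1) seq
               LIMIT 20000">
  <entity name="records"
          query="SELECT * FROM source_table LIMIT 1000 OFFSET ${pages.id}*1000"/>
</entity>
{code}

Note that large OFFSETs get progressively slower in MySQL; ranged queries on the primary key (WHERE id BETWEEN ... AND ..., which is what Sphinx does) scale much better for deep pagination than LIMIT/OFFSET.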
On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow <eph...@gmail.com> wrote:
> Thank you everyone for your help. I ended up getting the index to work
> using the exact same config file on a (substantially) larger instance.
>
> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> {{{A custom indexer, so that's a fairly common practice? So when you are
>> dealing with these large indexes, do you try not to fully rebuild them
>> when you can? It's not a nightly thing, but something to do in case of
>> a disaster? Is there a difference in the performance of an index that
>> was built all at once vs. one that has had delta inserts and updates
>> applied over a period of months?}}}
>>
>> Is it a common practice? Like all of this, "it depends". It's certainly
>> easier to let DIH do the work. Sometimes DIH doesn't have all the
>> capabilities necessary. Or, as Chris said, in the case where you already
>> have a system built up, it's easier to just grab the output from
>> that and send it to Solr, perhaps with SolrJ, and not use DIH. Some people
>> are just more comfortable with their own code...
>>
>> "Do you try not to fully rebuild?" It depends on how painful a full rebuild
>> is. Some people just like the simplicity of starting over every
>> day/week/month.
>> But you *have* to be able to rebuild your index in case of disaster, and
>> a periodic full rebuild certainly keeps that process up to date.
>>
>> "Is there a difference...delta inserts...updates...applied over months?" Not
>> if you do an optimize. When a document is deleted (or updated), it's only
>> marked as deleted. The associated data is still in the index. Optimize will
>> reclaim that space and compact the segments, perhaps down to one.
>> But there's no real operational difference between a newly rebuilt index
>> and one that's been optimized. If you don't delete/update, there's not
>> much reason to optimize either...
>>
>> I'll leave the DIH to others...
>>
>> Best
>> Erick
>>
>> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow <eph...@gmail.com> wrote:
>>> Thanks for the e-mail. I probably should have provided more details,
>>> but I was more interested in making sure I was approaching the problem
>>> correctly (using DIH, with one big SELECT statement for millions of
>>> rows) instead of solving this specific problem. Here's a partial
>>> stacktrace from this specific problem:
>>>
>>> ...
>>> Caused by: java.io.EOFException: Can not read response from server.
>>> Expected to read 4 bytes, read 0 bytes before connection was
>>> unexpectedly lost.
>>>         at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
>>>         at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
>>>         ... 22 more
>>> Apr 21, 2011 3:53:28 AM
>>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext
>>> SEVERE: getNext() failed for query 'REDACTED'
>>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>> Communications link failure
>>>
>>> The last packet successfully received from the server was 128
>>> milliseconds ago. The last packet sent successfully to the server was
>>> 25,273,484 milliseconds ago.
>>> ...
>>>
>>> A custom indexer, so that's a fairly common practice? So when you are
>>> dealing with these large indexes, do you try not to fully rebuild them
>>> when you can? It's not a nightly thing, but something to do in case of
>>> a disaster? Is there a difference in the performance of an index that
>>> was built all at once vs. one that has had delta inserts and updates
>>> applied over a period of months?
>>>
>>> Thank you for your insight.
>>>
>>> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
>>> <hossman_luc...@fucit.org> wrote:
>>>>
>>>> : For a new project, I need to index about 20M records (30 fields) and I
>>>> : have been running into issues with MySQL disconnects, right around
>>>> : 15M. I've tried several remedies I've found on blogs, changing
>>>>
>>>> If you can provide some concrete error/log messages and the details of how
>>>> you are configuring your datasource, that might help folks provide better
>>>> suggestions -- you've said you run into a problem, but you haven't provided
>>>> any details for people to go on in giving you feedback.
>>>>
>>>> : resolved the issue. It got me wondering: Is this the way everyone does
>>>> : it? What about 100M records up to 1B; are those all pulled using DIH
>>>> : and a single query?
>>>>
>>>> I've only recently started using DIH, and while it definitely has a lot
>>>> of quirks/annoyances, it seems like a pretty good 80/20 solution for
>>>> indexing with Solr -- but that doesn't mean it's perfect for all
>>>> situations.
>>>>
>>>> Writing custom indexer code can certainly make sense in a lot of cases --
>>>> particularly where you already have a data publishing system that you want
>>>> to tie into directly -- the trick is to ensure you have a decent strategy
>>>> for rebuilding the entire index should the need arise. (This is really
>>>> only an issue if your primary indexing solution is incremental -- many use
>>>> cases can be satisfied just fine with a brute-force "full rebuild
>>>> periodically" implementation.)
>>>>
>>>> -Hoss
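For anyone who lands here with the same stacktrace: the two knobs usually suggested for DIH against a large MySQL table are to stream rows instead of buffering the whole result set, and to raise the write timeout MySQL applies to streaming connections. A sketch only; the URL, credentials, and timeout value are illustrative:

{code:xml}
<!-- batchSize="-1" makes JdbcDataSource pass fetchSize=Integer.MIN_VALUE,
     which MySQL Connector/J treats as "stream one row at a time" instead
     of buffering all rows in memory. netTimeoutForStreamingResults is a
     Connector/J URL property (in seconds) that raises net_write_timeout
     for this connection, so long merge/commit pauses on the Solr side
     don't get the stream killed. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?netTimeoutForStreamingResults=7200"
            batchSize="-1"
            user="solr"
            password="secret"/>
{code}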