Re: Indexing 20M documents from MySQL with DIH
Alex, thanks for your response. I suspect you're right about autoCommit; I ended up solving the problem by merely moving the entire Solr install, untouched, to a significantly larger instance (EC2 m1.small to m1.large). I think it is appropriately sized now for the quantity and intensity of queries that will be thrown at it when it enters production, so I never bothered to get it working on the smaller instance. Your examples are interesting; I wonder if you could create some count table to make up for MySQL's lack of a row generator. Either way, it seems like paging through results would be a must-have for any enterprise-level indexer, and I'm surprised to find it missing in Solr. When relying on the delta import mechanism for updates, it's not as though one would need the consistency of pulling the entire record set as a single, isolated query, since the delta import is designed to fetch new documents and merge them into a slightly out-of-date/inconsistent index.

On Thu, May 5, 2011 at 12:10 PM, Alexey Serba wrote: > {quote} > ... > Caused by: java.io.EOFException: Can not read response from server. > Expected to read 4 bytes, read 0 bytes before connection was > unexpectedly lost. > at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) > at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) > ... 22 more > Apr 21, 2011 3:53:28 AM > org.apache.solr.handler.dataimport.EntityProcessorBase getNext > SEVERE: getNext() failed for query 'REDACTED' > org.apache.solr.handler.dataimport.DataImportHandlerException: > com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: > Communications link failure > > The last packet successfully received from the server was 128 > milliseconds ago. The last packet sent successfully to the server was > 25,273,484 milliseconds ago. > ... > {quote} > > It could probably be because of autocommit / segment merging.
> You could try to disable autocommit / increase mergeFactor > > {quote} > I've used Sphinx in the past, which uses multiple queries to pull out > a subset of records ranged based on primary key; does Solr offer > functionality similar to this? It seems that once a Solr index gets to > a certain size, the indexing of a batch takes longer than MySQL's > net_write_timeout, so it kills the connection. > {quote} > > I was thinking about some hackish solution to paginate results > > > > > Or something along those lines (you'd need to calculate the offset in > the pages query). > > But unfortunately MySQL does not provide a generate_series function > (it's a Postgres function, and there are similar workarounds for Oracle and > MSSQL). > > > On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow wrote: >> Thank you everyone for your help. I ended up getting the index to work >> using the exact same config file on a (substantially) larger instance. >> >> On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson >> wrote: >>> {{{A custom indexer, so that's a fairly common practice? So when you are >>> dealing with these large indexes, do you try not to fully rebuild them >>> when you can? It's not a nightly thing, but something to do in case of >>> a disaster? Is there a difference in the performance of an index that >>> was built all at once vs. one that has had delta inserts and updates >>> applied over a period of months?}}} >>> >>> Is it a common practice? Like all of this, "it depends". It's certainly >>> easier to let DIH do the work. Sometimes DIH doesn't have all the >>> capabilities necessary. Or as Chris said, in the case where you already >>> have a system built up and it's easier to just grab the output from >>> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people >>> are just more comfortable with their own code... >>> >>> "Do you try not to fully rebuild". It depends on how painful a full rebuild >>> is.
Some people just like the simplicity of starting over every >>> day/week/month. >>> But you *have* to be able to rebuild your index in case of disaster, and >>> a periodic full rebuild certainly keeps that process up to date. >>> >>> "Is there a difference...delta inserts...updates...applied over months". Not >>> if you do an optimize. When a document is deleted (or updated), it's only >>> marked as deleted. The associated data is still in the index. Optimize will >>> reclaim that space and compact the segments, perhaps down to one. >>> But there's no real operational difference between a newly-rebuilt index >>> and one that's been optimized. If you don't delete/update, there's not >>> much reason to optimize either >>> >>> I'll leave the DIH to others.. >>> >>> Best >>> Erick
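Scott's "count table" idea in the reply above is workable. A sketch of what such a numbers table might look like in MySQL — untested, with a made-up table name, standing in for Postgres' generate_series():

```sql
-- Hypothetical "counter" table acting as a row generator.
CREATE TABLE counter (n INT NOT NULL PRIMARY KEY);

-- Seed one row, then double the row count with self-inserts:
INSERT INTO counter VALUES (1);
INSERT INTO counter SELECT n + 1 FROM counter;  -- rows 1..2
INSERT INTO counter SELECT n + 2 FROM counter;  -- rows 1..4
INSERT INTO counter SELECT n + 4 FROM counter;  -- rows 1..8
-- ...keep doubling until COUNT(*) covers CEIL(total_rows / page_size)
```

With ~20M source rows and 10,000-row pages, roughly 2,000 counter rows would be enough; a paging query can then derive each page's offset as (n - 1) * 10000.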
Re: Indexing 20M documents from MySQL with DIH
I am running into this problem as well, but only sporadically, and only in my 3.1 test environment, not 1.4.1 production. I may have narrowed things down; I am now interested in learning whether this is a problem with the MySQL connector or with DIH.

On 4/21/2011 6:09 PM, Scott Bigelow wrote: Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stacktrace from this specific problem: ... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) ... 22 more Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ...
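If the fault is on the connector side, it may also be tunable there. A hedged sketch of a DIH dataSource — host, database, and credentials are placeholders — using Connector/J's netTimeoutForStreamingResults property, which raises the server's net_write_timeout (the timeout implicated in the stack trace) for the duration of a streaming result set:

```xml
<!-- Sketch only: host, db, and credentials are placeholders. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://db-host/mydb?netTimeoutForStreamingResults=3600"
            user="solr_indexer"
            password="..."
            batchSize="-1"/>
```

The 3600-second value is illustrative; the point is to give the server longer than the slowest Solr indexing batch before it abandons the connection.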
Re: Indexing 20M documents from MySQL with DIH
{quote} ... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) ... 22 more Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ... {quote} It could probably be because of autocommit / segment merging. You could try to disable autocommit / increase mergeFactor. {quote} I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged based on primary key; does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. {quote} I was thinking about some hackish solution to paginate results, or something along those lines (you'd need to calculate the offset in the pages query). But unfortunately MySQL does not provide a generate_series function (it's a Postgres function, and there are similar workarounds for Oracle and MSSQL). On Mon, Apr 25, 2011 at 3:59 AM, Scott Bigelow wrote: > Thank you everyone for your help. I ended up getting the index to work > using the exact same config file on a (substantially) larger instance. > > On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson > wrote: >> {{{A custom indexer, so that's a fairly common practice? So when you are >> dealing with these large indexes, do you try not to fully rebuild them >> when you can? It's not a nightly thing, but something to do in case of >> a disaster?
Is there a difference in the performance of an index that >> was built all at once vs. one that has had delta inserts and updates >> applied over a period of months?}}} >> >> Is it a common practice? Like all of this, "it depends". It's certainly >> easier to let DIH do the work. Sometimes DIH doesn't have all the >> capabilities necessary. Or as Chris said, in the case where you already >> have a system built up and it's easier to just grab the output from >> that and send it to Solr, perhaps with SolrJ and not use DIH. Some people >> are just more comfortable with their own code... >> >> "Do you try not to fully rebuild". It depends on how painful a full rebuild >> is. Some people just like the simplicity of starting over every >> day/week/month. >> But you *have* to be able to rebuild your index in case of disaster, and >> a periodic full rebuild certainly keeps that process up to date. >> >> "Is there a difference...delta inserts...updates...applied over months". Not >> if you do an optimize. When a document is deleted (or updated), it's only >> marked as deleted. The associated data is still in the index. Optimize will >> reclaim that space and compact the segments, perhaps down to one. >> But there's no real operational difference between a newly-rebuilt index >> and one that's been optimized. If you don't delete/update, there's not >> much reason to optimize either >> >> I'll leave the DIH to others.. >> >> Best >> Erick >> >> On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow wrote: >>> Thanks for the e-mail. I probably should have provided more details, >>> but I was more interested in making sure I was approaching the problem >>> correctly (using DIH, with one big SELECT statement for millions of >>> rows) instead of solving this specific problem. Here's a partial >>> stacktrace from this specific problem: >>> >>> ... >>> Caused by: java.io.EOFException: Can not read response from server. 
>>> Expected to read 4 bytes, read 0 bytes before connection was >>> unexpectedly lost. >>> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) >>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) >>> ... 22 more >>> Apr 21, 2011 3:53:28 AM >>> org.apache.solr.handler.dataimport.EntityProcessorBase getNext >>> SEVERE: getNext() failed for query 'REDACTED' >>> org.apache.solr.handler.dataimport.DataImportHandlerException: >>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: >>> Communications link failure >>> >>> The last packet successfully received from the server was 128 >>> milliseconds ago. The last packet sent successfully to the server was >>> 25,273,484 milliseconds ago. >>> ...
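Alexey's pagination example did not survive in the archive; the following is a reconstructed guess at the shape of the hack he describes, not his original — table and column names are invented, and it is untested. The outer entity is marked rootEntity="false" so that each row of the inner entity becomes a document, and the offset is computed in the pages query itself, since DIH's ${...} substitution cannot do arithmetic:

```xml
<!-- Reconstructed sketch: "counter" is a helper table holding integers
     1..N, "documents" is the hypothetical source table. Each pages row
     drives one bounded SELECT of 10,000 rows. -->
<entity name="pages" rootEntity="false"
        query="SELECT (n - 1) * 10000 AS page_offset FROM counter
               WHERE n &lt;= CEIL((SELECT COUNT(*) FROM documents) / 10000)">
  <entity name="doc"
          query="SELECT id, title, body FROM documents
                 ORDER BY id LIMIT 10000 OFFSET ${pages.page_offset}"/>
</entity>
```

Each inner query is short-lived, so no single result set has to stay open longer than net_write_timeout while Solr digests a batch.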
Re: Indexing 20M documents from MySQL with DIH
Thank you everyone for your help. I ended up getting the index to work using the exact same config file on a (substantially) larger instance. On Fri, Apr 22, 2011 at 5:46 AM, Erick Erickson wrote: > {{{A custom indexer, so that's a fairly common practice? So when you are > dealing with these large indexes, do you try not to fully rebuild them > when you can? It's not a nightly thing, but something to do in case of > a disaster? Is there a difference in the performance of an index that > was built all at once vs. one that has had delta inserts and updates > applied over a period of months?}}} > > Is it a common practice? Like all of this, "it depends". It's certainly > easier to let DIH do the work. Sometimes DIH doesn't have all the > capabilities necessary. Or as Chris said, in the case where you already > have a system built up and it's easier to just grab the output from > that and send it to Solr, perhaps with SolrJ and not use DIH. Some people > are just more comfortable with their own code... > > "Do you try not to fully rebuild". It depends on how painful a full rebuild > is. Some people just like the simplicity of starting over every > day/week/month. > But you *have* to be able to rebuild your index in case of disaster, and > a periodic full rebuild certainly keeps that process up to date. > > "Is there a difference...delta inserts...updates...applied over months". Not > if you do an optimize. When a document is deleted (or updated), it's only > marked as deleted. The associated data is still in the index. Optimize will > reclaim that space and compact the segments, perhaps down to one. > But there's no real operational difference between a newly-rebuilt index > and one that's been optimized. If you don't delete/update, there's not > much reason to optimize either > > I'll leave the DIH to others.. > > Best > Erick > > On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow wrote: >> Thanks for the e-mail. 
I probably should have provided more details, >> but I was more interested in making sure I was approaching the problem >> correctly (using DIH, with one big SELECT statement for millions of >> rows) instead of solving this specific problem. Here's a partial >> stacktrace from this specific problem: >> >> ... >> Caused by: java.io.EOFException: Can not read response from server. >> Expected to read 4 bytes, read 0 bytes before connection was >> unexpectedly lost. >> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) >> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) >> ... 22 more >> Apr 21, 2011 3:53:28 AM >> org.apache.solr.handler.dataimport.EntityProcessorBase getNext >> SEVERE: getNext() failed for query 'REDACTED' >> org.apache.solr.handler.dataimport.DataImportHandlerException: >> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: >> Communications link failure >> >> The last packet successfully received from the server was 128 >> milliseconds ago. The last packet sent successfully to the server was >> 25,273,484 milliseconds ago. >> ... >> >> >> A custom indexer, so that's a fairly common practice? So when you are >> dealing with these large indexes, do you try not to fully rebuild them >> when you can? It's not a nightly thing, but something to do in case of >> a disaster? Is there a difference in the performance of an index that >> was built all at once vs. one that has had delta inserts and updates >> applied over a period of months? >> >> Thank you for your insight. >> >> >> On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter >> wrote: >>> >>> : For a new project, I need to index about 20M records (30 fields) and I >>> : have been running into issues with MySQL disconnects, right around >>> : 15M. 
I've tried several remedies I've found on blogs, changing >>> >>> if you can provide some concrete error/log messages and the details of how >>> you are configuring your datasource that might help folks provide better >>> suggestions -- you've said you run into a problem but you haven't provided >>> any details for people to go on in giving you feedback. >>> >>> : resolved the issue. It got me wondering: Is this the way everyone does >>> : it? What about 100M records up to 1B; are those all pulled using DIH >>> : and a single query? >>> >>> I've only recently started using DIH, and while it definitely has a lot >>> of quirks/annoyances, it seems like a pretty good 80/20 solution for >>> indexing with Solr -- but that doesn't mean it's perfect for all >>> situations. >>> >>> Writing custom indexer code can certainly make sense in a lot of cases -- >>> particularly where you already have a data publishing system that you want >>> to tie into directly -- the trick is to ensure you have a decent strategy >>> for rebuilding the entire index should the need arise (but this is really >>> only an issue if your primary indexing solution is incremental -- many use >>> cases can be satisfied just fine with a brute force "full rebuild >>> periodically" implementation).
Re: Indexing 20M documents from MySQL with DIH
{{{A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a nightly thing, but something to do in case of a disaster? Is there a difference in the performance of an index that was built all at once vs. one that has had delta inserts and updates applied over a period of months?}}} Is it a common practice? Like all of this, "it depends". It's certainly easier to let DIH do the work. Sometimes DIH doesn't have all the capabilities necessary. Or as Chris said, in the case where you already have a system built up and it's easier to just grab the output from that and send it to Solr, perhaps with SolrJ and not use DIH. Some people are just more comfortable with their own code... "Do you try not to fully rebuild". It depends on how painful a full rebuild is. Some people just like the simplicity of starting over every day/week/month. But you *have* to be able to rebuild your index in case of disaster, and a periodic full rebuild certainly keeps that process up to date. "Is there a difference...delta inserts...updates...applied over months". Not if you do an optimize. When a document is deleted (or updated), it's only marked as deleted. The associated data is still in the index. Optimize will reclaim that space and compact the segments, perhaps down to one. But there's no real operational difference between a newly-rebuilt index and one that's been optimized. If you don't delete/update, there's not much reason to optimize either. I'll leave the DIH to others... Best, Erick

On Thu, Apr 21, 2011 at 8:09 PM, Scott Bigelow wrote: > Thanks for the e-mail. I probably should have provided more details, > but I was more interested in making sure I was approaching the problem > correctly (using DIH, with one big SELECT statement for millions of > rows) instead of solving this specific problem. Here's a partial > stacktrace from this specific problem: > > ...
> Caused by: java.io.EOFException: Can not read response from server. > Expected to read 4 bytes, read 0 bytes before connection was > unexpectedly lost. > at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) > at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) > ... 22 more > Apr 21, 2011 3:53:28 AM > org.apache.solr.handler.dataimport.EntityProcessorBase getNext > SEVERE: getNext() failed for query 'REDACTED' > org.apache.solr.handler.dataimport.DataImportHandlerException: > com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: > Communications link failure > > The last packet successfully received from the server was 128 > milliseconds ago. The last packet sent successfully to the server was > 25,273,484 milliseconds ago. > ... > > > A custom indexer, so that's a fairly common practice? So when you are > dealing with these large indexes, do you try not to fully rebuild them > when you can? It's not a nightly thing, but something to do in case of > a disaster? Is there a difference in the performance of an index that > was built all at once vs. one that has had delta inserts and updates > applied over a period of months? > > Thank you for your insight. > > > On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter > wrote: >> >> : For a new project, I need to index about 20M records (30 fields) and I >> : have been running into issues with MySQL disconnects, right around >> : 15M. I've tried several remedies I've found on blogs, changing >> >> if you can provide some concrete error/log messages and the details of how >> you are configuring your datasource that might help folks provide better >> suggestions -- you've said you run into a problem but you haven't provided >> any details for people to go on in giving you feedback. >> >> : resolved the issue. It got me wondering: Is this the way everyone does >> : it? What about 100M records up to 1B; are those all pulled using DIH >> : and a single query?
>> >> I've only recently started using DIH, and while it definitely has a lot >> of quirks/annoyances, it seems like a pretty good 80/20 solution for >> indexing with Solr -- but that doesn't mean it's perfect for all >> situations. >> >> Writing custom indexer code can certainly make sense in a lot of cases -- >> particularly where you already have a data publishing system that you want >> to tie into directly -- the trick is to ensure you have a decent strategy >> for rebuilding the entire index should the need arise (but this is really >> only an issue if your primary indexing solution is incremental -- many use >> cases can be satisfied just fine with a brute force "full rebuild >> periodically" implementation). >> >> >> -Hoss >> >
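For reference, the optimize Erick describes can be triggered by posting an update message to Solr; the URL and core name vary per install:

```xml
<!-- POST to the update handler, e.g. http://localhost:8983/solr/update -->
<optimize waitSearcher="true"/>
```

This merges segments and reclaims the space held by deleted/updated documents, which is what makes an optimized index operationally equivalent to a freshly rebuilt one.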
Re: Indexing 20M documents from MySQL with DIH
Can you post the data-config.xml? Probably you didn't set batchSize. Sent from my iPhone On Apr 21, 2011, at 5:09 PM, Scott Bigelow wrote: > Thanks for the e-mail. I probably should have provided more details, > but I was more interested in making sure I was approaching the problem > correctly (using DIH, with one big SELECT statement for millions of > rows) instead of solving this specific problem. Here's a partial > stacktrace from this specific problem: > > ... > Caused by: java.io.EOFException: Can not read response from server. > Expected to read 4 bytes, read 0 bytes before connection was > unexpectedly lost. > at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) > at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) > ... 22 more > Apr 21, 2011 3:53:28 AM > org.apache.solr.handler.dataimport.EntityProcessorBase getNext > SEVERE: getNext() failed for query 'REDACTED' > org.apache.solr.handler.dataimport.DataImportHandlerException: > com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: > Communications link failure > > The last packet successfully received from the server was 128 > milliseconds ago. The last packet sent successfully to the server was > 25,273,484 milliseconds ago. > ... > > > A custom indexer, so that's a fairly common practice? So when you are > dealing with these large indexes, do you try not to fully rebuild them > when you can? It's not a nightly thing, but something to do in case of > a disaster? Is there a difference in the performance of an index that > was built all at once vs. one that has had delta inserts and updates > applied over a period of months? > > Thank you for your insight. > > > On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter > wrote: >> >> : For a new project, I need to index about 20M records (30 fields) and I >> : have been running into issues with MySQL disconnects, right around >> : 15M.
I've tried several remedies I've found on blogs, changing >> >> if you can provide some concrete error/log messages and the details of how >> you are configuring your datasource that might help folks provide better >> suggestions -- you've said you run into a problem but you haven't provided >> any details for people to go on in giving you feedback. >> >> : resolved the issue. It got me wondering: Is this the way everyone does >> : it? What about 100M records up to 1B; are those all pulled using DIH >> : and a single query? >> >> I've only recently started using DIH, and while it definitely has a lot >> of quirks/annoyances, it seems like a pretty good 80/20 solution for >> indexing with Solr -- but that doesn't mean it's perfect for all >> situations. >> >> Writing custom indexer code can certainly make sense in a lot of cases -- >> particularly where you already have a data publishing system that you want >> to tie into directly -- the trick is to ensure you have a decent strategy >> for rebuilding the entire index should the need arise (but this is really >> only an issue if your primary indexing solution is incremental -- many use >> cases can be satisfied just fine with a brute force "full rebuild >> periodically" implementation). >> >> >> -Hoss >>
Re: Indexing 20M documents from MySQL with DIH
Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stacktrace from this specific problem: ... Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989) ... 22 more Apr 21, 2011 3:53:28 AM org.apache.solr.handler.dataimport.EntityProcessorBase getNext SEVERE: getNext() failed for query 'REDACTED' org.apache.solr.handler.dataimport.DataImportHandlerException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 128 milliseconds ago. The last packet sent successfully to the server was 25,273,484 milliseconds ago. ... A custom indexer, so that's a fairly common practice? So when you are dealing with these large indexes, do you try not to fully rebuild them when you can? It's not a nightly thing, but something to do in case of a disaster? Is there a difference in the performance of an index that was built all at once vs. one that has had delta inserts and updates applied over a period of months? Thank you for your insight. On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter wrote: > > : For a new project, I need to index about 20M records (30 fields) and I > : have been running into issues with MySQL disconnects, right around > : 15M. 
I've tried several remedies I've found on blogs, changing > > if you can provide some concrete error/log messages and the details of how > you are configuring your datasource that might help folks provide better > suggestions -- you've said you run into a problem but you haven't provided > any details for people to go on in giving you feedback. > > : resolved the issue. It got me wondering: Is this the way everyone does > : it? What about 100M records up to 1B; are those all pulled using DIH > : and a single query? > > I've only recently started using DIH, and while it definitely has a lot > of quirks/annoyances, it seems like a pretty good 80/20 solution for > indexing with Solr -- but that doesn't mean it's perfect for all > situations. > > Writing custom indexer code can certainly make sense in a lot of cases -- > particularly where you already have a data publishing system that you want > to tie into directly -- the trick is to ensure you have a decent strategy > for rebuilding the entire index should the need arise (but this is really > only an issue if your primary indexing solution is incremental -- many use > cases can be satisfied just fine with a brute force "full rebuild > periodically" implementation). > > > -Hoss >
Re: Indexing 20M documents from MySQL with DIH
: For a new project, I need to index about 20M records (30 fields) and I : have been running into issues with MySQL disconnects, right around : 15M. I've tried several remedies I've found on blogs, changing if you can provide some concrete error/log messages and the details of how you are configuring your datasource that might help folks provide better suggestions -- you've said you run into a problem but you haven't provided any details for people to go on in giving you feedback. : resolved the issue. It got me wondering: Is this the way everyone does : it? What about 100M records up to 1B; are those all pulled using DIH : and a single query? I've only recently started using DIH, and while it definitely has a lot of quirks/annoyances, it seems like a pretty good 80/20 solution for indexing with Solr -- but that doesn't mean it's perfect for all situations. Writing custom indexer code can certainly make sense in a lot of cases -- particularly where you already have a data publishing system that you want to tie into directly -- the trick is to ensure you have a decent strategy for rebuilding the entire index should the need arise (but this is really only an issue if your primary indexing solution is incremental -- many use cases can be satisfied just fine with a brute force "full rebuild periodically" implementation). -Hoss
Re: Indexing 20M documents from MySQL with DIH
Thanks for your response! I think the issue is that the records are being returned TOO fast from MySQL. I can dump them to CSV in about 30 minutes, but building the Solr index takes hours on the system I'm using. I may just need to use a more powerful Solr instance so it doesn't leave MySQL hanging for too long? What about autoCommit, does that factor into your import strategy?

2011/4/21 Robert Gründler : > we're indexing around 10M records from a mysql database into > a single solr core. > > The DataImportHandler needs to join 3 sub-entities to denormalize > the data. > > We've run into some troubles for the first 2 attempts, but setting > batchSize="-1" for the dataSource resolved the issues. > > Do you need a lot of complex joins to import the data from mysql? > > > > -robert > > > > > On 4/21/11 8:08 PM, Scott Bigelow wrote: >> >> I've been using Solr for a while now, indexing 2-4 million records >> using the DIH to pull data from MySQL, which has been working great. >> For a new project, I need to index about 20M records (30 fields) and I >> have been running into issues with MySQL disconnects, right around >> 15M. I've tried several remedies I've found on blogs, changing >> autoCommit, batchSize etc., and none of them seem to have >> resolved the issue. It got me wondering: Is this the way everyone does >> it? What about 100M records up to 1B; are those all pulled using DIH >> and a single query? >> >> I've used Sphinx in the past, which uses multiple queries to pull out >> a subset of records ranged based on primary key; does Solr offer >> functionality similar to this? It seems that once a Solr index gets to >> a certain size, the indexing of a batch takes longer than MySQL's >> net_write_timeout, so it kills the connection. >> >> Thanks for your help, I really enjoy using Solr and I look forward to >> indexing even more data! > >
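Alexey's earlier suggestion (disable autocommit, raise mergeFactor) lives in solrconfig.xml. A hedged sketch — the values are illustrative, not recommendations:

```xml
<!-- solrconfig.xml sketch: commenting out <autoCommit> disables automatic
     commits during the import (DIH issues a commit at the end anyway),
     and a larger mergeFactor defers segment merges that can stall
     indexing mid-import. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!--
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
  -->
</updateHandler>

<indexDefaults>
  <mergeFactor>30</mergeFactor>
</indexDefaults>
```

Fewer commits and deferred merges mean fewer long pauses on the Solr side, which is exactly when MySQL's net_write_timeout expires.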
Re: Indexing 20M documents from MySQL with DIH
we're indexing around 10M records from a mysql database into a single solr core. The DataImportHandler needs to join 3 sub-entities to denormalize the data. We've run into some troubles for the first 2 attempts, but setting batchSize="-1" for the dataSource resolved the issues. Do you need a lot of complex joins to import the data from mysql? -robert

On 4/21/11 8:08 PM, Scott Bigelow wrote: I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects, right around 15M. I've tried several remedies I've found on blogs, changing autoCommit, batchSize etc., and none of them seem to have resolved the issue. It got me wondering: Is this the way everyone does it? What about 100M records up to 1B; are those all pulled using DIH and a single query? I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged based on primary key; does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
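Robert's batchSize="-1" tip in context — a sketch in which everything except batchSize is a placeholder. With the MySQL driver, -1 maps to a fetch size of Integer.MIN_VALUE, which switches Connector/J into row-by-row streaming instead of buffering the entire result set in memory:

```xml
<dataConfig>
  <!-- Sketch: connection details, table, and field names are placeholders. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db-host/mydb"
              user="solr" password="..."
              batchSize="-1"/>
  <document>
    <entity name="doc" query="SELECT id, title, body FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
  </document>
</dataConfig>
```

Streaming avoids the OutOfMemory failures a multi-million-row SELECT would otherwise cause, though the connection must then stay healthy for the whole import.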
Indexing 20M documents from MySQL with DIH
I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects, right around 15M. I've tried several remedies I've found on blogs, changing autoCommit, batchSize etc., and none of them seem to have resolved the issue. It got me wondering: Is this the way everyone does it? What about 100M records up to 1B; are those all pulled using DIH and a single query? I've used Sphinx in the past, which uses multiple queries to pull out a subset of records ranged based on primary key; does Solr offer functionality similar to this? It seems that once a Solr index gets to a certain size, the indexing of a batch takes longer than MySQL's net_write_timeout, so it kills the connection. Thanks for your help, I really enjoy using Solr and I look forward to indexing even more data!
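The Sphinx-style ranged fetch described above isn't built into DIH, but the query shape is simple (sketch with invented table and column names). Because each slice is a short-lived query keyed on the primary key, no single connection has to outlive net_write_timeout while Solr digests a batch:

```sql
-- Discover the key range once:
SELECT MIN(id), MAX(id) FROM documents;

-- Then pull fixed-size slices keyed on the primary key:
SELECT id, title, body FROM documents WHERE id >= 1     AND id < 10001 ORDER BY id;
SELECT id, title, body FROM documents WHERE id >= 10001 AND id < 20001 ORDER BY id;
-- ...and so on, until MAX(id) is reached.
```

Unlike LIMIT/OFFSET paging, range predicates on an indexed key stay fast even deep into the table, since MySQL never has to scan and discard the skipped rows.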