Re: Dataimport performance
On 6/7/2018 12:19 AM, kotekaman wrote: sorry. may i know how to code it? Code *what*? Here's the same wiki page that I gave you for your last message: https://wiki.apache.org/solr/UsingMailingLists Even if I go to the Nabble website and discover that you've replied to a topic that's SEVEN AND A HALF YEARS OLD, that information doesn't help me understand exactly what it is you want to know. The previous information in the topic is a question and answer about what kind of performance can be expected from the dataimport handler. There's nothing about coding in it. Thanks, Shawn
Re: Dataimport performance
sorry. may i know how to code it? -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Dataimport performance
On 19.12.2010, at 23:30, Alexey Serba wrote: > > Also Ephraim proposed a really neat solution with GROUP_CONCAT, but > I'm not sure that all RDBMS-es support that. Thats MySQL only syntax. But if you google you can find similar solution for other RDBMS. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: Dataimport performance
> With subquery and with left join: 320k in 6 Min 30 It's 820 records per second. It's _really_ impressive considering the fact that DIH performs separate sql query for every record in your case. >> So there's one track entity with an artist sub-entity. My (admittedly >> rather limited) experience has been that sub-entities, where you have >> to run a separate query for every row in the parent entity, really >> slow down data import. Sub entities slows down data import indeed. You can try to avoid separate query for every row by using CachedSqlEntityProcessor. There are couple of options - 1) you can load all sub-entity data in memory or 2) you can reduce the number of sql queries by caching sub entity data per id. There's no silver bullet and each option has its own pros and cons. Also Ephraim proposed a really neat solution with GROUP_CONCAT, but I'm not sure that all RDBMS-es support that. 2010/12/15 Robert Gründler : > i've benchmarked the import already with 500k records, one time without the > artists subquery, and one time without the join in the main query: > > > Without subquery: 500k in 3 min 30 sec > > Without join and without subquery: 500k in 2 min 30. > > With subquery and with left join: 320k in 6 Min 30 > > > so the joins / subqueries are definitely a bottleneck. > > How exactly did you implement the custom data import? > > In our case, we need to de-normalize the relations of the sql data for the > index, > so i fear i can't really get rid of the join / subquery. > > > -robert > > > > > > On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > >> 2010/12/15 Robert Gründler : >>> The data-config.xml looks like this (only 1 entity): >>> >>> >>> >>> >>> >>> >>> >>> >> name="sf_unique_id"/> >>> >>> >>> >>> >>> >>> >> >> So there's one track entity with an artist sub-entity. My (admittedly >> rather limited) experience has been that sub-entities, where you have >> to run a separate query for every row in the parent entity, really >> slow down data import. For my own purposes, I wrote a custom data >> import using SolrJ to improve the performance (from 3 hours to 10 >> minutes). >> >> Just as a test, how long does it take if you comment out the artists entity? > >
Re: Dataimport performance
Hi, LuSqlv2 beta comes out in the next few weeks, and is designed to address this issue (among others). LuSql original (http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql now moved to: https://code.google.com/p/lusql/) is a JDBC-->Lucene high performance loader. You may have seen my posts on this list suggesting LuSql as high performance alternative to DIH, for a subset of use cases. LuSqlV2 has evolved into a full extract-transform-load (ETL) high performance engine, focusing on many of the issues of interest to the Lucene/SOLR community. It has a pipelined, pluggable, multithreaded architecture. It is basically: pluggable source --> 0 or more pluggable filters --> pluggable sink Source plugins implemented: - JDBC, Lucene, SOLR (SolrJ), BDB, CSV, RMI, Java Serialization Sink plugins implemented: - JDBC, Lucene, SOLR (SolrJ), BDB, XML, RMI, Java Serialization, Tee, NullSink [I am working on a memcached Sink] A number of different filters implemented (i.e. get PDF file from filesystem based on SQL field and convert & get test, etc) including: BDBJoinFIlter, JDBCJoinFilter -- This particular problem is one of the unit tests I have: given a simple database of: 1- table Name 2- table City 3- table nameCityJoin 4- table Job 5- table nameJobJoin run a JDBC-->BDB LuSql instance each for of City+nameCityJoin and Job+nameJobJoin; then run a JDBC-->SolrJ on table Name, adding 2 BDBJoinFIlters, each which take the BDB generated earlier and do the join (you just tell the filters which field from the JDBC-generated to use against the BDB key). So your use case use a larger example of this. Also of interest: - Java RMI (Remote Method Invocation): both an RMISink(Server) and RMISource(Client) are implemented. This means you can set up N machines which are doing something, and have one or more clients (on their own machines) that are pulling this data and doing something with it. For example, JDBC-->PDFToTextFilter-->RMI (converting PDF files to text based on the contents of a SQL database, with text files in the file system): basically doing some heavy lifting, and then start up an RMI-->SolrJ (or Lucene) which is a client to the N PDF converting machines, doing only the Lucene/SOLR indexing. The client does a pull when it needs more data. You can have N servers x M clients! Oh, string fields length > 1024 are automatically gzipped by the RMI Sink(Server), to reduce network (at the cost of cpu: selectable). I am looking into RMI alternatives, like Thrift, ProtoBuf for my next Source/Sinks to implement. Another example is the reverse use case: when the indexing is more expensive getting the data. Example: One JDBC-->RMISink(Server) instance, N RMISource(Client)-->Lucene instances; this allows multiple Lucenes to be fed from a single JDBC source, across machines. - TeeSink: the Tee sink hides N sinks, so you can split the pipeline into multiple Sinks. I've used it to send the same content to Lucene as well as BDB in one fell swoop. Can you say index and content store in one step? I am working on cleaning up the code, writing docs (I made the mistake of making great docs for LusqlV1, so I have work to do...!), and making a couple more tests. I will announce the beta on this and the Lucene list. If you have any questions, please contact me. Thanks, Glen Newton http://zzzoot.blogspot.com --> Old LuSql benchmarks: http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James wrote: > We have ~50 long-running SQL queries that need to be joined and denormalized. > Not all of the queries are to the same db, and some data comes from > fixed-width data feeds. Our current search engine (that we are converting to > SOLR) has a fast disk-caching mechanism that lets you cache all of these data > sources and then it will join them locally prior to indexing. > > I'm in the process of developing something similar for DIH that uses the > Berkley db to do the same thing. Its good enough that I can do nightly full > re-indexes of all our data while developing the front-end, but it is still > very rough. Possibly I would like to get this refined enough to eventually > submit as a jira ticket / patch as it seems this is a somewhat common problem > that needs solving. > > Even with our current search engine, the join & denormalize step is always > the longest-running part of the process. However, I have it running fairly > fast by partitioning the data by a modulus of the primary key and then > running several jobs in parallel. The trick is not to get I/O bound. Things > run fast if you can set it up to maximize CPU. > > James Dyer > E-Commerce Systems > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Ephraim Ofir [mailto:ephra...@icq.com] > Sent
RE: Dataimport performance
We have ~50 long-running SQL queries that need to be joined and denormalized. Not all of the queries are to the same db, and some data comes from fixed-width data feeds. Our current search engine (that we are converting to SOLR) has a fast disk-caching mechanism that lets you cache all of these data sources and then it will join them locally prior to indexing. I'm in the process of developing something similar for DIH that uses the Berkley db to do the same thing. Its good enough that I can do nightly full re-indexes of all our data while developing the front-end, but it is still very rough. Possibly I would like to get this refined enough to eventually submit as a jira ticket / patch as it seems this is a somewhat common problem that needs solving. Even with our current search engine, the join & denormalize step is always the longest-running part of the process. However, I have it running fairly fast by partitioning the data by a modulus of the primary key and then running several jobs in parallel. The trick is not to get I/O bound. Things run fast if you can set it up to maximize CPU. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Ephraim Ofir [mailto:ephra...@icq.com] Sent: Thursday, December 16, 2010 3:04 AM To: solr-user@lucene.apache.org Subject: RE: Dataimport performance Check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e This approach of not using sub entities really improved our load time. Ephraim Ofir -Original Message- From: Robert Gründler [mailto:rob...@dubture.com] Sent: Wednesday, December 15, 2010 4:49 PM To: solr-user@lucene.apache.org Subject: Re: Dataimport performance i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec Without join and without subquery: 500k in 2 min 30. With subquery and with left join: 320k in 6 Min 30 so the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery. -robert On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > 2010/12/15 Robert Gründler : >> The data-config.xml looks like this (only 1 entity): >> >> >> >> >> >> >> >>> name="sf_unique_id"/> >> >> >> >> >> >> > > So there's one track entity with an artist sub-entity. My (admittedly > rather limited) experience has been that sub-entities, where you have > to run a separate query for every row in the parent entity, really > slow down data import. For my own purposes, I wrote a custom data > import using SolrJ to improve the performance (from 3 hours to 10 > minutes). > > Just as a test, how long does it take if you comment out the artists entity?
RE: Dataimport performance
Check out http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e This approach of not using sub entities really improved our load time. Ephraim Ofir -Original Message- From: Robert Gründler [mailto:rob...@dubture.com] Sent: Wednesday, December 15, 2010 4:49 PM To: solr-user@lucene.apache.org Subject: Re: Dataimport performance i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec Without join and without subquery: 500k in 2 min 30. With subquery and with left join: 320k in 6 Min 30 so the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery. -robert On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > 2010/12/15 Robert Gründler : >> The data-config.xml looks like this (only 1 entity): >> >> >> >> >> >> >> >>> name="sf_unique_id"/> >> >> >> >> >> >> > > So there's one track entity with an artist sub-entity. My (admittedly > rather limited) experience has been that sub-entities, where you have > to run a separate query for every row in the parent entity, really > slow down data import. For my own purposes, I wrote a custom data > import using SolrJ to improve the performance (from 3 hours to 10 > minutes). > > Just as a test, how long does it take if you comment out the artists entity?
Re: Dataimport performance
Can you do just one join in the top-level query? The DIH does not have a batching mechanism for these joins, but your database does. On Wed, Dec 15, 2010 at 7:11 AM, Tim Heckman wrote: > The custom import I wrote is a java application that uses the SolrJ > library. Basically, where I had sub-entities in the DIH config I did > the mappings inside my java code. > > 1. Identify a subset or "chunk" of the primary id's to work on (so I > don't have to load everything into memory at once) and put those in a > temp table. I used a modulus on the id. > 2. Select all of the outer entity from the database (joining on the > id's in the temp table), and load the data from that result set into > new solr input documents. I keep these in a hash map keyed on the > id's. > 3. Then select all of the inner entity, joining on the id's from the > temp table. The result set has to include the id's from step 2. I go > through this result set and load the data into the matching solr input > documents from step 2. > 4. Push that set of input documents to solr (optionally committing > them), then go back to step 1 using the next subset or chunk. > > Not sure if this is the absolute best approach, but it's working well > enough for my specific case. > > Tim > > > 2010/12/15 Robert Gründler : >> i've benchmarked the import already with 500k records, one time without the >> artists subquery, and one time without the join in the main query: >> >> >> Without subquery: 500k in 3 min 30 sec >> >> Without join and without subquery: 500k in 2 min 30. >> >> With subquery and with left join: 320k in 6 Min 30 >> >> >> so the joins / subqueries are definitely a bottleneck. >> >> How exactly did you implement the custom data import? >> >> In our case, we need to de-normalize the relations of the sql data for the >> index, >> so i fear i can't really get rid of the join / subquery. >> >> >> -robert >> >> >> >> >> >> On Dec 15, 2010, at 15:43 , Tim Heckman wrote: >> >>> 2010/12/15 Robert Gründler : The data-config.xml looks like this (only 1 entity): >>> name="sf_unique_id"/> >>> >>> So there's one track entity with an artist sub-entity. My (admittedly >>> rather limited) experience has been that sub-entities, where you have >>> to run a separate query for every row in the parent entity, really >>> slow down data import. For my own purposes, I wrote a custom data >>> import using SolrJ to improve the performance (from 3 hours to 10 >>> minutes). >>> >>> Just as a test, how long does it take if you comment out the artists entity? >> >> > -- Lance Norskog goks...@gmail.com
Re: Dataimport performance
The custom import I wrote is a java application that uses the SolrJ library. Basically, where I had sub-entities in the DIH config I did the mappings inside my java code. 1. Identify a subset or "chunk" of the primary id's to work on (so I don't have to load everything into memory at once) and put those in a temp table. I used a modulus on the id. 2. Select all of the outer entity from the database (joining on the id's in the temp table), and load the data from that result set into new solr input documents. I keep these in a hash map keyed on the id's. 3. Then select all of the inner entity, joining on the id's from the temp table. The result set has to include the id's from step 2. I go through this result set and load the data into the matching solr input documents from step 2. 4. Push that set of input documents to solr (optionally committing them), then go back to step 1 using the next subset or chunk. Not sure if this is the absolute best approach, but it's working well enough for my specific case. Tim 2010/12/15 Robert Gründler : > i've benchmarked the import already with 500k records, one time without the > artists subquery, and one time without the join in the main query: > > > Without subquery: 500k in 3 min 30 sec > > Without join and without subquery: 500k in 2 min 30. > > With subquery and with left join: 320k in 6 Min 30 > > > so the joins / subqueries are definitely a bottleneck. > > How exactly did you implement the custom data import? > > In our case, we need to de-normalize the relations of the sql data for the > index, > so i fear i can't really get rid of the join / subquery. > > > -robert > > > > > > On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > >> 2010/12/15 Robert Gründler : >>> The data-config.xml looks like this (only 1 entity): >>> >>> >>> >>> >>> >>> >>> >>> >> name="sf_unique_id"/> >>> >>> >>> >>> >>> >>> >> >> So there's one track entity with an artist sub-entity. My (admittedly >> rather limited) experience has been that sub-entities, where you have >> to run a separate query for every row in the parent entity, really >> slow down data import. For my own purposes, I wrote a custom data >> import using SolrJ to improve the performance (from 3 hours to 10 >> minutes). >> >> Just as a test, how long does it take if you comment out the artists entity? > >
Re: Dataimport performance
i've benchmarked the import already with 500k records, one time without the artists subquery, and one time without the join in the main query: Without subquery: 500k in 3 min 30 sec Without join and without subquery: 500k in 2 min 30. With subquery and with left join: 320k in 6 Min 30 so the joins / subqueries are definitely a bottleneck. How exactly did you implement the custom data import? In our case, we need to de-normalize the relations of the sql data for the index, so i fear i can't really get rid of the join / subquery. -robert On Dec 15, 2010, at 15:43 , Tim Heckman wrote: > 2010/12/15 Robert Gründler : >> The data-config.xml looks like this (only 1 entity): >> >> >> >> >> >> >> >>> name="sf_unique_id"/> >> >> >> >> >> >> > > So there's one track entity with an artist sub-entity. My (admittedly > rather limited) experience has been that sub-entities, where you have > to run a separate query for every row in the parent entity, really > slow down data import. For my own purposes, I wrote a custom data > import using SolrJ to improve the performance (from 3 hours to 10 > minutes). > > Just as a test, how long does it take if you comment out the artists entity?
Re: Dataimport performance
2010/12/15 Robert Gründler : > The data-config.xml looks like this (only 1 entity): > > > > > > > > name="sf_unique_id"/> > > > > > > So there's one track entity with an artist sub-entity. My (admittedly rather limited) experience has been that sub-entities, where you have to run a separate query for every row in the parent entity, really slow down data import. For my own purposes, I wrote a custom data import using SolrJ to improve the performance (from 3 hours to 10 minutes). Just as a test, how long does it take if you comment out the artists entity?
Re: Dataimport performance
We are currently running Solr 4.x from trunk. -d64 -Xms10240M -Xmx10240M Total Rows Fetched: 24935988 Total Documents Skipped: 0 Total Documents Processed: 24568997 Time Taken: 5:55:19.104 24.5 Million Docs as XML from filesystem with less than 6 hours. May be your MySQL is the bottleneck? Regards Bernd Am 15.12.2010 14:40, schrieb Robert Gründler: > Hi, > > we're looking for some comparison-benchmarks for importing large tables from > a mysql database (full import). > > Currently, a full-import of ~ 8 Million rows from a MySQL database takes > around 3 hours, on a QuadCore Machine with 16 GB of > ram and a Raid 10 storage setup. Solr is running on a apache tomcat instance, > where it is the only app. The tomcat instance > has the following memory-related java_opts: > > -Xms4096M -Xmx5120M > > > The data-config.xml looks like this (only 1 entity): > > > > > > > > name="sf_unique_id"/> > > > > > > > > > We have the feeling that 3 hours for this import is quite long - regarding > the performance of the server running solr/mysql. > > Are we wrong with that assumption, or do people experience similar import > times with this amount of data to be imported? > > > thanks! > > > -robert > > > -- * Bernd FehlingUniversitätsbibliothek Bielefeld Dipl.-Inform. (FH)Universitätsstr. 25 Tel. +49 521 106-4060 Fax. +49 521 106-4052 bernd.fehl...@uni-bielefeld.de33615 Bielefeld BASE - Bielefeld Academic Search Engine - www.base-search.net *
Re: Dataimport performance
> What version of Solr are you using? Solr Specification Version: 1.4.1 Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42 Lucene Specification Version: 2.9.3 Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55 -robert > > Adam > > 2010/12/15 Robert Gründler > >> Hi, >> >> we're looking for some comparison-benchmarks for importing large tables >> from a mysql database (full import). >> >> Currently, a full-import of ~ 8 Million rows from a MySQL database takes >> around 3 hours, on a QuadCore Machine with 16 GB of >> ram and a Raid 10 storage setup. Solr is running on a apache tomcat >> instance, where it is the only app. The tomcat instance >> has the following memory-related java_opts: >> >> -Xms4096M -Xmx5120M >> >> >> The data-config.xml looks like this (only 1 entity): >> >> >> >> >> >> >> >> > name="sf_unique_id"/> >> >> >> >> >> >> >> >> >> We have the feeling that 3 hours for this import is quite long - regarding >> the performance of the server running solr/mysql. >> >> Are we wrong with that assumption, or do people experience similar import >> times with this amount of data to be imported? >> >> >> thanks! >> >> >> -robert >> >> >> >>
Re: Dataimport performance
You're adding on the order of 750 rows (docs)/second, which isn't bad... have you profiled the machine as this runs? Even just with top (assuming unix)... because the very first question is always "what takes the time, getting the data from MySQL or indexing or I/O?". If you aren't maxing out your CPU, then you probably want to explore the other questions (db query speed, network latency) to get a sense whether you're going as fast as you can or not... Best Erick 2010/12/15 Robert Gründler > Hi, > > we're looking for some comparison-benchmarks for importing large tables > from a mysql database (full import). > > Currently, a full-import of ~ 8 Million rows from a MySQL database takes > around 3 hours, on a QuadCore Machine with 16 GB of > ram and a Raid 10 storage setup. Solr is running on a apache tomcat > instance, where it is the only app. The tomcat instance > has the following memory-related java_opts: > > -Xms4096M -Xmx5120M > > > The data-config.xml looks like this (only 1 entity): > > > > > > > > name="sf_unique_id"/> > > > > > > > > > We have the feeling that 3 hours for this import is quite long - regarding > the performance of the server running solr/mysql. > > Are we wrong with that assumption, or do people experience similar import > times with this amount of data to be imported? > > > thanks! > > > -robert > > > >
Re: Dataimport performance
What version of Solr are you using? Adam 2010/12/15 Robert Gründler > Hi, > > we're looking for some comparison-benchmarks for importing large tables > from a mysql database (full import). > > Currently, a full-import of ~ 8 Million rows from a MySQL database takes > around 3 hours, on a QuadCore Machine with 16 GB of > ram and a Raid 10 storage setup. Solr is running on a apache tomcat > instance, where it is the only app. The tomcat instance > has the following memory-related java_opts: > > -Xms4096M -Xmx5120M > > > The data-config.xml looks like this (only 1 entity): > > > > > > > > name="sf_unique_id"/> > > > > > > > > > We have the feeling that 3 hours for this import is quite long - regarding > the performance of the server running solr/mysql. > > Are we wrong with that assumption, or do people experience similar import > times with this amount of data to be imported? > > > thanks! > > > -robert > > > >