RE: Solr 4.9 Calling DIH concurrently
Thanks James. Your idea worked well (using multiple request handlers). I will try and implement some code when I have some spare cycles. By the way, by coding do you mean using the same request handler and somehow querying it simultaneously? How is that possible?

Thanks,
Meena

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744p4184184.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.9 Calling DIH concurrently
Suresh and Meena,

I have solved this problem by taking a row count on a query and adding its modulo as another field called threadid. The base query is wrapped in a query that selects a subset of the results for indexing. The modulo on the row number was intentional - you cannot rely on id columns to be well distributed, and you cannot rely on the number of rows to stay constant over time.

To make it more concrete, I have a base DataImportHandler configuration that looks something like what's below - your SQL may differ, as we use Oracle:

  <entity name="medsite" dataSource="oltp01_prod" rootEntity="true"
          query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid
                 FROM medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
          transformer="TemplateTransformer" ... />

To get it to be multi-threaded, I then copy it to 4 different configuration files as follows:

  echo "Medical Sites Configuration - ${MEDSITES_CONF:=medical-sites-conf.xml}"
  echo "Medical Sites Prototype - ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
  for tid in `seq 0 3`; do
      MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e "s/%%d%%/$tid/"`
      sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
  done

Then, I have 4 requestHandlers in solrconfig.xml that point to each of these files. They are /import/medical-sites-0 through /import/medical-sites-3.

Note that this wouldn't work with a single Data Import Handler that was parameterized - a particular Data Import Handler is either idle or busy, and cannot be run in multiple threads. How this would work if the first entity weren't the root entity is another question - you can usually structure it with the first SQL query being the root entity if you are using SQL. XML is another story, however. I did it this way because I wanted to stay with Solr out-of-the-box, since this was an evaluation of what Data Import Handler could do.
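The four request handlers Dan refers to are not shown in his message. A sketch of what they might look like - handler and file names are inferred from his /import/medical-sites-N naming and the prototype file names above, and the class name follows the other configs in this thread:

```xml
<!-- Sketch only: one request handler per generated configuration file -->
<requestHandler name="/import/medical-sites-0" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">medical-sites-0-conf.xml</str>
  </lst>
</requestHandler>
<!-- ...and likewise /import/medical-sites-1 through /import/medical-sites-3,
     each pointing at its own medical-sites-N-conf.xml -->
```

Since each handler owns a distinct config file, each can run a full-import at the same time without contending for the same handler instance.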
If I were doing this without a business requirement to evaluate whether Solr out-of-the-box could do a multithreaded database import, I'd probably write a multi-threaded front-end that did the queries and transformations I needed to do. In this case, I was considering the best way to do all our data imports from RDBMS, and Data Import Handler is the only good solution that involves writing configuration, not code. The distinction is slight, I think.

Hope this helps,

Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote: [...]
RE: Solr 4.9 Calling DIH concurrently
Yes, that is what I mean. In my case, for each /dataimport handler, in the defaults section I also put something like this:

  <str name="currentPartition">1</str>

...and then reference it in the data-config.xml with ${dataimporter.request.currentPartition}. This way the same data-config.xml can be used for each handler. As I said before, while this works (and this is what I do in production), it seems generally preferable to write code for this use-case.

James Dyer
Ingram Content Group

-----Original Message-----
From: meena.sri...@mathworks.com [mailto:meena.sri...@mathworks.com]
Sent: Tuesday, February 03, 2015 4:24 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 4.9 Calling DIH concurrently
[...]
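Combining James's two snippets, a set of partitioned handlers sharing a single data-config might look like the sketch below. The handler names, the numPartitions default, and db-data-config.xml are illustrative assumptions, not taken from his message:

```xml
<!-- Sketch: each handler bakes its partition number into its defaults,
     so one shared data-config.xml serves all of them. Note the SQL
     mod() yields values 0 .. numPartitions-1, so currentPartition
     should run 0..7 across the eight handlers. -->
<requestHandler name="/dataimport1" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
    <str name="numPartitions">8</str>
    <str name="currentPartition">0</str>
  </lst>
</requestHandler>
<!-- ...repeat through /dataimport8, changing only currentPartition;
     the shared SQL then filters with
     where mod(id, ${dataimporter.request.numPartitions}) =
           ${dataimporter.request.currentPartition} -->
```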
Re: Solr 4.9 Calling DIH concurrently
> Data Import Handler is the only good solution that involves writing configuration, not code.

I also had a requirement not to look at product-oriented enhancements to Solr, and there are many products I didn't look at, or rejected, like django-haystack. Perl, Ruby, and Python have good handling of both databases and Solr, as does Java with JDBC and SolrJ. Pushing to Solr probably has more legs than Data Import Handler going forward.

On Wed, Feb 4, 2015 at 11:13 AM, Dan Davis <dansm...@gmail.com> wrote: [...]
Re: Solr 4.9 Calling DIH concurrently
Suresh,

There are a few common workarounds for such a problem. But I think that submitting more than maxIndexingThreads is not really productive. Also, I think that the out-of-memory problem is caused not by indexing, but by opening a searcher. Do you really need to open it? I don't think it's a good idea to search on an instance which is cooking a many-terabyte index at the same time. Are you sure you don't issue superfluous commits, and that you've disabled auto-commit? Let's nail down the OOM problem first, and then deal with indexing speedup. I like huge indices!

On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh <suresh.arumu...@emc.com> wrote: [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
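Mikhail's commit questions correspond to update-handler settings in solrconfig.xml. A sketch under assumptions (the element names are standard Solr config; the timing value is illustrative, not from this thread): keep the hard auto-commit for durability but stop it from opening a searcher, and avoid soft commits entirely during a bulk load:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Flush to disk periodically for durability; openSearcher=false
       avoids building a new searcher (and warming its caches)
       on every commit during the load -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Omit <autoSoftCommit> while loading; issue one explicit commit
       (which does open a searcher) only after the import finishes -->
</updateHandler>
```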
RE: Solr 4.9 Calling DIH concurrently
DIH is single-threaded. There was once a threaded option, but it was buggy and was subsequently removed. What I do is partition my data and run multiple DIH request handlers at the same time. It means redundant sections in solrconfig.xml, and it's not very elegant, but it works. For instance, for a SQL query, I add something like this:

  where mod(id, ${dataimporter.request.numPartitions}) = ${dataimporter.request.currentPartition}

I think, though, most users who want to make the most out of multithreading write their own program and use the SolrJ API to send the updates.

James Dyer
Ingram Content Group

-----Original Message-----
From: meena.sri...@mathworks.com [mailto:meena.sri...@mathworks.com]
Sent: Tuesday, February 03, 2015 3:43 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.9 Calling DIH concurrently

Hi,

I am using Solr 4.9 and need to index millions of documents from a database. I am using DIH and sending requests to fetch by ids. Is there a way to run multiple indexing threads concurrently in DIH? I want to take advantage of the maxIndexingThreads parameter. How do I do it? I am just invoking the DIH handler using the solrj HttpSolrServer, and issue requests sequentially:

  http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=100&minId=1
  http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=201&minId=101
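James's partitioned handlers can be fired concurrently from client code rather than sequentially as in Meena's loop. The Python sketch below is not from the thread: it builds one full-import URL per partition (reusing the numPartitions/currentPartition request parameters James mentions and the host/core from Meena's example URLs; the /dataimport1..N handler names are an assumption) and dispatches them from a thread pool. The fetch callable stands in for an HTTP GET so the dispatch logic can be exercised without a running Solr.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:8983/solr/db"  # host and core from Meena's example URLs

def partition_urls(num_partitions):
    # One full-import request per partition. Handler names (/dataimport1..N)
    # are an assumption; clean=false so no handler wipes the others' work.
    return [
        f"{BASE}/dataimport{p + 1}?command=full-import&clean=false"
        f"&numPartitions={num_partitions}&currentPartition={p}"
        for p in range(num_partitions)
    ]

def run_partitions(num_partitions, fetch):
    # 'fetch' stands in for an HTTP GET (e.g. urllib.request.urlopen),
    # injected so this is testable without a live server.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(fetch, partition_urls(num_partitions)))
```

In practice fetch would issue the GET and the caller would then poll each handler's status URL until all imports report idle.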
RE: Solr 4.9 Calling DIH concurrently
We are also facing the same problem in loading 14 billion documents into Solr 4.8.10. Dataimport is working single-threaded, which is taking more than 3 weeks. This works fine without any issues, but it takes months to complete the load.

When we tried SolrJ with the below configuration for a multithreaded load, Solr takes more memory, and at one point we end up out of memory as well.

  Batch doc count : 10 docs
  No of threads : 16/32
  Solr memory allocated : 200 GB

The reason can be as below: Solr is taking a snapshot whenever we open a SearchIndexer. Due to this, more memory is consumed and Solr is extremely slow while running 16 or more threads for loading.

If anyone has already done a multithreaded data load into Solr in a quicker way, can you please share the code or logic for using the SolrJ API? Thanks in advance.

Regards,
Suresh.A

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingramcontent.com]
Sent: Tuesday, February 03, 2015 1:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr 4.9 Calling DIH concurrently
[...]
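SolrJ itself is Java, but the load logic Suresh asks about is language-neutral. Below is a hedged Python sketch of one common shape, not from this thread: split the ids into batches, push batches from a small worker pool, and commit exactly once at the end, per Mikhail's point elsewhere in this thread about superfluous commits. The send and commit callables are stand-ins for a real Solr client's add() and commit().

```python
from concurrent.futures import ThreadPoolExecutor

def batches(ids, batch_size):
    # Split the id list into fixed-size batches (last one may be short).
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def load(ids, batch_size, num_threads, send, commit):
    # Workers send batches concurrently; commit exactly once at the end
    # instead of per batch, since each searcher-opening commit costs memory.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(send, batches(ids, batch_size)))
    commit()
```

A larger batch size than 10 documents per request (thousands, typically) usually matters as much as the thread count here, since each tiny request pays full HTTP and update-processing overhead.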
RE: Solr 4.9 Calling DIH concurrently
Thanks James. After lots of searching and reading, I think I now understand a little from your answer. If I understand correctly, my solrconfig.xml will have sections like this:

  <requestHandler name="/dataimport1" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>
  <requestHandler name="/dataimport2" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>
  ...
  <requestHandler name="/dataimport8" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>

Is this correct? If it's true, then I can call 8 such requests with <maxIndexingThreads>8</maxIndexingThreads>, and Solr will commit data when the <ramBufferSizeMB>100</ramBufferSizeMB> of 100 MB is reached per thread. Thanks again for your time.

Thanks,
Meena

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744p4183750.html
Sent from the Solr - User mailing list archive at Nabble.com.