RE: Solr 4.9 Calling DIH concurrently

2015-02-05 Thread meena.sri...@mathworks.com
Thanks James.
Your idea worked well (using multiple request handlers).
I will try and implement some code when I have some spare cycles. By the way,
by coding do you mean using the same request handler and somehow querying it
simultaneously? How is that possible?
Thanks
meena




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744p4184184.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
Suresh and Meena,

I have solved this problem by taking a row count on a query and adding its
modulo as another field called "threadid". The base query is wrapped in a
query that selects a subset of the results for indexing. The modulo on
the row number was intentional: you cannot rely on id columns to be well
distributed, and you cannot rely on the number of rows to stay constant over
time.

To make it more concrete, I have a base DataImportHandler configuration
that looks something like what's below - your SQL may differ as we use
Oracle.

 <entity name="medsite" dataSource="oltp01_prod"
         rootEntity="true"
         query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
                medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
         transformer="TemplateTransformer">
   ...
 </entity>


To get it to be multi-threaded, I then copy it to 4 different configuration
files as follows:

echo "Medical Sites Configuration - ${MEDSITES_CONF:=medical-sites-conf.xml}"
echo "Medical Sites Prototype - ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}"
for tid in `seq 0 3`; do
   MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE | sed -e s/%%d%%/$tid/`
   sed -e s/%%d%%/$tid/ $MEDSITES_CONF > $MEDSITES_OUT
done


Then, I have 4 requestHandlers in solrconfig.xml that point to each of
these files. They are /import/medical-sites-0 through
/import/medical-sites-3. Note that this wouldn't work with a single
Data Import Handler that was parameterized - a particular Data Import
Handler is either idle or busy, and cannot be run in multiple
threads. How this would work if the first entity weren't the root entity
is another question - you can usually structure it with the first SQL query
being the root entity if you are using SQL. XML is another story, however.
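
Concretely, the four handler declarations in solrconfig.xml look roughly
like this (a sketch; only the config file name differs between them):

 <requestHandler name="/import/medical-sites-0" class="solr.DataImportHandler">
   <lst name="defaults">
     <str name="config">medical-sites-0-conf.xml</str>
   </lst>
 </requestHandler>
 <!-- ...and likewise /import/medical-sites-1 through /import/medical-sites-3,
      each pointing at its own generated medical-sites-N-conf.xml -->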

I did it this way because I wanted to stay with Solr out-of-the-box,
because it was an evaluation of what Data Import Handler could do. If I
were doing this without some business requirement to evaluate whether Solr
out-of-the-box could do a multithreaded database import, I'd probably write
a multi-threaded front-end that did the queries and transformations I
needed to do. In this case, I was considering the best way to do all
our data imports from RDBMS, and Data Import Handler is the only good
solution that involves writing configuration, not code. The distinction
is slight, I think.

Hope this helps,

Dan Davis


RE: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dyer, James
Yes, that is what I mean.  In my case, for each /dataimport in the "defaults"
section, I also put something like this:

<str name="currentPartition">1</str>

...and then reference it in the data-config.xml with
${dataimporter.request.currentPartition}. This way the same data-config.xml
can be used for each handler.
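
Putting the pieces together, a pair of such handlers would look roughly
like this (a sketch; the handler names and the partition count of 8 are
illustrative):

<requestHandler name="/dataimport1" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
    <str name="numPartitions">8</str>
    <str name="currentPartition">1</str>
  </lst>
</requestHandler>
<requestHandler name="/dataimport2" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
    <str name="numPartitions">8</str>
    <str name="currentPartition">2</str>
  </lst>
</requestHandler>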

As I said before, while this works (and this is what I do in production), it 
seems generally preferable to write code for this use-case.

James Dyer
Ingram Content Group






Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
"Data Import Handler is the only good solution that involves writing
configuration, not code." - I also had a requirement not to look at
product-oriented enhancements to Solr, and there are many products I didn't
look at, or rejected, like django-haystack. Perl, Ruby, and Python have
good handling of both databases and Solr, as does Java with JDBC and SolrJ.
Pushing to Solr probably has more legs than Data Import Handler going
forward.


Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Mikhail Khludnev
Suresh,

There are a few common workarounds for such a problem. But I think that
submitting more than maxIndexingThreads is not really productive. Also, I
think that the out-of-memory problem is caused not by indexing, but by opening
a searcher. Do you really need to open it? I don't think it's a good idea to
search on an instance which is cooking a multi-terabyte index at the same
time. Are you sure you don't issue superfluous commits, and have you disabled
auto-commit?

Let's nail down the OOM problem first, and then deal with indexing speedup. I
like huge indices!
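
In solrconfig.xml terms, the relevant knobs look roughly like this during a
bulk load (a sketch; the 60-second interval is illustrative - either disable
autoCommit entirely, or at least make sure commits never open a searcher):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>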






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


RE: Solr 4.9 Calling DIH concurrently

2015-02-03 Thread Dyer, James
DIH is single-threaded.  There was once a threaded option, but it was buggy
and subsequently was removed.

What I do is partition my data and run multiple DIH request handlers at the
same time.  It means redundant sections in solrconfig.xml, and it's not very
elegant, but it works.

For instance, for an SQL query, I add something like this: where mod(id,
${dataimporter.request.numPartitions})=${dataimporter.request.currentPartition}.
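
In data-config.xml, that partitioned query looks roughly like this (a sketch;
the table and column names are illustrative):

<entity name="item" dataSource="db"
        query="SELECT id, title FROM item
               WHERE mod(id, ${dataimporter.request.numPartitions})
                   = ${dataimporter.request.currentPartition}">
  ...
</entity>

Each handler is then invoked with its own partition, e.g.
command=full-import with numPartitions=8 and currentPartition=3, or the
partition numbers can be fixed in each handler's defaults.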

I think, though, most users who want to make the most out of multithreading
write their own program and use the SolrJ API to send the updates.

James Dyer
Ingram Content Group


-Original Message-
From: meena.sri...@mathworks.com [mailto:meena.sri...@mathworks.com] 
Sent: Tuesday, February 03, 2015 3:43 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.9 Calling DIH concurrently

Hi

I am using Solr 4.9 and need to index millions of documents from a database.
I am using DIH and sending requests to fetch by ids. Is there a way to run
multiple indexing threads concurrently in DIH?
I want to take advantage of the
maxIndexingThreads
parameter. How do I do it? I am just invoking the DIH handler using the SolrJ
HttpSolrServer, and issuing requests sequentially:
http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=100&minId=1

http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=201&minId=101





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744.html
Sent from the Solr - User mailing list archive at Nabble.com.




RE: Solr 4.9 Calling DIH concurrently

2015-02-03 Thread Arumugam, Suresh
We are also facing the same problem in loading 14 billion documents into Solr
4.8.10.

Dataimport is running single-threaded, which is taking more than 3 weeks.
This works fine without any issues, but it takes months to complete the
load.

When we tried SolrJ with the below configuration in a multithreaded load,
Solr is taking more memory & at one point we end up out of memory as
well.

Batch Doc count  :  10 docs
No of Threads  : 16/32

Solr Memory Allocated : 200 GB

The reason can be as below:

Solr takes a snapshot whenever we open a SearchIndexer.
Due to this, more memory is consumed & Solr is extremely slow
while running 16 or more threads for loading.

If anyone has already done a multithreaded data load into Solr in a quicker
way, can you please share the code or logic in using the SolrJ API?

Thanks in advance.

Regards,
Suresh.A





RE: Solr 4.9 Calling DIH concurrently

2015-02-03 Thread meena.sri...@mathworks.com
Thanks James. After lots of searching and reading, I think I now understand a
little from your answer.
If I understand correctly, my solrconfig.xml will have sections like this:

<requestHandler name="/dataimport1" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>

<requestHandler name="/dataimport2" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>

...

<requestHandler name="/dataimport8" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config1.xml</str>
    </lst>
  </requestHandler>


Is this correct? If it's true, then I can call 8 such requests with

<maxIndexingThreads>8</maxIndexingThreads>

and Solr will commit data when the

<ramBufferSizeMB>100</ramBufferSizeMB>

of 100MB is reached per thread.
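
For reference, both of those settings go in the <indexConfig> section of
solrconfig.xml, roughly like this (values as above):

<indexConfig>
  <maxIndexingThreads>8</maxIndexingThreads>
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>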

Thanks again for your time.

Thanks
Meena






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744p4183750.html
Sent from the Solr - User mailing list archive at Nabble.com.