Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu 

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings wrote:
> 
> Another thing to mention is to make sure the indexer you build doesn't send
> commits until it's actually done.  Made that mistake with some early
> in-house indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention: make sure the indexer you build doesn't send
commits until it's actually done. I made that mistake with some early
in-house indexers.
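
A minimal sketch of that pattern, assuming documents are posted as JSON to a core's /update handler. The URL, batch size, and the post_json helper are illustrative assumptions, not anything from the poster's setup. Batches go out without a commit parameter and a single commit command follows at the end. Any autoCommit configured on the server is separate and still applies.

```python
import json
import urllib.request


def post_json(url, payload):
    """POST a JSON payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_with_single_commit(docs, update_url, batch_size=1000, post=post_json):
    """Send docs in batches without committing, then commit once at the end."""
    batch = []
    sent = 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            post(update_url, batch)  # a plain add, no commit requested
            sent += len(batch)
            batch = []
    if batch:
        post(update_url, batch)  # final partial batch
        sent += len(batch)
    post(update_url, {"commit": {}})  # the only commit in the whole run
    return sent
```

The post argument is injectable so the loop can be exercised without a running Solr. Against a real core, update_url would look something like http://localhost:8983/solr/test_core/update (hypothetical host and port).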

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, which is single-threaded and
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.
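
Point 1 could be sketched roughly as below. The function takes any iterable of documents, writes each one as a line of JSON, and reports how long the whole pull took. The function name and the output path are assumptions for illustration. The source could be, for example, a pymongo cursor (pymongo is an assumption here; the original setup reads Mongo through a JDBC driver).

```python
import json
import time


def dump_to_disk(docs, path):
    """Write every document as one JSON line; return (count, seconds)."""
    start = time.perf_counter()
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # default=str copes with values json cannot encode natively,
            # such as ObjectId or datetime coming out of MongoDB.
            out.write(json.dumps(doc, default=str) + "\n")
            count += 1
    return count, time.perf_counter() - start


# Hypothetical usage with pymongo (connection details are placeholders):
#   from pymongo import MongoClient
#   cursor = MongoClient("mongodb://localhost:27017")["mydb"]["products"].find()
#   count, seconds = dump_to_disk(cursor, "products.jsonl")
#   print(count, "documents in", round(seconds, 1), "seconds")
```

If this alone takes hours, the time is going into reading from Mongo rather than into Solr.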


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: SOLR indexing takes longer time

2020-08-17 Thread Aroop Ganguly
Adding on to what others have said, indexing speed in general is largely 
affected by the parallelism and isolation you can give to each node.
Is there a reason why you cannot have more than 1 shard?
If you have a 5-node cluster, why not have 5 shards? maxShardsPerNode=1 with 
replica=1 is OK. You should see dramatic gains.
Solr’s power and speed in doing everything comes from using it as a distributed 
system. By sharding more you will be taking advantage of that distributed 
capability.
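
As a rough illustration of the 5-shard layout, the sketch below only builds a Collections API CREATE URL. It assumes Solr is running in SolrCloud mode, which the original post (a single core, no shards) does not describe. The base URL and collection name are placeholders, and maxShardsPerNode is a parameter from Solr releases of that era that newer versions may not accept.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=5):
    """Build a Collections API CREATE request for one replica per shard."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": 1,
        "maxShardsPerNode": 1,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(create_collection_url("http://localhost:8983/solr", "products"))
```

A real cluster would usually also need a configset name (collection.configName) pointing at the uploaded schema and solrconfig.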

HTH

Regards
Aroop

> On Aug 17, 2020, at 11:22 AM, Abhijit Pawar  wrote:
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit



Re: SOLR indexing takes longer time

2020-08-17 Thread Shawn Heisey

On 8/17/2020 12:22 PM, Abhijit Pawar wrote:

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?


There's not enough information here to provide a diagnosis.

Are you running Solr in cloud mode (with zookeeper)?

3.5 hours for 200K documents sounds like slowness with the data 
source, not a problem with Solr, but it's too soon to rule anything out.
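
To put the figures from the original post (roughly 200K documents in 3.5 hours) in perspective, a quick back-of-the-envelope calculation:

```python
docs = 200_000          # "200K plus" from the original post, rounded down
seconds = 3.5 * 3600    # 3.5 hours

docs_per_second = docs / seconds
ms_per_doc = 1000 * seconds / docs

print(f"{docs_per_second:.1f} docs/s, {ms_per_doc:.0f} ms per document")
# prints: 15.9 docs/s, 63 ms per document
```

About 16 documents per second is a very low rate for an indexing run, which is consistent with the suspicion that the time is spent fetching data rather than indexing it.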


Would you be able to write a program that pulls data from your mongo 
database but doesn't send it to Solr?  Ideally it would be a Java 
program using the same JDBC driver you're using with DIH.


Thanks,
Shawn



Re: SOLR indexing takes longer time

2020-08-17 Thread Walter Underwood
I’m seeing multiple red flags for performance here. The top ones are “DIH”,
“MongoDB”, and “SQL on MongoDB”. MongoDB is not a relational database.

Our multi-threaded extractor using the Mongo API was still three times slower
than the same approach on MySQL.

Check the CPU usage on the Solr hosts while you are indexing. If it is under 
50%, the bottleneck is MongoDB and single-threaded indexing.

For another check, run that same query in a regular database client and time it.
The Solr indexing will never be faster than that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 17, 2020, at 11:58 AM, Abhijit Pawar  wrote:
> 
> Sure Divye,
> 
> *Here's the config.*
> 
> *conf/solr-config.xml:*
> 
> 
> 
>  class="org.apache.solr.handler.dataimport.DataImportHandler">
> 
>  name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml
> 
> 
> 
> 
> 
> *schema.xml:*
> has all of the field definitions
> 
> *conf/dataimport/data-source-config.xml*
> 
> 
>  driver="com.mongodb.jdbc.MongoDriver" url="mongodb://< ADDRESS>>:27017/<>"/>
> 
>  dataSource="mongod"
> transformer="<>,TemplateTransformer"
> onError="continue"
> pk="uuid"
> query="SELECT field1,field2,field3,.. FROM products"
> deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
> orgidStr = '${dataimporter.request.orgid}' AND idStr =
> '${dataimporter.delta.idStr}'"
> deltaQuery="SELECT idStr FROM products WHERE orgidStr =
> '${dataimporter.request.orgid}' AND updatedAt >
> '${dataimporter.last_index_time}'"
>> 
> 
> 
> 
> 
> .
> .
> . 4-5 more nested entities...
> 
> On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
> wrote:
> 
>> Can you share the dih configuration you are using for same?
>> 
>> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>> 
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 



Re: SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Sure Divye,

*Here's the config.*

*conf/solr-config.xml:*





/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml





*schema.xml:*
has all of the field definitions

*conf/dataimport/data-source-config.xml*









.
.
. 4-5 more nested entities...

On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
wrote:

> Can you share the dih configuration you are using for same?
>
> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>


Re: SOLR indexing takes longer time

2020-08-17 Thread Jörn Franke
The DIH is single-threaded and deprecated. Your best bet is to have a 
script/program extract the data from MongoDB and write it to Solr in batches 
using multiple threads. You will see significantly higher performance for your 
data.
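
One way this could look, as a sketch only: the documents are cut into batches and a small thread pool sends them concurrently. The send_batch callable is a hypothetical placeholder for whatever posts one batch to Solr. Batch size and worker count are arbitrary starting points to tune, not recommendations from this thread.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_in_parallel(docs, send_batch, batch_size=500, workers=4):
    """Send batches concurrently; return the number of documents sent."""

    def send(batch):
        send_batch(batch)
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reading the results re-raises any exception from a worker thread.
        return sum(pool.map(send, batches(docs, batch_size)))
```

Note that pool.map submits every batch up front, so the whole data set is held in memory while the workers drain it. That is acceptable for a couple of hundred thousand small documents, but a bounded queue would be the safer choice for much larger sets.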

> On 17.08.2020 at 20:23, Abhijit Pawar wrote:
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit


Re: SOLR indexing takes longer time

2020-08-17 Thread Divye Handa
Can you share the DIH configuration you are using for this?

On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:

> Hello,
>
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
>
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
>
> Appreciate your help!
>
> Regards,
> Abhijit
>