Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Zheng Lin Edwin Yeo
Hi David,

Yes, I do have this field "_root_" in the schema.

However, I don't think I actually use the field, and there is no difference in
the indexing speed after I remove it.

Regards,
Edwin

On Wed, 3 Apr 2019 at 22:57, David Smiley  wrote:

> Hi Edwin,
>
> I'd like to rule something out.  Does your schema define a field "_root_"?
> If you don't have nested documents then remove it.  Its presence adds
> indexing weight in 8.0 that was not there previously.  I'm not sure how
> much, though; I hope it's small, but who knows.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Apr 2, 2019 at 10:17 PM Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi,
> >
> > I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
> > scratch in Solr 8.0.0
> >
> > However, I found that the indexing speed is slower in Solr 8.0.0, as
> > compared to the earlier version like Solr 7.7.1. I have not changed the
> > schema.xml and solrconfig.xml yet, just did a change of the
> > luceneMatchVersion in solrconfig.xml to 8.0.0
> > <luceneMatchVersion>8.0.0</luceneMatchVersion>
> >
> > On average, the speed is about 40% to 50% slower. For example, the
> indexing
> > speed was about 17 mins in Solr 7.7.1, but now it takes about 25 mins to
> > index the same set of data.
> >
> > What could be the reason that causes the indexing to be slower in Solr
> > 8.0.0?
> >
> > Regards,
> > Edwin
> >
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread David Smiley
Hi Edwin,

I'd like to rule something out.  Does your schema define a field "_root_"?
If you don't have nested documents then remove it.  Its presence adds
indexing weight in 8.0 that was not there previously.  I'm not sure how
much, though; I hope it's small, but who knows.
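
For anyone following along, here is a rough SolrJ sketch (not David's actual procedure) of checking for the "_root_" field and dropping it via the Schema API. It only applies when a managed schema is in use; with a classic schema.xml you would simply delete the `<field name="_root_" .../>` line by hand. Host and collection names are placeholders:

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class RootFieldCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and collection name; adjust to your own setup.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            String collection = "mycollection";
            try {
                // Ask the Schema API whether "_root_" is explicitly defined.
                Map<String, Object> field =
                        new SchemaRequest.Field("_root_").process(client, collection).getField();
                System.out.println("_root_ is defined: " + field);
                // If you do not use nested documents, it can be removed (managed schema only).
                new SchemaRequest.DeleteField("_root_").process(client, collection);
                System.out.println("_root_ removed");
            } catch (Exception e) {
                System.out.println("_root_ not found or schema is not managed: " + e.getMessage());
            }
        }
    }
}
```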

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Apr 2, 2019 at 10:17 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
> scratch in Solr 8.0.0
>
> However, I found that the indexing speed is slower in Solr 8.0.0, as
> compared to the earlier version like Solr 7.7.1. I have not changed the
> schema.xml and solrconfig.xml yet, just did a change of the
> luceneMatchVersion in solrconfig.xml to 8.0.0
> <luceneMatchVersion>8.0.0</luceneMatchVersion>
>
> On average, the speed is about 40% to 50% slower. For example, the indexing
> speed was about 17 mins in Solr 7.7.1, but now it takes about 25 mins to
> index the same set of data.
>
> What could be the reason that causes the indexing to be slower in Solr
> 8.0.0?
>
> Regards,
> Edwin
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread David Smiley
What/where is this benchmark?  I recall Ishan was once working with a
volunteer to set up something like Lucene has, but sadly it was not
successful.

On Wed, Apr 3, 2019 at 6:04 AM Đạt Cao Mạnh  wrote:

> Hi guys,
>
> I'm seeing the same problems with Shalin's nightly indexing benchmark. This
> happened around this period:
> git log --before=2018-12-07 --after=2018-11-21
>
> On Wed, Apr 3, 2019 at 8:45 AM Toke Eskildsen  wrote:
>
>> On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
>> > Yes, I am using DocValues for most of my fields.
>>
>> So that's a culprit. Thank you.
>>
>> > Currently we can't share the test data yet as some of the records are
>> > sensitive. Do you have any data from CSV file that you can test?
>>
>> Not really. I asked because it was a relatively easy way to do testing
>> (replicate your indexing flow with both Solr 7 & 8 as end-points,
>> attach JVisualVM to the Solrs and compare the profiles).
>>
>>
>> I'll put on my to-do to create a test or two with the scenario
>> "indexing from CSV with many DocValues fields". I'll try and generate
>> some test data and see if I can reproduce with them. If this is to be a
>> JIRA, that's needed anyway. Can't promise when I'll get to it, sorry.
>>
>> If this does turn out to be the cause of your performance regression,
>> the fix (if possible) will be for a later Solr version. Currently it is
>> not possible to tweak the docValues indexing parameters outside of code
>> changes.
>>
>>
>> Do note that we're still operating on guesses here. The cause for your
>> regression might easily be elsewhere.
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>
>>
>
> --
> *Best regards,*
> *Cao Mạnh Đạt*
>
>
> *D.O.B: 31-07-1991 | Cell: (+84) 946.328.329 | E-mail: caomanhdat...@gmail.com*
>
-- 
Sent from Gmail Mobile


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Toke Eskildsen
On Wed, 2019-04-03 at 18:04 +0800, Zheng Lin Edwin Yeo wrote:
> I have tried to set all the docValues in my schema.xml to false and
> do the indexing again.
> There isn't any difference with the indexing speed as compared to
> when we have enabled the docValues.

Thank you for sparing me the work.

- Toke Eskildsen, Royal Danish Library




Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Zheng Lin Edwin Yeo
Hi Toke,

I have tried to set all the docValues in my schema.xml to false and do the
indexing again.
There isn't any difference with the indexing speed as compared to when we
have enabled the docValues.

Seems like the cause of the regression might be somewhere else?

Regards,
Edwin

On Wed, 3 Apr 2019 at 15:45, Toke Eskildsen  wrote:

> On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
> > Yes, I am using DocValues for most of my fields.
>
> So that's a culprit. Thank you.
>
> > Currently we can't share the test data yet as some of the records are
> > sensitive. Do you have any data from CSV file that you can test?
>
> Not really. I asked because it was a relatively easy way to do testing
> (replicate your indexing flow with both Solr 7 & 8 as end-points,
> attach JVisualVM to the Solrs and compare the profiles).
>
>
> I'll put on my to-do to create a test or two with the scenario
> "indexing from CSV with many DocValues fields". I'll try and generate
> some test data and see if I can reproduce with them. If this is to be a
> JIRA, that's needed anyway. Can't promise when I'll get to it, sorry.
>
> If this does turn out to be the cause of your performance regression,
> the fix (if possible) will be for a later Solr version. Currently it is
> not possible to tweak the docValues indexing parameters outside of code
> changes.
>
>
> Do note that we're still operating on guesses here. The cause for your
> regression might easily be elsewhere.
>
> - Toke Eskildsen, Royal Danish Library
>
>
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Đạt Cao Mạnh
Hi guys,

I'm seeing the same problems with Shalin's nightly indexing benchmark. This
happened around this period:
git log --before=2018-12-07 --after=2018-11-21

On Wed, Apr 3, 2019 at 8:45 AM Toke Eskildsen  wrote:

> On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
> > Yes, I am using DocValues for most of my fields.
>
> So that's a culprit. Thank you.
>
> > Currently we can't share the test data yet as some of the records are
> > sensitive. Do you have any data from CSV file that you can test?
>
> Not really. I asked because it was a relatively easy way to do testing
> (replicate your indexing flow with both Solr 7 & 8 as end-points,
> attach JVisualVM to the Solrs and compare the profiles).
>
>
> I'll put on my to-do to create a test or two with the scenario
> "indexing from CSV with many DocValues fields". I'll try and generate
> some test data and see if I can reproduce with them. If this is to be a
> JIRA, that's needed anyway. Can't promise when I'll get to it, sorry.
>
> If this does turn out to be the cause of your performance regression,
> the fix (if possible) will be for a later Solr version. Currently it is
> not possible to tweak the docValues indexing parameters outside of code
> changes.
>
>
> Do note that we're still operating on guesses here. The cause for your
> regression might easily be elsewhere.
>
> - Toke Eskildsen, Royal Danish Library
>
>
>

-- 
*Best regards,*
*Cao Mạnh Đạt*


*D.O.B: 31-07-1991 | Cell: (+84) 946.328.329 | E-mail: caomanhdat...@gmail.com*


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Toke Eskildsen
On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
> Yes, I am using DocValues for most of my fields.

So that's a culprit. Thank you.

> Currently we can't share the test data yet as some of the records are
> sensitive. Do you have any data from CSV file that you can test? 

Not really. I asked because it was a relatively easy way to do testing
(replicate your indexing flow with both Solr 7 & 8 as end-points,
attach JVisualVM to the Solrs and compare the profiles).


I'll put on my to-do to create a test or two with the scenario
"indexing from CSV with many DocValues fields". I'll try and generate
some test data and see if I can reproduce with them. If this is to be a
JIRA, that's needed anyway. Can't promise when I'll get to it, sorry.
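
For anyone who wants to try reproducing this before a JIRA exists, below is a minimal sketch (not Toke's planned test) that generates a CSV file in roughly the shape Edwin describes: many columns whose dynamic-field suffixes (*_s, *_i) typically resolve to docValues-enabled types in a stock schema. The row and column counts are assumptions; tune them to taste:

```java
import java.io.PrintWriter;
import java.util.Random;

public class CsvTestDataGenerator {
    public static void main(String[] args) throws Exception {
        int rows = 3_000_000;      // roughly one of the CSV files mentioned in the thread
        int stringFields = 20;     // assumed field counts
        int intFields = 20;
        Random rnd = new Random(42);

        try (PrintWriter out = new PrintWriter("testdata.csv")) {
            // Header: id plus dynamic-field style column names (*_s, *_i).
            StringBuilder header = new StringBuilder("id");
            for (int f = 0; f < stringFields; f++) header.append(",field").append(f).append("_s");
            for (int f = 0; f < intFields; f++) header.append(",num").append(f).append("_i");
            out.println(header);

            // One row per document with random string and integer values.
            for (int i = 0; i < rows; i++) {
                StringBuilder row = new StringBuilder("doc").append(i);
                for (int f = 0; f < stringFields; f++) row.append(",val").append(rnd.nextInt(1000));
                for (int f = 0; f < intFields; f++) row.append(',').append(rnd.nextInt(1_000_000));
                out.println(row);
            }
        }
    }
}
```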

If this does turn out to be the cause of your performance regression,
the fix (if possible) will be for a later Solr version. Currently it is
not possible to tweak the docValues indexing parameters outside of code
changes.


Do note that we're still operating on guesses here. The cause for your
regression might easily be elsewhere.

- Toke Eskildsen, Royal Danish Library




Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Zheng Lin Edwin Yeo
Yes, I am using DocValues for most of my fields.

I am using dynamicFields, where I have appended suffixes like _s, _i, etc. to
the field names in the CSV file.

Currently we can't share the test data yet as some of the records are
sensitive. Do you have any data from CSV file that you can test?
If not we have to remove all the sensitive data before I can share.

Regards,
Edwin



On Wed, 3 Apr 2019 at 14:38, Toke Eskildsen  wrote:

> On Wed, 2019-04-03 at 10:17 +0800, Zheng Lin Edwin Yeo wrote:
> > What could be the reason that causes the indexing to be slower in
> > Solr 8.0.0?
>
> As Aroop states there can be multiple explanations. One of them is the
> change to how DocValues are handled in 8.0.0. The indexing impact
> should be tiny, but mistakes happen. With that in mind, do you have
> DocValues enabled for a lot of your fields?
>
> Performance issues like this one are notoriously hard to debug remote.
> Is it possible for you to share your setup and your test data?
>
> - Toke Eskildsen, Royal Danish Library
>
>
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-03 Thread Toke Eskildsen
On Wed, 2019-04-03 at 10:17 +0800, Zheng Lin Edwin Yeo wrote:
> What could be the reason that causes the indexing to be slower in
> Solr 8.0.0?

As Aroop states there can be multiple explanations. One of them is the
change to how DocValues are handled in 8.0.0. The indexing impact
should be tiny, but mistakes happen. With that in mind, do you have
DocValues enabled for a lot of your fields?
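
As a quick way to answer that question, here is a hedged SolrJ sketch (not part of the original exchange) that lists how many explicitly defined fields have docValues=true via the Schema API. Note that docValues inherited from a fieldType default will not show up in this per-field listing. The URL and collection name are placeholders:

```java
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class DocValuesFieldCount {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and collection; point these at your own instance.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            List<Map<String, Object>> fields =
                    new SchemaRequest.Fields().process(client, "mycollection").getFields();
            // Count fields where docValues is set explicitly on the field definition.
            long withDocValues = fields.stream()
                    .filter(f -> Boolean.TRUE.equals(f.get("docValues")))
                    .count();
            System.out.println(withDocValues + " of " + fields.size()
                    + " explicitly defined fields have docValues=true");
        }
    }
}
```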

Performance issues like this one are notoriously hard to debug remote.
Is it possible for you to share your setup and your test data?

- Toke Eskildsen, Royal Danish Library




Re: Slower indexing speed in Solr 8.0.0

2019-04-02 Thread Zheng Lin Edwin Yeo
I'm using an external ZooKeeper, running SolrCloud with one shard and two
replicas. This is a testing setup, so there is only one machine.
The input data comes from CSV files. I am indexing one CSV file at a
time, and each CSV file contains 3 million records.
I'm indexing using the code from SimplePostTool.
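
For readers who want to replicate this flow against both a 7.7.1 and an 8.0.0 endpoint (for example to attach a profiler, as suggested later in the thread), here is a rough SolrJ sketch of posting a CSV file the way SimplePostTool does. It is not Edwin's actual harness; the URL, collection and file name are placeholders:

```java
import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, collection and file - adjust to your own setup.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
            req.addFile(new File("data.csv"), "text/csv");   // stream the CSV file
            req.setParam("commit", "true");                  // single commit at the end
            req.process(client, "mycollection");
            System.out.println("CSV posted");
        }
    }
}
```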

I have already tried it more than 10 times, and every time the indexing
speed in 8.0 was at least 40% slower than in 7.7.1.

Regards,
Edwin




On Wed, 3 Apr 2019 at 11:19, Aroop Ganguly  wrote:

> Indexing speeds are a function of a lot of variables in my experience.
>
> What is your setup like?
> What kind of cluster do you have, how many shards did you create, how many
> machines, etc.?
> Where is your input data coming from? What technology do you use for
> indexing (simple Java threads or something more robust like Flink/Spark)?
> How many documents do you index at a time?
>
> How many times have you run the indexer job on the new 8.0 setup before
> concluding it's slower?
> Make a matrix of all these variables and test over at least 5 runs before
> making an opinion.
>
> I’d love to hear more.
>
> > On Apr 2, 2019, at 7:41 PM, Zheng Lin Edwin Yeo 
> wrote:
> >
> > For additional info, I am still using the same version of the major
> > components like ZooKeeper, Tika, Carrot2 and Jetty.
> >
> > Regards,
> > Edwin
> >
> > On Wed, 3 Apr 2019 at 10:17, Zheng Lin Edwin Yeo 
> > wrote:
> >
> >> Hi,
> >>
> >> I am setting up the latest Solr 8.0.0, and I am re-indexing the data
> from
> >> scratch in Solr 8.0.0
> >>
> >> However, I found that the indexing speed is slower in Solr 8.0.0, as
> >> compared to the earlier version like Solr 7.7.1. I have not changed the
> >> schema.xml and solrconfig.xml yet, just did a change of the
> >> luceneMatchVersion in solrconfig.xml to 8.0.0
> >> <luceneMatchVersion>8.0.0</luceneMatchVersion>
> >>
> >> On average, the speed is about 40% to 50% slower. For example, the
> >> indexing speed was about 17 mins in Solr 7.7.1, but now it takes about
> 25
> >> mins to index the same set of data.
> >>
> >> What could be the reason that causes the indexing to be slower in Solr
> >> 8.0.0?
> >>
> >> Regards,
> >> Edwin
> >>
>
>


Re: Slower indexing speed in Solr 8.0.0

2019-04-02 Thread Aroop Ganguly
Indexing speeds are a function of a lot of variables in my experience.

What is your setup like? 
What kind of cluster do you have, how many shards did you create, how many
machines, etc.?
Where is your input data coming from? What technology do you use for indexing
(simple Java threads or something more robust like Flink/Spark)?
How many documents do you index at a time?

How many times have you run the indexer job on the new 8.0 setup before
concluding it's slower?
Make a matrix of all these variables and test over at least 5 runs before 
making an opinion.

I’d love to hear more.

> On Apr 2, 2019, at 7:41 PM, Zheng Lin Edwin Yeo  wrote:
> 
> For additional info, I am still using the same version of the major
> components like ZooKeeper, Tika, Carrot2 and Jetty.
> 
> Regards,
> Edwin
> 
> On Wed, 3 Apr 2019 at 10:17, Zheng Lin Edwin Yeo 
> wrote:
> 
>> Hi,
>> 
>> I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
>> scratch in Solr 8.0.0
>> 
>> However, I found that the indexing speed is slower in Solr 8.0.0, as
>> compared to the earlier version like Solr 7.7.1. I have not changed the
>> schema.xml and solrconfig.xml yet, just did a change of the
>> luceneMatchVersion in solrconfig.xml to 8.0.0
>> <luceneMatchVersion>8.0.0</luceneMatchVersion>
>> 
>> On average, the speed is about 40% to 50% slower. For example, the
>> indexing speed was about 17 mins in Solr 7.7.1, but now it takes about 25
>> mins to index the same set of data.
>> 
>> What could be the reason that causes the indexing to be slower in Solr
>> 8.0.0?
>> 
>> Regards,
>> Edwin
>> 



Re: Slower indexing speed in Solr 8.0.0

2019-04-02 Thread Zheng Lin Edwin Yeo
For additional info, I am still using the same version of the major
components like ZooKeeper, Tika, Carrot2 and Jetty.

Regards,
Edwin

On Wed, 3 Apr 2019 at 10:17, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
> scratch in Solr 8.0.0
>
> However, I found that the indexing speed is slower in Solr 8.0.0, as
> compared to the earlier version like Solr 7.7.1. I have not changed the
> schema.xml and solrconfig.xml yet, just did a change of the
> luceneMatchVersion in solrconfig.xml to 8.0.0
> <luceneMatchVersion>8.0.0</luceneMatchVersion>
>
> On average, the speed is about 40% to 50% slower. For example, the
> indexing speed was about 17 mins in Solr 7.7.1, but now it takes about 25
> mins to index the same set of data.
>
> What could be the reason that causes the indexing to be slower in Solr
> 8.0.0?
>
> Regards,
> Edwin
>


Slower indexing speed in Solr 8.0.0

2019-04-02 Thread Zheng Lin Edwin Yeo
Hi,

I am setting up the latest Solr 8.0.0, and I am re-indexing the data from
scratch in Solr 8.0.0.

However, I found that the indexing speed is slower in Solr 8.0.0, as
compared to earlier versions like Solr 7.7.1. I have not changed
schema.xml and solrconfig.xml yet, just changed the
luceneMatchVersion in solrconfig.xml to 8.0.0:
<luceneMatchVersion>8.0.0</luceneMatchVersion>

On average, the speed is about 40% to 50% slower. For example, indexing
took about 17 minutes in Solr 7.7.1, but now it takes about 25 minutes to
index the same set of data.

What could be the reason that causes the indexing to be slower in Solr
8.0.0?

Regards,
Edwin


Re: Improve indexing speed?

2019-01-01 Thread Shawn Heisey

On 1/1/2019 8:59 AM, John Milton wrote:

My document contains 65 fields. All the fields need to be indexed, but
indexing 100 documents takes 10 seconds.
I am using Solr 7.5 (2 SolrCloud instances), with 50 shards.


The best way to achieve fast indexing in Solr is to index multiple items 
in parallel.  That is, make your indexing system multi-threaded or 
multi-process.
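
As a concrete illustration of the "index in parallel" advice, here is a minimal SolrJ sketch (not from the original thread) using ConcurrentUpdateSolrClient, which buffers documents and sends them on several background threads. The URL, field names and sizes are assumptions; note that this client reports indexing errors asynchronously, so also watch the Solr logs:

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder collection URL and document shape - adjust to your schema.
        String updateUrl = "http://localhost:8983/solr/mycollection";
        try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(updateUrl)
                .withQueueSize(10000)   // documents buffered before a request is sent
                .withThreadCount(4)     // parallel update requests to Solr
                .build()) {
            for (int i = 0; i < 100_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title_s", "title " + i);
                client.add(doc);        // buffered and sent in the background
            }
            client.commit();            // one commit at the end, not per document
        }
    }
}
```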


As Erick also asked ... why do you have so many shards?  The only good 
reason I can imagine for so many shards is a need to handle billions of 
documents.


Thanks,
Shawn



Re: Improve indexing speed?

2019-01-01 Thread Hendrik Haddorp
How are you indexing the documents? Are you using SolrJ or the plain 
REST API?
Are you sending the documents one by one or all in one request? The 
performance is far better if you send the 100 documents in one request.

If you send them individually, are you doing any commits between them?
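
To make the batching point concrete, here is a small SolrJ sketch (an illustration, not a benchmark) that sends all 100 documents in a single request with a single commit. Collection name and fields are placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexing {
    public static void main(String[] args) throws Exception {
        // Placeholder URL, collection and fields.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("name_s", "value " + i);
                batch.add(doc);
            }
            client.add("mycollection", batch);   // one request for all 100 documents
            client.commit("mycollection");       // one commit, not one per document
        }
    }
}
```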

regards,
Hendrik

On 01.01.2019 16:59, John Milton wrote:

Hi to all,

My document contains 65 fields. All the fields need to be indexed, but
indexing 100 documents takes 10 seconds.
I am using Solr 7.5 (2 SolrCloud instances), with 50 shards.
It's running on Windows, and the machine has 32 GB RAM with a 15 GB Java heap.
How can I improve indexing speed?
Note:
All the fields contain at most 20 characters. The field type is text_general
with case-insensitive matching.

Thanks,
John Milton





Re: Improve indexing speed?

2019-01-01 Thread Erick Erickson
What have you tried? The first thing I'd try is using just 1 or 2
shards. My first guess is that you're doing a lot of GC because you
have 50 shards in a single JVM (1 replica/shard?).

I regularly get several thousand Wikipedia docs/second on my macbook
pro, so your numbers are way out of the norm.

Best,
Erick

On Tue, Jan 1, 2019 at 9:05 AM John Milton  wrote:
>
> Hi to all,
>
> My document contains 65 fields. All the fields need to be indexed, but
> indexing 100 documents takes 10 seconds.
> I am using Solr 7.5 (2 SolrCloud instances), with 50 shards.
> It's running on Windows, and the machine has 32 GB RAM with a 15 GB Java heap.
> How can I improve indexing speed?
> Note:
> All the fields contain at most 20 characters. The field type is text_general
> with case-insensitive matching.
>
> Thanks,
> John Milton


Improve indexing speed?

2019-01-01 Thread John Milton
Hi to all,

My document contains 65 fields. All the fields need to be indexed, but
indexing 100 documents takes 10 seconds.
I am using Solr 7.5 (2 SolrCloud instances), with 50 shards.
It's running on Windows, and the machine has 32 GB RAM with a 15 GB Java heap.
How can I improve indexing speed?
Note:
All the fields contain at most 20 characters. The field type is text_general
with case-insensitive matching.

Thanks,
John Milton


Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-03-01 Thread 苗海泉
Thank you for your advice on GC tools. Which one do you suggest I use?

2018-02-28 23:57 GMT+08:00 Shawn Heisey :

> On 2/28/2018 2:53 AM, 苗海泉 wrote:
>
>> Thanks for your detailed advice, the monitor product you are talking about
>> is good, but our solr system is running on a private network and seems to
>> be unusable at all, with no single downloadable application for analyzing
>> specific gc logs.
>>
>
> For analyzing GC logs, the GCViewer app is useful.  With some practice
> (learning to disable irrelevant information) you can pinpoint problems.  It
> also compiles statistics about GC intervals, which can be very helpful.  It
> is an executable jar.
>
> https://github.com/chewiebug/GCViewer
>
> But I have found an even easier tool for general use:
>
> http://gceasy.io/
>
> I still find value in GCViewer, but most of the time the information I'm
> after is provided by gceasy, and it's a lot easier to decipher.
>
> Possible disadvantage for gceasy: it's an online tool. So you have to copy
> the log out of disconnected networks into a machine with Internet access.
> I don't anticipate any sort of privacy problems with them -- logs that you
> upload are not kept very long, and GC logs don't contain anything sensitive
> anyway.
>
> Thanks,
> Shawn
>
>


-- 
==
联创科技
知行如一
==


Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-28 Thread Shawn Heisey

On 2/28/2018 2:53 AM, 苗海泉 wrote:

Thanks for your detailed advice, the monitor product you are talking about
is good, but our solr system is running on a private network and seems to
be unusable at all, with no single downloadable application for analyzing
specific gc logs.


For analyzing GC logs, the GCViewer app is useful.  With some practice 
(learning to disable irrelevant information) you can pinpoint problems.  
It also compiles statistics about GC intervals, which can be very 
helpful.  It is an executable jar.


https://github.com/chewiebug/GCViewer

But I have found an even easier tool for general use:

http://gceasy.io/

I still find value in GCViewer, but most of the time the information I'm 
after is provided by gceasy, and it's a lot easier to decipher.


Possible disadvantage for gceasy: it's an online tool. So you have to 
copy the log out of disconnected networks into a machine with Internet 
access.  I don't anticipate any sort of privacy problems with them -- 
logs that you upload are not kept very long, and GC logs don't contain 
anything sensitive anyway.


Thanks,
Shawn



Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-28 Thread Emir Arnautović
If you are only after visualising GC, there are several tools that you can 
download, or upload your logs to, for visualisation. If you would like to monitor the whole 
host/solr/jvm stack, Sematext’s SPM also comes in an on-premises version, where you 
install and host your own monitoring infrastructure: 
https://sematext.com/spm/#on-premises 

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 28 Feb 2018, at 10:53, 苗海泉  wrote:
> 
> Thanks for your detailed advice, the monitor product you are talking about
> is good, but our solr system is running on a private network and seems to
> be unusable at all, with no single downloadable application for analyzing
> specific gc logs.
> 
> 2018-02-28 16:57 GMT+08:00 Emir Arnautović  >:
> 
>> Hi,
>> I would start with following:
>> 1. have dedicated nodes for ZK ensemble - those do not have to be powerful
>> nodes (maybe 2-4 cores and 8GB RAM)
>> 2. reduce heap size to value below margin where JVM can use compressed
>> oops - 31GB should be safe size
>> 3. shard collection to all nodes
>> 4. increase rollover interval to 2h so you keep shard size/number as it is
>> today.
>> 5. experiment with slightly larger rollover intervals (e.g. 3h) if query
>> latency is still acceptable. That will result in less shards that are
>> slightly larger.
>> 
>> In any case monitor your cluster to see how changes affect it. Not sure
>> what you currently use for monitoring, but manual scanning of GC logs is
>> not fun. You can check out our monitoring tool if you don’t have one or if
>> it does not give you enough visibility: https://sematext.com/spm/ <
>> https://sematext.com/spm/ >
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 28 Feb 2018, at 02:42, 苗海泉  wrote:
>>> 
>>> Thank you, I read under the memory footprint, I set 75% recovery, memory
>>> occupancy at about 76%, the other we zookeeper not on a dedicated server,
>>> perhaps because of this cause instability.
>>> 
>>> What else do you recommend for me to check?
>>> 
>>> 2018-02-27 22:37 GMT+08:00 Emir Arnautović >> :
>>> 
 This does not show much: only that your heap is around 75% (24-25GB). I
 was thinking that you should compare metrics (heap/GC as well) when
>> running
 on without issues and when running with issues and see if something can
>> be
 concluded.
 About instability: Do you run ZK on dedicated nodes?
 
 Emir
 --
 Monitoring - Log Management - Alerting - Anomaly Detection
 Solr & Elasticsearch Consulting Support Training - http://sematext.com/
 
 
 
> On 27 Feb 2018, at 14:43, 苗海泉  wrote:
> 
> Thank you, we were 49 shard 49 nodes, but later found that in this
>> case,
> often disconnect between solr and zookeepr, zookeeper too many nodes
 caused
> solr instability, so reduced to 25 A follow-up performance can not keep
 up
> also need to increase back.
> 
> Very slow when solr and zookeeper not found any errors, just build the
> index slow, automatic commit inside the log display is slow, but the
>> main
> reason may not lie in the commit place.
> 
> I am sorry, I do not know how to look at the utilization of java heap,
> through the gc log, gc time is not long, I posted the log:
> 
> 
> {Heap before GC invocations=1144021 (full 72):
> garbage-first heap   total 33554432K, used 26982419K
>> [0x7f147800,
> 0x7f1478808000, 0x7f1c7800)
> region size 8192K, 204 young (1671168K), 26 survivors (212992K)
> Metaspace   used 41184K, capacity 41752K, committed 67072K,
>> reserved
> 67584K
> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation
 Pause)
> (young)
> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
> - age   1:  113878760 bytes,  113878760 total
> - age   2:   21264744 bytes,  135143504 total
> - age   3:   17020096 bytes,  152163600 total
> - age   4:   26870864 bytes,  179034464 total
> , 0.0579794 secs]
> [Parallel Time: 46.9 ms, GC Workers: 18]
>[GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
> 4668016046.4, Diff: 0.3]
>[Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
> Sum: 116.9]
>[Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum:
>> 62.0]
>   [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum:
 113]
>[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-28 Thread 苗海泉
Thanks for your detailed advice. The monitoring product you are talking about
looks good, but our Solr system is running on a private network, so it seems
unusable for us, and there is no single downloadable application for analyzing
the GC logs.

2018-02-28 16:57 GMT+08:00 Emir Arnautović :

> Hi,
> I would start with following:
> 1. have dedicated nodes for ZK ensemble - those do not have to be powerful
> nodes (maybe 2-4 cores and 8GB RAM)
> 2. reduce heap size to value below margin where JVM can use compressed
> oops - 31GB should be safe size
> 3. shard collection to all nodes
> 4. increase rollover interval to 2h so you keep shard size/number as it is
> today.
> 5. experiment with slightly larger rollover intervals (e.g. 3h) if query
> latency is still acceptable. That will result in less shards that are
> slightly larger.
>
> In any case monitor your cluster to see how changes affect it. Not sure
> what you currently use for monitoring, but manual scanning of GC logs is
> not fun. You can check out our monitoring tool if you don’t have one or if
> it does not give you enough visibility: https://sematext.com/spm/ <
> https://sematext.com/spm/>
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 28 Feb 2018, at 02:42, 苗海泉  wrote:
> >
> > Thank you, I read under the memory footprint, I set 75% recovery, memory
> > occupancy at about 76%, the other we zookeeper not on a dedicated server,
> > perhaps because of this cause instability.
> >
> > What else do you recommend for me to check?
> >
> > 2018-02-27 22:37 GMT+08:00 Emir Arnautović  >:
> >
> >> This does not show much: only that your heap is around 75% (24-25GB). I
> >> was thinking that you should compare metrics (heap/GC as well) when
> running
> >> on without issues and when running with issues and see if something can
> be
> >> concluded.
> >> About instability: Do you run ZK on dedicated nodes?
> >>
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 27 Feb 2018, at 14:43, 苗海泉  wrote:
> >>>
> >>> Thank you, we were 49 shard 49 nodes, but later found that in this
> case,
> >>> often disconnect between solr and zookeepr, zookeeper too many nodes
> >> caused
> >>> solr instability, so reduced to 25 A follow-up performance can not keep
> >> up
> >>> also need to increase back.
> >>>
> >>> Very slow when solr and zookeeper not found any errors, just build the
> >>> index slow, automatic commit inside the log display is slow, but the
> main
> >>> reason may not lie in the commit place.
> >>>
> >>> I am sorry, I do not know how to look at the utilization of java heap,
> >>> through the gc log, gc time is not long, I posted the log:
> >>>
> >>>
> >>> {Heap before GC invocations=1144021 (full 72):
> >>> garbage-first heap   total 33554432K, used 26982419K
> [0x7f147800,
> >>> 0x7f1478808000, 0x7f1c7800)
> >>> region size 8192K, 204 young (1671168K), 26 survivors (212992K)
> >>> Metaspace   used 41184K, capacity 41752K, committed 67072K,
> reserved
> >>> 67584K
> >>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation
> >> Pause)
> >>> (young)
> >>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
> >>> - age   1:  113878760 bytes,  113878760 total
> >>> - age   2:   21264744 bytes,  135143504 total
> >>> - age   3:   17020096 bytes,  152163600 total
> >>> - age   4:   26870864 bytes,  179034464 total
> >>> , 0.0579794 secs]
> >>>  [Parallel Time: 46.9 ms, GC Workers: 18]
> >>> [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
> >>> 4668016046.4, Diff: 0.3]
> >>> [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
> >>> Sum: 116.9]
> >>> [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum:
> 62.0]
> >>>[Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum:
> >> 113]
> >>> [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
> >>> [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> >>> Sum: 0.0]
> >>> [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
> >>> 428.1]
> >>> [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
> >>> 228.9]
> >>>[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum:
> >> 18]
> >>> [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4,
> Sum:
> >>> 1.2]
> >>> [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
> >>> Sum: 838.0]
> >>> [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
> >>> 4668016092.8, Diff: 0.0]
> >>>  [Code Root Fixup: 0.2 ms]
> >>>  [Code Root Purge: 0.0 ms]
> >>>  [Clear CT: 0.3 ms]
> >>>  [Other: 10.7 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-28 Thread Emir Arnautović
Hi,
I would start with the following:
1. have dedicated nodes for the ZK ensemble - those do not have to be powerful 
nodes (maybe 2-4 cores and 8GB RAM)
2. reduce the heap size to a value below the margin where the JVM can use 
compressed oops - 31GB should be a safe size (a quick way to check is sketched 
after this list)
3. shard the collection across all nodes
4. increase the rollover interval to 2h so you keep the shard size/number as it 
is today.
5. experiment with slightly larger rollover intervals (e.g. 3h) if query 
latency is still acceptable. That will result in fewer shards that are slightly 
larger.
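
As a side note on item 2, here is a small HotSpot-specific sketch (not from the original thread) that can be run with the same -Xmx as Solr to confirm whether compressed oops are still in effect at a given heap size:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CompressedOopsCheck {
    public static void main(String[] args) {
        // HotSpot-only diagnostic bean; run with e.g. -Xmx31g to test a candidate heap size.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("Max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
        System.out.println("UseCompressedOops = "
                + bean.getVMOption("UseCompressedOops").getValue());
    }
}
```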

In any case monitor your cluster to see how changes affect it. Not sure what 
you currently use for monitoring, but manual scanning of GC logs is not fun. 
You can check out our monitoring tool if you don’t have one or if it does not 
give you enough visibility: https://sematext.com/spm/ 
 

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 28 Feb 2018, at 02:42, 苗海泉  wrote:
> 
> Thank you, I read under the memory footprint, I set 75% recovery, memory
> occupancy at about 76%, the other we zookeeper not on a dedicated server,
> perhaps because of this cause instability.
> 
> What else do you recommend for me to check?
> 
> 2018-02-27 22:37 GMT+08:00 Emir Arnautović :
> 
>> This does not show much: only that your heap is around 75% (24-25GB). I
>> was thinking that you should compare metrics (heap/GC as well) when running
>> on without issues and when running with issues and see if something can be
>> concluded.
>> About instability: Do you run ZK on dedicated nodes?
>> 
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 27 Feb 2018, at 14:43, 苗海泉  wrote:
>>> 
>>> Thank you, we were 49 shard 49 nodes, but later found that in this case,
>>> often disconnect between solr and zookeepr, zookeeper too many nodes
>> caused
>>> solr instability, so reduced to 25 A follow-up performance can not keep
>> up
>>> also need to increase back.
>>> 
>>> Very slow when solr and zookeeper not found any errors, just build the
>>> index slow, automatic commit inside the log display is slow, but the main
>>> reason may not lie in the commit place.
>>> 
>>> I am sorry, I do not know how to look at the utilization of java heap,
>>> through the gc log, gc time is not long, I posted the log:
>>> 
>>> 
>>> {Heap before GC invocations=1144021 (full 72):
>>> garbage-first heap   total 33554432K, used 26982419K [0x7f147800,
>>> 0x7f1478808000, 0x7f1c7800)
>>> region size 8192K, 204 young (1671168K), 26 survivors (212992K)
>>> Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
>>> 67584K
>>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation
>> Pause)
>>> (young)
>>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
>>> - age   1:  113878760 bytes,  113878760 total
>>> - age   2:   21264744 bytes,  135143504 total
>>> - age   3:   17020096 bytes,  152163600 total
>>> - age   4:   26870864 bytes,  179034464 total
>>> , 0.0579794 secs]
>>>  [Parallel Time: 46.9 ms, GC Workers: 18]
>>> [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
>>> 4668016046.4, Diff: 0.3]
>>> [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
>>> Sum: 116.9]
>>> [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
>>>[Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum:
>> 113]
>>> [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>> [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
>>> Sum: 0.0]
>>> [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
>>> 428.1]
>>> [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
>>> 228.9]
>>>[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum:
>> 18]
>>> [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum:
>>> 1.2]
>>> [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
>>> Sum: 838.0]
>>> [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
>>> 4668016092.8, Diff: 0.0]
>>>  [Code Root Fixup: 0.2 ms]
>>>  [Code Root Purge: 0.0 ms]
>>>  [Clear CT: 0.3 ms]
>>>  [Other: 10.7 ms]
>>> [Choose CSet: 0.0 ms]
>>> [Ref Proc: 5.9 ms]
>>> [Ref Enq: 0.2 ms]
>>> [Redirty Cards: 0.2 ms]
>>> [Humongous Register: 2.2 ms]
>>> [Humongous Reclaim: 0.4 ms]
>>> [Free CSet: 0.4 ms]
>>>  [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
>>> 25.7G(32.0G)->24.3G(32.0G)]
>>> Heap after GC invocations=1144022 (full 72):
>>> garbage-first heap   total 33554432K, used 25489656K [0x7f147800,
>>> 0x7f1478808000, 0x7f1c7800)
>>> region size 8192K, 10 young 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
Thank you. I looked at the memory footprint: I set the threshold at 75%, and
memory occupancy is at about 76%. Also, our ZooKeeper is not on a dedicated
server, which is perhaps the cause of the instability.

What else do you recommend I check?

2018-02-27 22:37 GMT+08:00 Emir Arnautović :

> This does not show much: only that your heap is around 75% (24-25GB). I
> was thinking that you should compare metrics (heap/GC as well) when running
> on without issues and when running with issues and see if something can be
> concluded.
> About instability: Do you run ZK on dedicated nodes?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 27 Feb 2018, at 14:43, 苗海泉  wrote:
> >
> > Thank you, we were 49 shard 49 nodes, but later found that in this case,
> > often disconnect between solr and zookeepr, zookeeper too many nodes
> caused
> > solr instability, so reduced to 25 A follow-up performance can not keep
> up
> > also need to increase back.
> >
> > Very slow when solr and zookeeper not found any errors, just build the
> > index slow, automatic commit inside the log display is slow, but the main
> > reason may not lie in the commit place.
> >
> > I am sorry, I do not know how to look at the utilization of java heap,
> > through the gc log, gc time is not long, I posted the log:
> >
> >
> > {Heap before GC invocations=1144021 (full 72):
> > garbage-first heap   total 33554432K, used 26982419K [0x7f147800,
> > 0x7f1478808000, 0x7f1c7800)
> >  region size 8192K, 204 young (1671168K), 26 survivors (212992K)
> > Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
> > 67584K
> > 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation
> Pause)
> > (young)
> > Desired survivor size 109051904 bytes, new threshold 1 (max 15)
> > - age   1:  113878760 bytes,  113878760 total
> > - age   2:   21264744 bytes,  135143504 total
> > - age   3:   17020096 bytes,  152163600 total
> > - age   4:   26870864 bytes,  179034464 total
> > , 0.0579794 secs]
> >   [Parallel Time: 46.9 ms, GC Workers: 18]
> >  [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
> > 4668016046.4, Diff: 0.3]
> >  [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
> > Sum: 116.9]
> >  [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
> > [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum:
> 113]
> >  [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
> >  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> > Sum: 0.0]
> >  [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
> > 428.1]
> >  [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
> > 228.9]
> > [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum:
> 18]
> >  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum:
> > 1.2]
> >  [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
> > Sum: 838.0]
> >  [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
> > 4668016092.8, Diff: 0.0]
> >   [Code Root Fixup: 0.2 ms]
> >   [Code Root Purge: 0.0 ms]
> >   [Clear CT: 0.3 ms]
> >   [Other: 10.7 ms]
> >  [Choose CSet: 0.0 ms]
> >  [Ref Proc: 5.9 ms]
> >  [Ref Enq: 0.2 ms]
> >  [Redirty Cards: 0.2 ms]
> >  [Humongous Register: 2.2 ms]
> >  [Humongous Reclaim: 0.4 ms]
> >  [Free CSet: 0.4 ms]
> >   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
> > 25.7G(32.0G)->24.3G(32.0G)]
> > Heap after GC invocations=1144022 (full 72):
> > garbage-first heap   total 33554432K, used 25489656K [0x7f147800,
> > 0x7f1478808000, 0x7f1c7800)
> >  region size 8192K, 10 young (81920K), 10 survivors (81920K)
> > Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
> > 67584K
> > }
> > [Times: user=0.84 sys=0.01, real=0.05 secs]
> > 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which
> application
> > threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141
> > seconds
> > 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end,
> > 2.5757061 secs]
> > 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark
> > 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508
> > secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818
> > secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102
> > secs], 0.0704296 secs]
> > [Times: user=0.85 sys=0.04, real=0.07 secs]
> > 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which
> application
> > threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159
> > seconds
> > 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G),
> > 0.0391915 secs]
> > [Times: 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread Emir Arnautović
This does not show much: only that your heap is around 75% (24-25GB). I was 
thinking that you should compare metrics (heap/GC as well) when running 
without issues and when running with issues and see if something can be 
concluded.
About instability: Do you run ZK on dedicated nodes?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2018, at 14:43, 苗海泉  wrote:
> 
> Thank you, we were 49 shard 49 nodes, but later found that in this case,
> often disconnect between solr and zookeepr, zookeeper too many nodes caused
> solr instability, so reduced to 25 A follow-up performance can not keep up
> also need to increase back.
> 
> Very slow when solr and zookeeper not found any errors, just build the
> index slow, automatic commit inside the log display is slow, but the main
> reason may not lie in the commit place.
> 
> I am sorry, I do not know how to look at the utilization of java heap,
> through the gc log, gc time is not long, I posted the log:
> 
> 
> {Heap before GC invocations=1144021 (full 72):
> garbage-first heap   total 33554432K, used 26982419K [0x7f147800,
> 0x7f1478808000, 0x7f1c7800)
>  region size 8192K, 204 young (1671168K), 26 survivors (212992K)
> Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
> 67584K
> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause)
> (young)
> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
> - age   1:  113878760 bytes,  113878760 total
> - age   2:   21264744 bytes,  135143504 total
> - age   3:   17020096 bytes,  152163600 total
> - age   4:   26870864 bytes,  179034464 total
> , 0.0579794 secs]
>   [Parallel Time: 46.9 ms, GC Workers: 18]
>  [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
> 4668016046.4, Diff: 0.3]
>  [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
> Sum: 116.9]
>  [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
> [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
>  [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
> Sum: 0.0]
>  [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
> 428.1]
>  [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
> 228.9]
> [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum:
> 1.2]
>  [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
> Sum: 838.0]
>  [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
> 4668016092.8, Diff: 0.0]
>   [Code Root Fixup: 0.2 ms]
>   [Code Root Purge: 0.0 ms]
>   [Clear CT: 0.3 ms]
>   [Other: 10.7 ms]
>  [Choose CSet: 0.0 ms]
>  [Ref Proc: 5.9 ms]
>  [Ref Enq: 0.2 ms]
>  [Redirty Cards: 0.2 ms]
>  [Humongous Register: 2.2 ms]
>  [Humongous Reclaim: 0.4 ms]
>  [Free CSet: 0.4 ms]
>   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
> 25.7G(32.0G)->24.3G(32.0G)]
> Heap after GC invocations=1144022 (full 72):
> garbage-first heap   total 33554432K, used 25489656K [0x7f147800,
> 0x7f1478808000, 0x7f1c7800)
>  region size 8192K, 10 young (81920K), 10 survivors (81920K)
> Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
> 67584K
> }
> [Times: user=0.84 sys=0.01, real=0.05 secs]
> 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application
> threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141
> seconds
> 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end,
> 2.5757061 secs]
> 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark
> 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508
> secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818
> secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102
> secs], 0.0704296 secs]
> [Times: user=0.85 sys=0.04, real=0.07 secs]
> 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application
> threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159
> seconds
> 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G),
> 0.0391915 secs]
> [Times: user=0.64 sys=0.00, real=0.04 secs]
> 2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application
> threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684
> seconds
> 2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application
> threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834
> seconds
> {Heap before GC invocations=1144023 (full 72):
> garbage-first heap   total 33554432K, used 27078904K [0x7f147800,
> 0x7f1478808000, 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
Thank you. We originally had 49 shards on 49 nodes, but later found that in
this case Solr and ZooKeeper often disconnected - with so many nodes ZooKeeper
made Solr unstable - so we reduced it to 25. If performance later cannot keep
up, we will need to increase it again.

When it is very slow, we do not find any errors in Solr or ZooKeeper; building
the index is simply slow. The automatic commits shown in the log are also slow,
but the main reason may not lie in the commit itself.

I am sorry, I do not know how to look at the Java heap utilization. From the GC
log, the GC times are not long. I have posted the log:


{Heap before GC invocations=1144021 (full 72):
 garbage-first heap   total 33554432K, used 26982419K [0x7f147800,
0x7f1478808000, 0x7f1c7800)
  region size 8192K, 204 young (1671168K), 26 survivors (212992K)
 Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
67584K
2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 109051904 bytes, new threshold 1 (max 15)
- age   1:  113878760 bytes,  113878760 total
- age   2:   21264744 bytes,  135143504 total
- age   3:   17020096 bytes,  152163600 total
- age   4:   26870864 bytes,  179034464 total
, 0.0579794 secs]
   [Parallel Time: 46.9 ms, GC Workers: 18]
  [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max:
4668016046.4, Diff: 0.3]
  [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9,
Sum: 116.9]
  [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
 [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
  [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
  [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0,
Sum: 0.0]
  [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum:
428.1]
  [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum:
228.9]
 [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
  [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum:
1.2]
  [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3,
Sum: 838.0]
  [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max:
4668016092.8, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 10.7 ms]
  [Choose CSet: 0.0 ms]
  [Ref Proc: 5.9 ms]
  [Ref Enq: 0.2 ms]
  [Redirty Cards: 0.2 ms]
  [Humongous Register: 2.2 ms]
  [Humongous Reclaim: 0.4 ms]
  [Free CSet: 0.4 ms]
   [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap:
25.7G(32.0G)->24.3G(32.0G)]
Heap after GC invocations=1144022 (full 72):
 garbage-first heap   total 33554432K, used 25489656K [0x7f147800,
0x7f1478808000, 0x7f1c7800)
  region size 8192K, 10 young (81920K), 10 survivors (81920K)
 Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
67584K
}
 [Times: user=0.84 sys=0.01, real=0.05 secs]
2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application
threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141
seconds
2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end,
2.5757061 secs]
2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark
2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508
secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818
secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102
secs], 0.0704296 secs]
 [Times: user=0.85 sys=0.04, real=0.07 secs]
2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application
threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159
seconds
2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G),
0.0391915 secs]
 [Times: user=0.64 sys=0.00, real=0.04 secs]
2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application
threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684
seconds
2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application
threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834
seconds
{Heap before GC invocations=1144023 (full 72):
 garbage-first heap   total 33554432K, used 27078904K [0x7f147800,
0x7f1478808000, 0x7f1c7800)
  region size 8192K, 204 young (1671168K), 10 survivors (81920K)
 Metaspace   used 41184K, capacity 41752K, committed 67072K, reserved
67584K
2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause)
(young)
Desired survivor size 109051904 bytes, new threshold 15 (max 15)
- age   1:   47719032 bytes,   47719032 total
, 0.0554183 secs]
   [Parallel Time: 48.0 ms, GC Workers: 18]
  [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max:
4668018329.3, Diff: 0.3]
  [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6,
Sum: 103.0]
  [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 

Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread Emir Arnautović
Ah, so there are ~560 shards per node and not all nodes are indexing at the 
same time. Why is that? You can get better throughput by indexing on all 
nodes. If you are happy with the shard size, you can create a new collection 
with 49 shards every 2h, keep everything else the same, and index on all nodes.
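
For illustration, here is a hedged SolrJ sketch of that rollover idea: create a new 49-shard, 1-replica collection per 2-hour bucket. It uses the newer SolrJ 7+ builder style (the thread is on Solr 6.0, where the client construction differs), and the ZK ensemble, config set and naming scheme are placeholders:

```java
import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CollectionRollover {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble, config set name and collection naming scheme.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {
            // One collection per 2-hour bucket, 49 shards x 1 replica spread over 49 nodes.
            String name = "events_" + (System.currentTimeMillis() / (2 * 60 * 60 * 1000L));
            CollectionAdminRequest.createCollection(name, "myconfig", 49, 1)
                    .setMaxShardsPerNode(1)
                    .process(client);
            System.out.println("Created collection " + name);
        }
    }
}
```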

Back to the main question: what is the heap utilisation? When you restart a 
node, what is the heap utilisation? Do you see any errors in your logs? Do you 
see any errors in the ZK logs?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2018, at 13:22, 苗海泉  wrote:
> 
> Thanks  for you reply again.
> I just said that you may have some misunderstanding, we have 49 solr nodes,
> each collection has 25 shards, each shard has only one replica of the data,
> there is no copy, and I reduce the part of the cache. If you need the
> metric data, I can check Come out to tell you, in addition we are only
> additional system, there will not be any change action.
> 
> 2018-02-27 20:05 GMT+08:00 Emir Arnautović :
> 
>> Hi,
>> It is hard to tell without looking more into your metrics. It seems to me
>> that you are reaching limits of your cluster. I would doublecheck if memory
>> is the issue. If I got it right, you have ~1120 shards per node. It takes
>> some heap just to keep them open. If you have some caches enabled and if it
>> is append only system, old shards will keep caches until reloaded.
>> Probably will not make much diff, but with 25x2=50 shards and 49 nodes,
>> one node will need to handle double indexing load.
>> 
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 27 Feb 2018, at 12:54, 苗海泉  wrote:
>>> 
>>> In addition, we found that the rate was normal when the number of
>>> collections was kept below 936 and the speed was slower and slower at
>> 984.
>>> Therefore, we could only temporarily delete the older collection, but now
>>> we need more Online collection, there has been no good way to confuse us
>>> for a long time, very much hope to give a solution to the problem of
>> ideas,
>>> greatly appreciated
>>> 
>>> 2018-02-27 19:46 GMT+08:00 苗海泉 :
>>> 
 Thank you for reply.
 One collection has 25 shard one replica, one solr node has about 5T on
 desk.
 GC is checked ,and modify as follow :
 SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
 GC_TUNE=" \
 -XX:+UseG1GC \
 -XX:+PerfDisableSharedMem \
 -XX:+ParallelRefProcEnabled \
 -XX:G1HeapRegionSize=8m \
 -XX:MaxGCPauseMillis=250 \
 -XX:InitiatingHeapOccupancyPercent=75 \
 -XX:+UseLargePages \
 -XX:+AggressiveOpts \
 -XX:+UseLargePages"
 
 2018-02-27 19:27 GMT+08:00 Emir Arnautović <
>> emir.arnauto...@sematext.com>:
 
> Hi,
> To get more complete picture, can you tell us how many shards/replicas
>> do
> you have per collection? Also what is index size on disk? Did you
>> check GC?
> 
> BTW, using 32GB heap prevents you from using compressed oops, resulting
> in less memory available than 31GB.
> 
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
> 
> 
> 
>> On 27 Feb 2018, at 11:36, 苗海泉  wrote:
>> 
>> I encountered a more serious problem in the process of using solr. We
> use
>> the solr version is 6.0, our daily amount of data is about 500 billion
>> documents, create a collection every hour, the online collection of
>> more
>> than a thousand, 49 solr nodes. If the collection in less than 800,
>> the
>> speed is still very fast, if the collection of the number of 1100 or
>> so,
>> the construction of solr index will drop sharply, one of the original
>> program speed of about 2-3 million TPS, Dropped to only a few hundred
>> or
>> even tens of TPS, who have encountered a similar situation, there is
>> no
>> good idea to find this issue. By the way, solr a node memory we
>> assigned
>> 32G,We checked the memory, cpu, disk IO, network IO occupancy is no
>> problem, belong to the normal state. Which friend encountered a
>> similar
>> problem, please inform the solution, thank you very much.
> 
> 
 
 
 --
 ==
 联创科技
 知行如一
 ==
 
>>> 
>>> 
>>> 
>>> --
>>> ==
>>> 联创科技
>>> 知行如一
>>> ==
>> 
>> 
> 
> 
> -- 
> ==
> 联创科技
> 知行如一
> ==



Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
Thanks for your reply again.
Just to clear up a possible misunderstanding: we have 49 Solr nodes, each
collection has 25 shards, and each shard has only one replica of the data -
there are no extra copies - and I have reduced part of the cache. If you need
the metric data, I can look it up and tell you. In addition, ours is an
append-only system; there are no update or delete operations.

2018-02-27 20:05 GMT+08:00 Emir Arnautović :

> Hi,
> It is hard to tell without looking more into your metrics. It seems to me
> that you are reaching limits of your cluster. I would doublecheck if memory
> is the issue. If I got it right, you have ~1120 shards per node. It takes
> some heap just to keep them open. If you have some caches enabled and if it
> is append only system, old shards will keep caches until reloaded.
> Probably will not make much diff, but with 25x2=50 shards and 49 nodes,
> one node will need to handle double indexing load.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 27 Feb 2018, at 12:54, 苗海泉  wrote:
> >
> > In addition, we found that the rate was normal when the number of
> > collections was kept below 936 and the speed was slower and slower at
> 984.
> > Therefore, we could only temporarily delete the older collection, but now
> > we need more Online collection, there has been no good way to confuse us
> > for a long time, very much hope to give a solution to the problem of
> ideas,
> > greatly appreciated
> >
> > 2018-02-27 19:46 GMT+08:00 苗海泉 :
> >
> >> Thank you for reply.
> >> One collection has 25 shard one replica, one solr node has about 5T on
> >> desk.
> >> GC is checked ,and modify as follow :
> >> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
> >> GC_TUNE=" \
> >> -XX:+UseG1GC \
> >> -XX:+PerfDisableSharedMem \
> >> -XX:+ParallelRefProcEnabled \
> >> -XX:G1HeapRegionSize=8m \
> >> -XX:MaxGCPauseMillis=250 \
> >> -XX:InitiatingHeapOccupancyPercent=75 \
> >> -XX:+UseLargePages \
> >> -XX:+AggressiveOpts \
> >> -XX:+UseLargePages"
> >>
> >> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <
> emir.arnauto...@sematext.com>:
> >>
> >>> Hi,
> >>> To get more complete picture, can you tell us how many shards/replicas
> do
> >>> you have per collection? Also what is index size on disk? Did you
> check GC?
> >>>
> >>> BTW, using 32GB heap prevents you from using compressed oops, resulting
> >>> in less memory available than 31GB.
> >>>
> >>> Thanks,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>
> >>>
> >>>
>  On 27 Feb 2018, at 11:36, 苗海泉  wrote:
> 
>  I encountered a more serious problem in the process of using solr. We
> >>> use
>  the solr version is 6.0, our daily amount of data is about 500 billion
>  documents, create a collection every hour, the online collection of
> more
>  than a thousand, 49 solr nodes. If the collection in less than 800,
> the
>  speed is still very fast, if the collection of the number of 1100 or
> so,
>  the construction of solr index will drop sharply, one of the original
>  program speed of about 2-3 million TPS, Dropped to only a few hundred
> or
>  even tens of TPS, who have encountered a similar situation, there is
> no
>  good idea to find this issue. By the way, solr a node memory we
> assigned
>  32G,We checked the memory, cpu, disk IO, network IO occupancy is no
>  problem, belong to the normal state. Which friend encountered a
> similar
>  problem, please inform the solution, thank you very much.
> >>>
> >>>
> >>
> >>
> >> --
> >> ==
> >> 联创科技
> >> 知行如一
> >> ==
> >>
> >
> >
> >
> > --
> > ==
> > 联创科技
> > 知行如一
> > ==
>
>


-- 
==
联创科技
知行如一
==


Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread Emir Arnautović
Hi,
It is hard to tell without looking more into your metrics. It seems to me that 
you are reaching limits of your cluster. I would doublecheck if memory is the 
issue. If I got it right, you have ~1120 shards per node. It takes some heap 
just to keep them open. If you have some caches enabled and if it is append 
only system, old shards will keep caches until reloaded.
Probably will not make much diff, but with 25x2=50 shards and 49 nodes, one 
node will need to handle double indexing load.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2018, at 12:54, 苗海泉  wrote:
> 
> In addition, we found that the rate was normal when the number of
> collections was kept below 936 and the speed was slower and slower at 984.
> Therefore, we could only temporarily delete the older collection, but now
> we need more Online collection, there has been no good way to confuse us
> for a long time, very much hope to give a solution to the problem of ideas,
> greatly appreciated
> 
> 2018-02-27 19:46 GMT+08:00 苗海泉 :
> 
>> Thank you for reply.
>> One collection has 25 shard one replica, one solr node has about 5T on
>> desk.
>> GC is checked ,and modify as follow :
>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
>> GC_TUNE=" \
>> -XX:+UseG1GC \
>> -XX:+PerfDisableSharedMem \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=8m \
>> -XX:MaxGCPauseMillis=250 \
>> -XX:InitiatingHeapOccupancyPercent=75 \
>> -XX:+UseLargePages \
>> -XX:+AggressiveOpts \
>> -XX:+UseLargePages"
>> 
>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović :
>> 
>>> Hi,
>>> To get more complete picture, can you tell us how many shards/replicas do
>>> you have per collection? Also what is index size on disk? Did you check GC?
>>> 
>>> BTW, using 32GB heap prevents you from using compressed oops, resulting
>>> in less memory available than 31GB.
>>> 
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
 On 27 Feb 2018, at 11:36, 苗海泉  wrote:
 
 I encountered a more serious problem in the process of using solr. We
>>> use
 the solr version is 6.0, our daily amount of data is about 500 billion
 documents, create a collection every hour, the online collection of more
 than a thousand, 49 solr nodes. If the collection in less than 800, the
 speed is still very fast, if the collection of the number of 1100 or so,
 the construction of solr index will drop sharply, one of the original
 program speed of about 2-3 million TPS, Dropped to only a few hundred or
 even tens of TPS, who have encountered a similar situation, there is no
 good idea to find this issue. By the way, solr a node memory we assigned
 32G,We checked the memory, cpu, disk IO, network IO occupancy is no
 problem, belong to the normal state. Which friend encountered a similar
 problem, please inform the solution, thank you very much.
>>> 
>>> 
>> 
>> 
>> --
>> ==
>> 联创科技
>> 知行如一
>> ==
>> 
> 
> 
> 
> -- 
> ==
> 联创科技
> 知行如一
> ==



Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
In addition, we found that the rate was normal when the number of
collections was kept below 936, but the speed became slower and slower at
around 984 collections. So far we could only temporarily delete the older
collections, but now we need to keep more collections online. This has
puzzled us for a long time with no good way forward, so any ideas on how to
approach the problem would be greatly appreciated.

2018-02-27 19:46 GMT+08:00 苗海泉 :

> Thank you for reply.
> One collection has 25 shard one replica, one solr node has about 5T on
> desk.
> GC is checked ,and modify as follow :
> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+PerfDisableSharedMem \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=250 \
> -XX:InitiatingHeapOccupancyPercent=75 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> -XX:+UseLargePages"
>
> 2018-02-27 19:27 GMT+08:00 Emir Arnautović :
>
>> Hi,
>> To get more complete picture, can you tell us how many shards/replicas do
>> you have per collection? Also what is index size on disk? Did you check GC?
>>
>> BTW, using 32GB heap prevents you from using compressed oops, resulting
>> in less memory available than 31GB.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 27 Feb 2018, at 11:36, 苗海泉  wrote:
>> >
>> > I encountered a more serious problem in the process of using solr. We
>> use
>> > the solr version is 6.0, our daily amount of data is about 500 billion
>> > documents, create a collection every hour, the online collection of more
>> > than a thousand, 49 solr nodes. If the collection in less than 800, the
>> > speed is still very fast, if the collection of the number of 1100 or so,
>> > the construction of solr index will drop sharply, one of the original
>> > program speed of about 2-3 million TPS, Dropped to only a few hundred or
>> > even tens of TPS, who have encountered a similar situation, there is no
>> > good idea to find this issue. By the way, solr a node memory we assigned
>> > 32G,We checked the memory, cpu, disk IO, network IO occupancy is no
>> > problem, belong to the normal state. Which friend encountered a similar
>> > problem, please inform the solution, thank you very much.
>>
>>
>
>
> --
> ==
> 联创科技
> 知行如一
> ==
>



-- 
==
联创科技
知行如一
==


Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
Thank you for the reply.
Each collection has 25 shards with one replica, and one Solr node holds about 5 TB on disk.
GC was checked and modified as follows:
SOLR_JAVA_MEM="-Xms32768m -Xmx32768m "
GC_TUNE=" \
-XX:+UseG1GC \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=250 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:+UseLargePages \
-XX:+AggressiveOpts \
-XX:+UseLargePages"

2018-02-27 19:27 GMT+08:00 Emir Arnautović :

> Hi,
> To get more complete picture, can you tell us how many shards/replicas do
> you have per collection? Also what is index size on disk? Did you check GC?
>
> BTW, using 32GB heap prevents you from using compressed oops, resulting in
> less memory available than 31GB.
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 27 Feb 2018, at 11:36, 苗海泉  wrote:
> >
> > I encountered a more serious problem in the process of using solr. We use
> > the solr version is 6.0, our daily amount of data is about 500 billion
> > documents, create a collection every hour, the online collection of more
> > than a thousand, 49 solr nodes. If the collection in less than 800, the
> > speed is still very fast, if the collection of the number of 1100 or so,
> > the construction of solr index will drop sharply, one of the original
> > program speed of about 2-3 million TPS, Dropped to only a few hundred or
> > even tens of TPS, who have encountered a similar situation, there is no
> > good idea to find this issue. By the way, solr a node memory we assigned
> > 32G,We checked the memory, cpu, disk IO, network IO occupancy is no
> > problem, belong to the normal state. Which friend encountered a similar
> > problem, please inform the solution, thank you very much.
>
>


-- 
==
联创科技
知行如一
==


Re: When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread Emir Arnautović
Hi,
To get more complete picture, can you tell us how many shards/replicas do you 
have per collection? Also what is index size on disk? Did you check GC?

BTW, using 32GB heap prevents you from using compressed oops, resulting in less 
memory available than 31GB.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2018, at 11:36, 苗海泉  wrote:
> 
> I encountered a more serious problem in the process of using solr. We use
> the solr version is 6.0, our daily amount of data is about 500 billion
> documents, create a collection every hour, the online collection of more
> than a thousand, 49 solr nodes. If the collection in less than 800, the
> speed is still very fast, if the collection of the number of 1100 or so,
> the construction of solr index will drop sharply, one of the original
> program speed of about 2-3 million TPS, Dropped to only a few hundred or
> even tens of TPS, who have encountered a similar situation, there is no
> good idea to find this issue. By the way, solr a node memory we assigned
> 32G,We checked the memory, cpu, disk IO, network IO occupancy is no
> problem, belong to the normal state. Which friend encountered a similar
> problem, please inform the solution, thank you very much.
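
As a concrete illustration of the compressed-oops point above (a sketch, not
a setting taken from this thread): keeping the heap just under 32 GB, for
example

SOLR_JAVA_MEM="-Xms31g -Xmx31g"

normally leaves compressed ordinary object pointers enabled, whereas a 32 GB
heap disables them and effectively gives you less usable memory. Whether a
given JVM is actually using them can be checked with the standard HotSpot
flag -XX:+PrintFlagsFinal and looking for UseCompressedOops in the output.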



When the number of collections exceeds one thousand, the construction of indexing speed drops sharply

2018-02-27 Thread 苗海泉
I have encountered a rather serious problem while using Solr. We use Solr
6.0, our daily data volume is about 500 billion documents, we create a new
collection every hour, there are more than a thousand collections online,
and we run 49 Solr nodes. When there are fewer than 800 collections,
indexing is still very fast, but at around 1100 collections the Solr
indexing throughput drops sharply: a program that originally ran at about
2-3 million TPS drops to only a few hundred or even tens of TPS. We have
not found a good lead on this issue. By the way, each Solr node is assigned
32G of memory, and we checked memory, CPU, disk IO and network IO usage and
found no problem; everything looks normal. If anyone has encountered a
similar problem, please share the solution, thank you very much.


Re: Slow indexing speed when collection size is large

2017-05-07 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?
A) Yes, they are happening on the same Solr server, but currently, only the
indexing from a DB is running.

Is Solr in a virtual machine?
A) No

Is the 384GB at the hypervisor level, or the virtual machine level?
A) The hypervisor level. The virtual machine for the Sybase is allocated
64GB of memory.

Is the 22GB heap the total heap memory, or is that per Solr instance?
A) Per Solr instance.

It's only the Sybase database that is running on a virtual machine under
Hyper-V. Solr is running on the main server.
The main server is running on Windows 2012, while the virtual machine is
running on SUSE Linux 9. Both Solr instances are running on SSD drive,
while the virtual machine is running on normal hard disk.

What is the best suggestion for the 5TB of indexes? The searching speed is
quite fast currently, even during indexing. It is the indexing speed that
is slow.

Regards,
Edwin



On 7 May 2017 at 21:14, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> > For my rich documentation handling, I'm using Extracting Request
> Handler, and it requires OCR.
> >
> > However, currently, for the slow indexing speed which I'm experiencing,
> the indexing is done directly from the Sybase database. I will fetch about
> 1000 records at a time from Sybase, and stored in into a CacheRowSet for it
> to be indexed. The query to the Sybase database is quite fast, and most of
> the time is spend on processes in the CacheRowSet.
> 
> > A) 384 GB
> 
> > A) 22 GB
> 
> > A) 5 TB
> 
> > A) A virtual machine with Sybase database is running on the server
>
> The discussion about the drawbacks of the Extracting Request Handler has
> already taken place.  Tika should be running on separate hardware, not
> embedded in Solr.  Having high-impact Tika processing run on the Solr
> server is going to slow everything down.
>
> Are the two types of indexing (ERH with OCR, and indexing from a DB)
> happening on the same Solr server?
>
> As soon as you mention virtual machines, my mental picture of the setup
> becomes much less clear.  You'll need to fully describe the OS and
> hardware setup, at both the hypervisor and virtual machine level.  Then
> I will know what questions to ask for more detailed information.
>
> Is Solr in a virtual machine?
> Is the 384GB at the hypervisor level, or the virtual machine level?
> Is the 22GB heap the total heap memory, or is that per Solr instance?
>
> If the 5TB is Solr index data, then there's no way you're going to get
> fast performance.  Putting enough memory in one machine to effectively
> cache that much data is impractically expensive, and most server
> hardware doesn't have enough memory slots even if you do have the
> money.  384GB wouldn't be enough for 5TB of index, and that's not even
> taking into account the memory needed by your software, including Solr
> and Sybase.
>
> Thanks,
> Shawn
>
>


Re: Slow indexing speed when collection size is large

2017-05-07 Thread Shawn Heisey
On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich documentation handling, I'm using Extracting Request Handler, and 
> it requires OCR.
>
> However, currently, for the slow indexing speed which I'm experiencing, the 
> indexing is done directly from the Sybase database. I will fetch about 1000 
> records at a time from Sybase, and stored in into a CacheRowSet for it to be 
> indexed. The query to the Sybase database is quite fast, and most of the time 
> is spend on processes in the CacheRowSet.

> A) 384 GB

> A) 22 GB

> A) 5 TB

> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has
already taken place.  Tika should be running on separate hardware, not
embedded in Solr.  Having high-impact Tika processing run on the Solr
server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup
becomes much less clear.  You'll need to fully describe the OS and
hardware setup, at both the hypervisor and virtual machine level.  Then
I will know what questions to ask for more detailed information.

Is Solr in a virtual machine?
Is the 384GB at the hypervisor level, or the virtual machine level?
Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get
fast performance.  Putting enough memory in one machine to effectively
cache that much data is impractically expensive, and most server
hardware doesn't have enough memory slots even if you do have the
money.  384GB wouldn't be enough for 5TB of index, and that's not even
taking into account the memory needed by your software, including Solr
and Sybase.

Thanks,
Shawn



Re: Slow indexing speed when collection size is large

2017-05-06 Thread Zheng Lin Edwin Yeo
Hi Shawn,

For my rich documentation handling, I'm using Extracting Request Handler,
and it requires OCR.

However, currently, for the slow indexing speed which I'm experiencing, the
indexing is done directly from the Sybase database. I fetch about 1000
records at a time from Sybase and store them into a CacheRowSet for them to
be indexed. The query to the Sybase database is quite fast, and most of the
time is spent on processing in the CacheRowSet.


Here are the answers to the other questions:

On a single Solr server, how much total memory is installed?
A) 384 GB

What is the total amount of memory reserved for Solr heaps on that server?
A) 22 GB

What is the total on-disk size of all the Solr indexes on that server?
A) 5 TB

-- Multiple replicas must be included if they are present on one machine.
From the core (shard replica) perspective, how many documents are on
that server?
A) About 200 million documents for both replica. Each replica is about 100
million. Currently, both replicas are in the same server, but different
disk.

-- Multiple replicas must be included here too.
Is there software other than the Solr server process(es) running on that
server?
A) A virtual machine with Sybase database is running on the server

Are you making queries at the same time you're indexing?
A) Only occasionally. Most of the time, there is no queries made.

Regards,
Edwin



On 6 May 2017 at 20:41, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/1/2017 10:17 AM, Zheng Lin Edwin Yeo wrote:
> > I'm using Solrj for the indexing, not using curl. Normally I bundle
> > about 1000 documents for each POST. There's more than 300GB of RAM for
> > that server, and I do not use any sharing at the moment.
>
> Looking over your email history on the list, I was able to determine
> some information, but not everything I was wondering about.  I have some
> questions.
>
> Are you still using the Extracting Request Handler for your rich
> document handling, or have you moved Tika processing outside Solr?
> If it's outside Solr, is it on different machines?
> Are your rich documents still requiring OCR?
>
> Other questions:
>
> On a single Solr server, how much total memory is installed?
> What is the total amount of memory reserved for Solr heaps on that server?
> What is the total on-disk size of all the Solr indexes on that server?
> -- Multiple replicas must be included if they are present on one machine.
> From the core (shard replica) perspective, how many documents are on
> that server?
> -- Multiple replicas must be included here too.
> Is there software other than the Solr server process(es) running on that
> server?
> Are you making queries at the same time you're indexing?
>
> Thanks,
> Shawn
>
>
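
A minimal SolrJ sketch of the batching approach described above (pull rows
from the database and send roughly 1000 documents per add request) might look
like the following. The JDBC URL, SQL, Solr URL and field names are
placeholders for illustration, not details taken from this thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details; replace with the real JDBC URL and core/collection.
    try (Connection conn = DriverManager.getConnection("jdbc:sybase:Tds:dbhost:5000/mydb", "user", "pass");
         SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, title, content FROM documents")) {

      List<SolrInputDocument> batch = new ArrayList<>();
      while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getString("id"));
        doc.addField("title", rs.getString("title"));
        doc.addField("content", rs.getString("content"));
        batch.add(doc);

        if (batch.size() >= 1000) {   // send ~1000 documents per request
          solr.add(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        solr.add(batch);
      }
      solr.commit();   // for large loads it is usually better to rely on autoCommit
    }
  }
}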


Re: Slow indexing speed when collection size is large

2017-05-06 Thread Shawn Heisey
On 5/1/2017 10:17 AM, Zheng Lin Edwin Yeo wrote:
> I'm using Solrj for the indexing, not using curl. Normally I bundle
> about 1000 documents for each POST. There's more than 300GB of RAM for
> that server, and I do not use any sharing at the moment.

Looking over your email history on the list, I was able to determine
some information, but not everything I was wondering about.  I have some
questions.

Are you still using the Extracting Request Handler for your rich
document handling, or have you moved Tika processing outside Solr?
If it's outside Solr, is it on different machines?
Are your rich documents still requiring OCR?

Other questions:

On a single Solr server, how much total memory is installed?
What is the total amount of memory reserved for Solr heaps on that server?
What is the total on-disk size of all the Solr indexes on that server?
-- Multiple replicas must be included if they are present on one machine.
From the core (shard replica) perspective, how many documents are on
that server?
-- Multiple replicas must be included here too.
Is there software other than the Solr server process(es) running on that
server?
Are you making queries at the same time you're indexing?

Thanks,
Shawn



Re: Slow indexing speed when collection size is large

2017-05-01 Thread Zheng Lin Edwin Yeo
Hi Rick,

I'm using Solrj for the indexing, not using curl.
Normally I bundle about 1000 documents for each POST.
There's more than 300GB of RAM for that server, and I do not use any
sharing at the moment.

Regards,
Edwin


On 1 May 2017 at 19:08, Rick Leir <rl...@leirtech.com> wrote:

> Zheng,
> Are you POSTing using curl? Get several processes working in parallel to
> get a small boost. Solrj should speed you up a bit too (numbers anyone?).
> How many documents do you bundle in a POST?
>
> Do you have lots of RAM? Sharding?
> Cheers -- Rick
>
> On April 30, 2017 10:39:29 PM EDT, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com> wrote:
> >Hi,
> >
> >I'm using Solr 6.4.2.
> >
> >Would like to check, if there are alot of collections in my Solr which
> >has
> >very large index size, will the indexing speed be affected?
> >
> >Currently, I have created a new collections in Solr which has several
> >collections with very large index size, and the indexing speed is much
> >slower than expected.
> >
> >Regards,
> >Edwin
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: Slow indexing speed when collection size is large

2017-05-01 Thread Rick Leir
Zheng,
Are you POSTing using curl? Get several processes working in parallel to get a 
small boost. Solrj should speed you up a bit too (numbers anyone?). How many 
documents do you bundle in a POST? 

Do you have lots of RAM? Sharding?
Cheers -- Rick

On April 30, 2017 10:39:29 PM EDT, Zheng Lin Edwin Yeo <edwinye...@gmail.com> 
wrote:
>Hi,
>
>I'm using Solr 6.4.2.
>
>Would like to check, if there are alot of collections in my Solr which
>has
>very large index size, will the indexing speed be affected?
>
>Currently, I have created a new collections in Solr which has several
>collections with very large index size, and the indexing speed is much
>slower than expected.
>
>Regards,
>Edwin

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 
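
One way to get the kind of parallelism Rick describes from SolrJ is
ConcurrentUpdateSolrClient, which buffers documents and flushes them to Solr
with several background threads. The URL, queue size and thread count below
are arbitrary example values rather than tuned numbers from this thread:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexing {
  public static void main(String[] args) throws Exception {
    // Buffers documents and sends them to Solr using 4 background threads.
    try (ConcurrentUpdateSolrClient solr =
             new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/collection1")
                 .withQueueSize(1000)
                 .withThreadCount(4)
                 .build()) {

      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title", "document " + i);
        solr.add(doc);               // returns quickly; the send happens in the background
      }
      solr.blockUntilFinished();     // wait for all queued updates to reach Solr
      solr.commit();
    }
  }
}

The main trade-off is that this client reports indexing errors less directly
than HttpSolrClient, so it is mostly used for bulk loads where throughput
matters more than per-document error handling.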

Slow indexing speed when collection size is large

2017-04-30 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 6.4.2.

Would like to check: if there are a lot of collections in my Solr setup with
very large index sizes, will the indexing speed be affected?

Currently, I have created a new collection in a Solr setup which already has
several collections with very large index sizes, and the indexing speed is
much slower than expected.

Regards,
Edwin


Re: Indexing speed reduced significantly with OCR

2017-03-31 Thread Zheng Lin Edwin Yeo
This is my comparison of the indexing speed with and without Tesseract OCR.
The smaller file takes longer to index, probably because there is more
text to OCR, as compared to the bigger file, which has less text.
Is that usually the case?

*With Tesseract OCR*

174KB - 5.20 sec

446KB - 2.45 sec



*Without Tesseract OCR*

174KB - 0.77 sec

446KB - 0.23 sec


Regards,

Edwin

On 31 March 2017 at 03:57, Phil Scadden <p.scad...@gns.cri.nz> wrote:

> Yes, that would seem an accurate assessment of the problem.
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: Thursday, 30 March 2017 4:53 p.m.
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
>
> Thanks for your reply.
>
> From what I see, getting more hardware to do the OCR is inevitable?
>
> Even if we run the OCR outside of Solr indexing stream, it will still take
> a long time to process it if it is on just one machine. And we still need
> to wait for the OCR to finish converting before we can run the indexing to
> Solr.
>
> Regards,
> Edwin
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>


RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Phil Scadden
Yes, that would seem an accurate assessment of the problem.

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Thursday, 30 March 2017 4:53 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR

Thanks for your reply.

From what I see, getting more hardware to do the OCR is inevitable?

Even if we run the OCR outside of Solr indexing stream, it will still take a 
long time to process it if it is on just one machine. And we still need to wait 
for the OCR to finish converting before we can run the indexing to Solr.

Regards,
Edwin
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Re: Indexing speed reduced significantly with OCR

2017-03-30 Thread Walter Underwood
As I said before, this is a great application for pay-as-needed cloud servers.

Netflix’s first use of Amazon EC2 was encoding movies for different screen 
sizes, data rates, codecs, and DRM. They would fire up a hundred or a thousand 
instances, feed movies to them, pick up the encodes, then release the 
instances. 

Ten years later, Amazon offers a service to do that (Elastic Transcoder): 
https://aws.amazon.com/elastictranscoder/ 
<https://aws.amazon.com/elastictranscoder/>

Here is an example of configuring OCR using Amazon Lambda, which is how I would 
do it, both for OCR and PDF.

http://stackoverflow.com/questions/33588262/tesseract-ocr-on-aws-lambda-via-virtualenv/35724894#35724894
 
<http://stackoverflow.com/questions/33588262/tesseract-ocr-on-aws-lambda-via-virtualenv/35724894#35724894>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 30, 2017, at 5:50 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
>> Note that the OCRing is a separate task from Solr indexing, and is best done 
>> on separate machines.
> 
> +1
> 
> -Original Message-
> From: Rick Leir [mailto:rl...@leirtech.com] 
> Sent: Thursday, March 30, 2017 7:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
> 
> The workflow is
> -/ OCR new documents
> -/ check quality and tune until you get good output text -/ keep the output 
> text in the file system
> 
> -/ index and re-index to Solr as necessary from the file system 
> 
> Note that the OCRing is a separate task from Solr indexing, and is best done 
> on separate machines. I used all the old 'surplus' servers for OCR.
> Cheers -- Rick
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.



RE: Indexing speed reduced significantly with OCR

2017-03-30 Thread Allison, Timothy B.
> Note that the OCRing is a separate task from Solr indexing, and is best done 
> on separate machines.

+1

-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com] 
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significantly with OCR

The workflow is
-/ OCR new documents
-/ check quality and tune until you get good output text -/ keep the output 
text in the file system

-/ index and re-index to Solr as necessary from the file system 

Note that the OCRing is a separate task from Solr indexing, and is best done on 
separate machines. I used all the old 'surplus' servers for OCR.
Cheers -- Rick
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Indexing speed reduced significantly with OCR

2017-03-30 Thread Rick Leir
The workflow is
-/ OCR new documents
-/ check quality and tune until you get good output text 
-/ keep the output text in the file system

-/ index and re-index to Solr as necessary from the file system 

Note that the OCRing is a separate task from Solr indexing, and is best done on 
separate machines. I used all the old 'surplus' servers for OCR.
Cheers -- Rick
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Indexing speed reduced significantly with OCR

2017-03-29 Thread Zheng Lin Edwin Yeo
Thanks for your reply.

From what I see, getting more hardware to do the OCR is inevitable?

Even if we run the OCR outside of Solr indexing stream, it will still take
a long time to process it if it is on just one machine. And we still need
to wait for the OCR to finish converting before we can run the indexing to
Solr.

Regards,
Edwin


On 29 March 2017 at 04:40, Phil Scadden  wrote:

> Well I haven’t had to deal with a problem that size, but it seems to me
> that you have little alternative except through more computer hardware at
> it. For the job I did, I OCRed to convert PDF to searchable PDF outside the
> indexing workflow. I used pdftotext utility to extract text from pdf. If
> text extracted was <1% document size, then I assumed it needed to be OCRed
> otherwise didn’t bother. You could look at a more sophisticated method to
> determine whether OCR was necessary. Doing it outside indexing stream means
> you can use different hardware for OCR. Converting to searchable PDF means
> you do it only once - a reindex doesn’t need to do OCR.
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>


RE: Indexing speed reduced significantly with OCR

2017-03-28 Thread Phil Scadden
Well, I haven’t had to deal with a problem that size, but it seems to me that
you have little alternative except to throw more computer hardware at it. For
the job I did, I OCRed to convert PDFs to searchable PDFs outside the indexing
workflow. I used the pdftotext utility to extract text from the PDF. If the
extracted text was <1% of the document size, then I assumed it needed to be
OCRed; otherwise I didn’t bother. You could look at a more sophisticated method
to determine whether OCR was necessary. Doing it outside the indexing stream
means you can use different hardware for OCR. Converting to searchable PDF
means you do it only once - a reindex doesn’t need to do OCR.
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.
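
A small sketch of the heuristic Phil describes (run pdftotext and treat the
PDF as needing OCR when the extracted text is under 1% of the file size).
pdftotext here is the standard Poppler/Xpdf command-line utility; the class
name and paths are illustrative only:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

public class NeedsOcrCheck {
  // True when the text pdftotext can extract is under 1% of the PDF size,
  // i.e. the document is probably a scanned image and should go to OCR.
  static boolean needsOcr(Path pdf) throws Exception {
    Path txt = Files.createTempFile("extracted", ".txt");
    Process p = new ProcessBuilder("pdftotext", pdf.toString(), txt.toString())
        .redirectErrorStream(true)
        .start();
    p.waitFor();

    long pdfSize = Files.size(pdf);
    long textSize = Files.size(txt);
    Files.deleteIfExists(txt);
    return textSize < pdfSize / 100;
  }

  public static void main(String[] args) throws Exception {
    Path pdf = new File(args[0]).toPath();
    System.out.println(args[0] + (needsOcr(pdf) ? " -> send to OCR" : " -> index directly"));
  }
}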


Re: Indexing speed reduced significantly with OCR

2017-03-28 Thread Walter Underwood
Converting from PDF to text is embarrassingly parallel. You can throw as many 
machines at it as you want. This is a great time to use a cloud computing 
service. Need 1000 machines? No problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 28, 2017, at 2:52 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> 
> Hi,
> 
> Do you have suggestions that we can do to cope with the expensive process
> of indexing documents which requires OCR.
> 
> For my current situation, the indexing takes about 2 weeks to complete. If
> the average indexing speed is say to be 50 times slower, it means it will
> require 100 weeks to index the same amount of documents, which is not
> viable. I have several terabytes of PDF documents to index for the actual
> data, and many of them are scanned image, which requires OCR.
> 
> Regards,
> Edwin
> 
> 
> On 28 March 2017 at 13:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> 
>> Yes, the sample document sizes are not very big. And also, the sample
>> documents have a mixture of documents that consists of inline images, and
>> also documents which are searchable (text extractable without OCR)
>> 
>> I suppose only those documents which requires OCR will slow down the
>> indexing? Which is why the total average is only slowing down by 10 times.
>> 
>> Regards,
>> Edwin
>> 
>> 
>> On 28 March 2017 at 12:06, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>> 
>>> Only by 10? You must have quite small documents. OCR is extremely
>>> expensive process. Indexing is trivial by comparison. For quite large
>>> documents I am working with OCR can be 100 times slower than indexing a PDF
>>> that is searchable (text extractable without OCR).
>>> 
>>> -Original Message-
>>> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
>>> Sent: Tuesday, 28 March 2017 4:13 p.m.
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing speed reduced significantly with OCR
>>> 
>>> Hi,
>>> 
>>> Does the indexing speed of Solr reduced significantly when we are using
>>> Tesseract OCR to extract scanned inline images from PDF?
>>> 
>>> I found that after I implement the solution to extract those scanned
>>> images from PDF, the indexing speed is now slower by almost more than 10
>>> times.
>>> 
>>> I'm using Solr 6.4.2, and Tika App 1.1.4.
>>> 
>>> Regards,
>>> Edwin
>>> Notice: This email and any attachments are confidential and may not be
>>> used, published or redistributed without the prior written consent of the
>>> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
>>> received in error please destroy and immediately notify GNS Science. Do not
>>> copy or disclose the contents.
>>> 
>> 
>> 



Re: Indexing speed reduced significantly with OCR

2017-03-28 Thread Zheng Lin Edwin Yeo
Hi,

Do you have any suggestions on how we can cope with the expensive process
of indexing documents which require OCR?

For my current situation, the indexing takes about 2 weeks to complete. If
the average indexing speed is, say, 50 times slower, it means it will
require 100 weeks to index the same amount of documents, which is not
viable. I have several terabytes of PDF documents to index for the actual
data, and many of them are scanned images, which require OCR.

Regards,
Edwin


On 28 March 2017 at 13:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Yes, the sample document sizes are not very big. And also, the sample
> documents have a mixture of documents that consists of inline images, and
> also documents which are searchable (text extractable without OCR)
>
> I suppose only those documents which requires OCR will slow down the
> indexing? Which is why the total average is only slowing down by 10 times.
>
> Regards,
> Edwin
>
>
> On 28 March 2017 at 12:06, Phil Scadden <p.scad...@gns.cri.nz> wrote:
>
>> Only by 10? You must have quite small documents. OCR is extremely
>> expensive process. Indexing is trivial by comparison. For quite large
>> documents I am working with OCR can be 100 times slower than indexing a PDF
>> that is searchable (text extractable without OCR).
>>
>> -Original Message-
>> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
>> Sent: Tuesday, 28 March 2017 4:13 p.m.
>> To: solr-user@lucene.apache.org
>> Subject: Indexing speed reduced significantly with OCR
>>
>> Hi,
>>
>> Does the indexing speed of Solr reduced significantly when we are using
>> Tesseract OCR to extract scanned inline images from PDF?
>>
>> I found that after I implement the solution to extract those scanned
>> images from PDF, the indexing speed is now slower by almost more than 10
>> times.
>>
>> I'm using Solr 6.4.2, and Tika App 1.1.4.
>>
>> Regards,
>> Edwin
>> Notice: This email and any attachments are confidential and may not be
>> used, published or redistributed without the prior written consent of the
>> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
>> received in error please destroy and immediately notify GNS Science. Do not
>> copy or disclose the contents.
>>
>
>


Re: Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Yes, the sample document sizes are not very big. Also, the sample set is a
mixture of documents that contain inline images and documents which are
searchable (text extractable without OCR).

I suppose only those documents which require OCR will slow down the
indexing? That would explain why the total average is only slowing down by 10 times.

Regards,
Edwin


On 28 March 2017 at 12:06, Phil Scadden <p.scad...@gns.cri.nz> wrote:

> Only by 10? You must have quite small documents. OCR is extremely
> expensive process. Indexing is trivial by comparison. For quite large
> documents I am working with OCR can be 100 times slower than indexing a PDF
> that is searchable (text extractable without OCR).
>
> -Original Message-
> From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
> Sent: Tuesday, 28 March 2017 4:13 p.m.
> To: solr-user@lucene.apache.org
> Subject: Indexing speed reduced significantly with OCR
>
> Hi,
>
> Does the indexing speed of Solr reduced significantly when we are using
> Tesseract OCR to extract scanned inline images from PDF?
>
> I found that after I implement the solution to extract those scanned
> images from PDF, the indexing speed is now slower by almost more than 10
> times.
>
> I'm using Solr 6.4.2, and Tika App 1.1.4.
>
> Regards,
> Edwin
> Notice: This email and any attachments are confidential and may not be
> used, published or redistributed without the prior written consent of the
> Institute of Geological and Nuclear Sciences Limited (GNS Science). If
> received in error please destroy and immediately notify GNS Science. Do not
> copy or disclose the contents.
>


RE: Indexing speed reduced significantly with OCR

2017-03-27 Thread Phil Scadden
Only by 10? You must have quite small documents. OCR is an extremely expensive
process. Indexing is trivial by comparison. For the quite large documents I am
working with, OCR can be 100 times slower than indexing a PDF that is searchable
(text extractable without OCR).

-Original Message-
From: Zheng Lin Edwin Yeo [mailto:edwinye...@gmail.com]
Sent: Tuesday, 28 March 2017 4:13 p.m.
To: solr-user@lucene.apache.org
Subject: Indexing speed reduced significantly with OCR

Hi,

Does the indexing speed of Solr reduced significantly when we are using 
Tesseract OCR to extract scanned inline images from PDF?

I found that after I implement the solution to extract those scanned images 
from PDF, the indexing speed is now slower by almost more than 10 times.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Indexing speed reduced significantly with OCR

2017-03-27 Thread Zheng Lin Edwin Yeo
Hi,

Does the indexing speed of Solr drop significantly when we are using
Tesseract OCR to extract scanned inline images from PDF?

I found that after I implemented the solution to extract those scanned images
from PDF, the indexing speed is now more than 10 times slower.

I'm using Solr 6.4.2, and Tika App 1.1.4.

Regards,
Edwin
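
For anyone moving Tika outside Solr, as later replies in this thread suggest,
inline-image OCR is typically switched on through PDFParserConfig and
TesseractOCRConfig, roughly as in this generic Tika sketch. It assumes
Tesseract is installed on the machine and is not the exact configuration used
in this setup:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrExtract {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();

    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);        // pull embedded images out of the PDF

    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(TesseractOCRConfig.class, new TesseractOCRConfig());
    context.set(Parser.class, parser);             // so the extracted images are parsed (OCRed) too

    BodyContentHandler handler = new BodyContentHandler(-1);   // -1 = no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
      parser.parse(in, handler, metadata, context);
    }
    System.out.println(handler.toString());
  }
}

The ExtractingRequestHandler has its own mechanism for passing a parse
context to Tika, which is not shown here.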


Re: Slow indexing speed when index size is large?

2016-10-16 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for the information.

Regards,
Edwin


On 14 October 2016 at 20:19, Shawn Heisey  wrote:

> On 10/13/2016 9:58 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for the reply Shawn. Currently, my heap allocation to each Solr
> > instance is 22GB. Is that big enough?
>
> I can't answer that question.  I know little about your install.  Even
> if I *did* know a few more things about your install, I could only make
> a *guess* about how much heap you need, and I'd probably be wrong.
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-
> the-abstract-why-we-dont-have-a-definitive-answer/
>
> I did write down what I consider to be a good way to figure out a
> correct heap size, but it requires experimentation with your live
> system, which might cause disruption of your search service:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_
> much_heap_space_do_I_need.3F
>
> Thanks,
> Shawn
>
>


Re: Slow indexing speed when index size is large?

2016-10-14 Thread Shawn Heisey
On 10/13/2016 9:58 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the reply Shawn. Currently, my heap allocation to each Solr
> instance is 22GB. Is that big enough? 

I can't answer that question.  I know little about your install.  Even
if I *did* know a few more things about your install, I could only make
a *guess* about how much heap you need, and I'd probably be wrong.

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

I did write down what I consider to be a good way to figure out a
correct heap size, but it requires experimentation with your live
system, which might cause disruption of your search service:

https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F

Thanks,
Shawn



Re: Slow indexing speed when index size is large?

2016-10-13 Thread Zheng Lin Edwin Yeo
Thanks for the reply Shawn.

Currently, my heap allocation to each Solr instance is 22GB.
Is that big enough?

Regards,
Edwin


On 13 October 2016 at 23:56, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> > Would like to find out, will the indexing speed in a collection with a
> > very large index size be much slower than one which is still empty or
> > a very small index size? This is assuming that the configurations,
> > indexing code and the files to be indexed are the same. Currently, I
> > have a setup in which the collection is still empty, and I managed to
> > achieve an indexing speed of more than 7GB/hr. I also have another
> > setup in which the collection has an index size of 1.6TB, and when I
> > tried to index new documents to it, the indexing speed is less than
> > 0.7GB/hr.
>
> I have noticed this phenomenon myself.  As the amount of index data
> already present increases, indexing slows down.  Best guess as to the
> cause: more frequent and longer-lasting garbage collections.
>
> Indexing involves a LOT of memory allocation.  Most of the memory chunks
> that get allocated are quickly discarded because they do not need to be
> retained.
>
> If you understand how the Java memory model works, then you know that
> this means there will be a lot of garbage collection.  Each GC will tend
> to take longer if there are a large number of objects allocated that are
> NOT garbage.
>
> When the index is large, Lucene/Solr must allocate and retain a larger
> amount of memory just to ensure that everything works properly.  This
> leaves less free memory, so indexing will cause more frequent garbage
> collections ... and because the amount of retained memory is
> correspondingly larger, each garbage collection will take longer than it
> would with a smaller index.  A ten to one difference in speed does seem
> extreme, though.
>
> You might want to increase the heap allocated to each Solr instance, so
> GC is less frequent.  This can take memory away from the OS disk cache,
> though.  If the amount of OS disk cache drops too low, general
> performance may suffer.
>
> Thanks,
> Shawn
>
>


Re: Slow indexing speed when index size is large?

2016-10-13 Thread Shawn Heisey
On 10/13/2016 9:20 AM, Zheng Lin Edwin Yeo wrote:
> Would like to find out, will the indexing speed in a collection with a
> very large index size be much slower than one which is still empty or
> a very small index size? This is assuming that the configurations,
> indexing code and the files to be indexed are the same. Currently, I
> have a setup in which the collection is still empty, and I managed to
> achieve an indexing speed of more than 7GB/hr. I also have another
> setup in which the collection has an index size of 1.6TB, and when I
> tried to index new documents to it, the indexing speed is less than
> 0.7GB/hr. 

I have noticed this phenomenon myself.  As the amount of index data
already present increases, indexing slows down.  Best guess as to the
cause: more frequent and longer-lasting garbage collections.

Indexing involves a LOT of memory allocation.  Most of the memory chunks
that get allocated are quickly discarded because they do not need to be
retained.

If you understand how the Java memory model works, then you know that
this means there will be a lot of garbage collection.  Each GC will tend
to take longer if there are a large number of objects allocated that are
NOT garbage.

When the index is large, Lucene/Solr must allocate and retain a larger
amount of memory just to ensure that everything works properly.  This
leaves less free memory, so indexing will cause more frequent garbage
collections ... and because the amount of retained memory is
correspondingly larger, each garbage collection will take longer than it
would with a smaller index.  A ten to one difference in speed does seem
extreme, though.

You might want to increase the heap allocated to each Solr instance, so
GC is less frequent.  This can take memory away from the OS disk cache,
though.  If the amount of OS disk cache drops too low, general
performance may suffer.

Thanks,
Shawn



Slow indexing speed when index size is large?

2016-10-13 Thread Zheng Lin Edwin Yeo
Hi,

Would like to find out: will the indexing speed in a collection with a very
large index size be much slower than in one which is still empty or has a
very small index size? This is assuming that the configurations, indexing
code and the files to be indexed are the same.

Currently, I have a setup in which the collection is still empty, and I
managed to achieve an indexing speed of more than 7GB/hr. I also have
another setup in which the collection has an index size of 1.6TB, and when
I tried to index new documents to it, the indexing speed is less than
0.7GB/hr.

This setup was done with Solr 5.4.0

Regards,
Edwin


Re: Does EML files with inline images affect the indexing speed

2016-05-03 Thread Zheng Lin Edwin Yeo
Yes, it should be, as it is the Tika extract handler that does the extraction
of the content for indexing.

Thank you.

Regards,
Edwin


On 3 May 2016 at 19:12, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> This is an extract handler, right?
>
> If so, this is a question better for the Apache Tina list. That's what
> doing the parsing.
>
> Regards,
> Alex
> On 3 May 2016 7:53 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to find out, if the presence of inline images in EML files
> > will slow down the indexing speed significantly?
> >
> > Even though the content of the EML files are in Plain Text instead of
> HTML.
> > but I still found that the indexing performance is not up to expectation
> > yet. Average speed which I'm getting are around 0.3GB/hr.
> >
> > I'm using Solr 5.4.0 on SolrCloud.
> >
> > Regards,
> > Edwin
> >
>


Re: Does EML files with inline images affect the indexing speed

2016-05-03 Thread Alexandre Rafalovitch
This is an extract handler, right?

If so, this is a question better suited for the Apache Tika list. That's what
is doing the parsing.

Regards,
Alex
On 3 May 2016 7:53 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com> wrote:

> Hi,
>
> I would like to find out, if the presence of inline images in EML files
> will slow down the indexing speed significantly?
>
> Even though the content of the EML files are in Plain Text instead of HTML.
> but I still found that the indexing performance is not up to expectation
> yet. Average speed which I'm getting are around 0.3GB/hr.
>
> I'm using Solr 5.4.0 on SolrCloud.
>
> Regards,
> Edwin
>


Does EML files with inline images affect the indexing speed

2016-05-03 Thread Zheng Lin Edwin Yeo
Hi,

I would like to find out if the presence of inline images in EML files
will slow down the indexing speed significantly.

Even though the content of the EML files is in plain text instead of HTML,
I still find that the indexing performance is not up to expectation
yet. The average speed which I'm getting is around 0.3GB/hr.

I'm using Solr 5.4.0 on SolrCloud.

Regards,
Edwin


Re: Optimal indexing speed in Solr

2016-04-14 Thread John Bickerstaff
Stupid phone autocorrect...

If you add updated documents of the same ID over time, optimizing your
collection(s) may help.

On Thu, Apr 14, 2016 at 7:50 AM, John Bickerstaff <j...@johnbickerstaff.com>
wrote:

> If you delete a lot of documents over time, or if you add updated
> documents of the same I'd over time, optimizing your collection(s) may help.
> On Apr 14, 2016 3:52 AM, "Emir Arnautovic" <emir.arnauto...@sematext.com>
> wrote:
>
>> Hi Edwin,
>> Indexing speed depends on multiple factors: HW, Solr configurations and
>> load, documents, indexing client: More complex documents, more CPU time to
>> process each document before indexing structure is written down to disk.
>> Bigger the index, more heap is used, more frequent GCs. Maybe you are just
>> not sending enough doc to Solr to have such throughput.
>> The best way to pinpoint bottleneck is to use some monitoring tool. One
>> such tool is our SPM (http://sematext.com/spm) - it allows you to
>> monitor both Solr and OS metrics.
>>
>> HTH,
>> Emir
>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On 14.04.2016 05:29, Zheng Lin Edwin Yeo wrote:
>>
>>> Hi,
>>>
>>> Would like to find out, what is the optimal indexing speed in Solr?
>>>
>>> Previously, I managed to get more than 3GB/hour, but now the speed has
>>> drop
>>> to 0.7GB/hr. What could be the potential reason behind this?
>>>
>>> Besides the index size getting bigger, I have only added in more
>>> collections into the core and added another field. Other than that
>>> nothing
>>> else has been changed..
>>>
>>> Could the source file which I'm indexing made a difference in the
>>> indexing
>>> speed?
>>>
>>> I'm using Solr 5.4.0 for now, but will be planning to migrate to Solr
>>> 6.0.0.
>>>
>>> Regards,
>>> Edwin
>>>
>>>
>>
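
A minimal illustration of the optimize call mentioned above; the base URL and
collection name are placeholders. Since optimize rewrites the index segments,
it is usually scheduled during off-peak hours:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class OptimizeCollection {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Merges segments and expunges deleted documents for the named collection/core.
      solr.optimize("collection1");
    }
  }
}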


Re: Optimal indexing speed in Solr

2016-04-14 Thread John Bickerstaff
If you delete a lot of documents over time, or if you add updated documents
of the same I'd over time, optimizing your collection(s) may help.
On Apr 14, 2016 3:52 AM, "Emir Arnautovic" <emir.arnauto...@sematext.com>
wrote:

> Hi Edwin,
> Indexing speed depends on multiple factors: HW, Solr configurations and
> load, documents, indexing client: More complex documents, more CPU time to
> process each document before indexing structure is written down to disk.
> Bigger the index, more heap is used, more frequent GCs. Maybe you are just
> not sending enough doc to Solr to have such throughput.
> The best way to pinpoint bottleneck is to use some monitoring tool. One
> such tool is our SPM (http://sematext.com/spm) - it allows you to monitor
> both Solr and OS metrics.
>
> HTH,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On 14.04.2016 05:29, Zheng Lin Edwin Yeo wrote:
>
>> Hi,
>>
>> Would like to find out, what is the optimal indexing speed in Solr?
>>
>> Previously, I managed to get more than 3GB/hour, but now the speed has
>> drop
>> to 0.7GB/hr. What could be the potential reason behind this?
>>
>> Besides the index size getting bigger, I have only added in more
>> collections into the core and added another field. Other than that nothing
>> else has been changed..
>>
>> Could the source file which I'm indexing made a difference in the indexing
>> speed?
>>
>> I'm using Solr 5.4.0 for now, but will be planning to migrate to Solr
>> 6.0.0.
>>
>> Regards,
>> Edwin
>>
>>
>


Re: Optimal indexing speed in Solr

2016-04-14 Thread Emir Arnautovic

Hi Edwin,
Indexing speed depends on multiple factors: HW, Solr configuration and
load, the documents, and the indexing client. More complex documents mean
more CPU time to process each document before the index structure is
written down to disk. The bigger the index, the more heap is used and the
more frequent GCs become. Maybe you are just not sending enough docs to
Solr to reach such throughput.
The best way to pinpoint the bottleneck is to use some monitoring tool. One
such tool is our SPM (http://sematext.com/spm) - it allows you to 
monitor both Solr and OS metrics.


HTH,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 14.04.2016 05:29, Zheng Lin Edwin Yeo wrote:

Hi,

Would like to find out, what is the optimal indexing speed in Solr?

Previously, I managed to get more than 3GB/hour, but now the speed has drop
to 0.7GB/hr. What could be the potential reason behind this?

Besides the index size getting bigger, I have only added in more
collections into the core and added another field. Other than that nothing
else has been changed..

Could the source file which I'm indexing made a difference in the indexing
speed?

I'm using Solr 5.4.0 for now, but will be planning to migrate to Solr 6.0.0.

Regards,
Edwin





Optimal indexing speed in Solr

2016-04-13 Thread Zheng Lin Edwin Yeo
Hi,

Would like to find out, what is the optimal indexing speed in Solr?

Previously, I managed to get more than 3GB/hour, but now the speed has dropped
to 0.7GB/hr. What could be the potential reason behind this?

Besides the index size getting bigger, I have only added more
collections into the core and added another field. Other than that nothing
else has been changed.

Could the source files which I'm indexing make a difference in the indexing
speed?

I'm using Solr 5.4.0 for now, but will be planning to migrate to Solr 6.0.0.

Regards,
Edwin


Re: Single-sharded SolrCloud vs Lucene indexing speed

2015-11-29 Thread Erick Erickson
Of course Lucene will be faster in all cases when replicas are
present. Solr is built on Lucene so any overhead at all that Solr adds
will cause the total round-trip to be slower.

Lucene doesn't have to concern itself with distributing updates to
replicas for instance as happens in your first two cases. The raw
overhead imposed by Solr is probably your third case.

Yes, slowest replica determines indexing speed. To guarantee data
isn't lost, the process is:
> leader receives updates.
> leader indexes locally _and_ forwards docs to follower
> follower acks back to leader when the docs are written to tlog (at least).
> leader acks back to client.

If it were otherwise, the follower couldn't guarantee that it had all
updates, so that's an early design decision.

If the slowest replica is the leader... Hmmm, forwarding updates to
the followers is done in parallel, but there is some additional work
done on the leader that the follower doesn't have to do, possibly this
is what you're seeing?

Solr will scale nearly linearly with additional shards. SolrJ
(assuming you're using CloudSolrClient) routes documents up-front so
you get a significant amount of parallelization. Of course this won't
be true if you only index one doc at a time, single-threaded.

Best,
Erick
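
For illustration, a minimal SolrJ sketch of the batched CloudSolrClient
indexing described above; the ZooKeeper address, collection name and field
names are placeholders, and several threads can share the same client
instance for more parallelism:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            client.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("text", "document body " + i);
                batch.add(doc);

                if (batch.size() == 1000) {   // send 1,000 docs per request
                    client.add(batch);        // docs are routed to the right shard leaders up-front
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();                  // one commit at the end, not per batch
            client.close();
        }
    }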


On Sat, Nov 28, 2015 at 10:58 AM, Zisis Tachtsidis <zist...@runbox.com> wrote:
> I'm conducting some indexing experiments in SolrCloud and I want to confirm
> my conclusions and ask for suggestions on how to improve performance.
>
> My setup includes a single-sharded collection with 1 additional replica in
> SolrCloud 5.3.1. I'm using SolrJ and the indexing speed refers to the actual
> SolrJ call that adds the document. I've run some indexing tests and it seems
> that Lucene indexing is equal to or better than Solr's in all cases. In all
> cases the same documents are sent to both Lucene and the same analysis
> is performed on the documents.
>
> - 2 replicas, leader is a replica on a machine under heavy load => ~3x
> slower than Lucene.
> - 2 replicas, leader is a replica on a machine under light load => ~2x
> slower than Lucene.
> - 1 replica on a machine under light load => indexing speed similar to
> Lucene.
>
> Conclusions
> (*) It seems that the slowest replica determines the indexing speed.
> (*) It gets even worse if the slowest replica is the leader. This is
> justified if it's true that only after the leader finishes indexing it
> forwards the request to the remaining replicas.
>
> Regarding improvements
> (*) I'm indexing pretty big documents 0.5MB<DocSize<1MB so batch updates do
> not offer significant performance gain.
> (*) Can I see improvement if I use a multi-sharded collection?
>
> Thanks
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Single-sharded-SolrCloud-vs-Lucene-indexing-speed-tp4242568.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Single-sharded SolrCloud vs Lucene indexing speed

2015-11-28 Thread Zisis Tachtsidis
I'm conducting some indexing experiments in SolrCloud and I want to confirm
my conclusions and ask for suggestions on how to improve performance.

My setup includes a single-sharded collection with 1 additional replica in
SolrCloud 5.3.1. I'm using SolrJ and the indexing speed refers to the actual
SolrJ call that adds the document. I've run some indexing tests and it seems
that Lucene indexing is equal to or better than Solr's in all cases. In all
cases the same documents are sent to both Lucene and Solr, and the same analysis
is performed on the documents. 

- 2 replicas, leader is a replica on a machine under heavy load => ~3x
slower than Lucene.
- 2 replicas, leader is a replica on a machine under light load => ~2x
slower than Lucene.
- 1 replica on a machine under light load => indexing speed similar to
Lucene.

Conclusions
(*) It seems that the slowest replica determines the indexing speed. 
(*) It gets even worse if the slowest replica is the leader. This is
justified if it's true that only after the leader finishes indexing it
forwards the request to the remaining replicas.

Regarding improvements
(*) I'm indexing pretty big documents 0.5MB<DocSize<1MB so batch updates do
not offer significant performance gain. 
(*) Can I see improvement if I use a multi-sharded collection?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Single-sharded-SolrCloud-vs-Lucene-indexing-speed-tp4242568.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow Indexing speed for csv files, multi-threaded indexing

2013-11-07 Thread Erick Erickson
Vikram:

An experiment I've found useful: Just comment out the
server.add() bit and run it. That won't index anything, but if
that's also slow then your problem is acquiring the data and
you know where to concentrate your efforts. I've seen this
be the problem with slow indexing more often than not actually.


Here's another thing to try: do it locally. Just spin up
a small Solr instance on your workstation and try your
test. My guess is you'll see vastly improved performance
in which case we're talking network latency here.

Alternatively, you can monitor your CPU utilization on
your ec2 instances and see if you're using it heavily. I
suspect you'll see you're not really exercising Solr, the
bottleneck is the network transmission or some such.

Your point 3 is a bit puzzling. CUSS threads and queue
size are really about network I/O. The idea here is that the
multiple threads are trying to simultaneously send packets
to Solr. Are you batching up documents you're sending
or sending them one at a time? I.e. use the server.add(doclist)
rather than the server.add(doc). What happens if you send, say
1,000 docs at a time?

Best,
Erick
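
To make the add(doclist) suggestion concrete, here is a rough SolrJ 4.x
sketch that reads a CSV of "word1 word2, frequency" tuples and sends them to
ConcurrentUpdateSolrServer in batches of 1,000; the URL, queue size, thread
count and the extra "id" field are assumptions, not taken from Vikram's setup:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CsvBatchIndexer {
        public static void main(String[] args) throws Exception {
            ConcurrentUpdateSolrServer server =
                    new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 10000, 4);

            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            String line;
            long id = 0;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");           // e.g. "blue sky, 2500"
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(id++));   // assumes a uniqueKey field
                doc.addField("word", cols[0].trim());
                doc.addField("frequency", Long.parseLong(cols[1].trim()));
                batch.add(doc);
                if (batch.size() == 1000) {
                    server.add(batch);                     // one request per 1,000 docs
                    batch = new ArrayList<SolrInputDocument>();
                }
            }
            in.close();
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.blockUntilFinished();                   // wait for the queued requests
            server.commit();                               // single commit at the end
            server.shutdown();
        }
    }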



On Tue, Nov 5, 2013 at 12:45 AM, Vikram Srinivasan 
vikram.sriniva...@zettata.com wrote:

 Hello,

   I know this has been discussed extensively in past posts. I have tried a
 bunch of suggestions and I still have a few questions.

  I am using solr4.4 from tomcat 7. I am using openjdk1.7 and I am using 1
 solr core
  I am trying to index a bunch of csv files (total size 13GB). Each csv file
 contains a long list of tuples - ( word1 word2, frequency) as shown below.
 (bigram frequencies)

 E.g: blue sky, 2500
green grass, 300

 My schema.xml is as  simple as can be: I am trying to index these two
 fields of type string and long and do not use any tokenizer or analyzer
 factories as shown below.


 <fields>
   <field name="_version_" type="long" indexed="true" stored="true"
          multiValued="false" omitNorms="true" />
   <field name="word" type="string" indexed="true"
          stored="true" multiValued="false" omitNorms="true" />
   <field name="frequency" type="long" indexed="true" stored="true"
          multiValued="false" omitNorms="true" />
 </fields>

 In my solrconfig.xml:

 My rambuffer size is 100MB, merge factor is 10, maxIndexingThreads is 8.

 I am using solrj and concurrentupdatesolrserver (CUSS) to index. I have set
 the queue size to 1 and number of threads to 10 and javabin format.

 I run my solrj instance by providing the path to the directory where the
 csv files are stored.

 I start one instance of CUSS and have multiple threads reading from the
 various files simultaneously and writing into the CUSS threads
 simutaneously. I do a commit only after all the records have been indexed.
 Also my autocommit values for number of documents and commit time are set
 to very large numbers.

 I have tried indexing a test set of csv files which contains 1.44M records
 (total size 21MB).  All my tests have been on different types of Amazon ec2
 instances - e.g. m1.xlarge (4vCPU, 15GB RAM) and m3.2xlarge(8vCPU, 30GB
 RAM).

 I have set my jvm heap size large enough and tuned gc parameters as seen on
 various forums.

 Observations:

 1. My indexing speed for 1.44M records (or row in CSV file) is 240s on the
 m1.xlarge instance and 160s on the m3.2xlarge instance.
 2. The indexing speed is independent of whether I have one large file with
 1.44M rows or 2 files with 720K rows each.
 3. My indexing speed is independent of the number of threads and queue size
 I specify for CUSS. I have kept set these parameters as low as 1 for both
 queue size and number of threads with no difference..
 4. My indexing speed is independent of merge factor, rambuffer and number
 of indexing threads. I've tried various settings.
 5. It appears that I am not really indexing my files in parallel if I use a
 single solr core. Is this not possible? What exactly does maxindexthreads
 in solrconfig control?
 6. My concern is that my indexing speed is way slower than what I've seen
 claimed on various forums (e.g., 29GB wikipedia in 13 minutes, 50GB in 39
 minutes etc.) even with a single solr core.

 What am I doing wrong? How do I speed up my indexing? Any suggestions will
 be appreciated.

 Thanks,
 Vikram



Slow Indexing speed for csv files, multi-threaded indexing

2013-11-04 Thread Vikram Srinivasan
Hello,

  I know this has been discussed extensively in past posts. I have tried a
bunch of suggestions and I still have a few questions.

 I am using solr4.4 from tomcat 7. I am using openjdk1.7 and I am using 1
solr core
 I am trying to index a bunch of csv files (total size 13GB). Each csv file
contains a long list of tuples - ( word1 word2, frequency) as shown below.
(bigram frequencies)

E.g: blue sky, 2500
   green grass, 300

My schema.xml is as  simple as can be: I am trying to index these two
fields of type string and long and do not use any tokenizer or analyzer
factories as shown below.


<fields>
  <field name="_version_" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
  <field name="word" type="string" indexed="true"
         stored="true" multiValued="false" omitNorms="true" />
  <field name="frequency" type="long" indexed="true" stored="true"
         multiValued="false" omitNorms="true" />
</fields>

In my solrconfig.xml:

My rambuffer size is 100MB, merge factor is 10, maxIndexingThreads is 8.

I am using solrj and concurrentupdatesolrserver (CUSS) to index. I have set
the queue size to 1 and number of threads to 10 and javabin format.

I run my solrj instance by providing the path to the directory where the
csv files are stored.

I start one instance of CUSS and have multiple threads reading from the
various files simultaneously and writing into the CUSS threads
simultaneously. I do a commit only after all the records have been indexed.
Also my autocommit values for number of documents and commit time are set
to very large numbers.

I have tried indexing a test set of csv files which contains 1.44M records
(total size 21MB).  All my tests have been on different types of Amazon ec2
instances - e.g. m1.xlarge (4vCPU, 15GB RAM) and m3.2xlarge(8vCPU, 30GB
RAM).

I have set my jvm heap size large enough and tuned gc parameters as seen on
various forums.

Observations:

1. My indexing speed for 1.44M records (or row in CSV file) is 240s on the
m1.xlarge instance and 160s on the m3.2xlarge instance.
2. The indexing speed is independent of whether I have one large file with
1.44M rows or 2 files with 720K rows each.
3. My indexing speed is independent of the number of threads and queue size
I specify for CUSS. I have set these parameters as low as 1 for both
queue size and number of threads with no difference.
4. My indexing speed is independent of merge factor, rambuffer and number
of indexing threads. I've tried various settings.
5. It appears that I am not really indexing my files in parallel if I use a
single solr core. Is this not possible? What exactly does maxindexthreads
in solrconfig control?
6. My concern is that my indexing speed is way slower than what I've seen
claimed on various forums (e.g., 29GB wikipedia in 13 minutes, 50GB in 39
minutes etc.) even with a single solr core.

What am I doing wrong? How do I speed up my indexing? Any suggestions will
be appreciated.

Thanks,
Vikram


howto increase indexing speed?

2013-10-16 Thread Giovanni Bricconi
I have a small solr setup, not even on a physical machine but a vmware
virtual machine with a single cpu that reads data using DIH from a
database. The machine has no physical disks attached but stores data on a
netapp nas.

Currently this machine indexes 320 documents/sec, not bad but we plan to
double the index and we would like to keep nearly the same.

Doing some basic checks during the indexing I have found with iostat that
the usage of the disks is nearly 8% and the source database is running
fine; the virtual cpu, however, is at 95% running solr.

Now I can quite easily add another virtual cpu to the solr box, but as far
as I know this won't help because DIH doesn't work in parallel. Am I wrong?

What would you do? Rewrite the feeding process quitting dih and using solrj
to feed data in parallel? Would you instead keep DIH and switch to a
sharded configuration?

Thank you for any hints

Giovanni


Re: howto increase indexing speed?

2013-10-16 Thread primoz . skale
I think DIH uses only one core per instance. IMHO 300 doc/sec is quite 
good. If you would like to use more cores you need to use solrj. Or maybe 
more than one DIH and more cores of course.

Primoz
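
As a rough sketch of what "use solrj to feed data in parallel" can look like
(the URL, thread count and document fields are made up; HttpSolrServer is
thread-safe, so one instance can be shared by all feeder threads):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelFeeder {
        public static void main(String[] args) throws Exception {
            final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            final int threads = 4;
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            for (int t = 0; t < threads; t++) {
                final int slice = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                            // each thread reads its own slice of the source data (placeholder loop)
                            for (int i = slice; i < 1000000; i += threads) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", Integer.toString(i));
                                batch.add(doc);
                                if (batch.size() == 1000) {
                                    server.add(batch);
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) {
                                server.add(batch);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.commit();
            server.shutdown();
        }
    }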



From:   Giovanni Bricconi giovanni.bricc...@banzai.it
To: solr-user solr-user@lucene.apache.org
Date:   16.10.2013 16:25
Subject:howto increase indexing speed?



I have a small solr setup, not even on a physical machine but a vmware
virtual machine with a single cpu that reads data using DIH from a
database. The machine has no phisical disks attached but stores data on a
netapp nas.

Currently this machine indexes 320 documents/sec, not bad but we plan to
double the index and we would like to keep nearly the same.

Doing some basic checks during the indexing I have found with iostat that
the usage of the disks is nearly 8% and the source database is running
fine, instead the  virtual cpu is 95% running on solr.

Now I can quite easily add another virtual cpu to the solr box, but as far
as I know this won't help because DIH doesn't work in parallel. Am I 
wrong?

What would you do? Rewrite the feeding process quitting dih and using 
solrj
to feed data in parallel? Would you instead keep DIH and switch to a
sharded configuration?

Thank you for any hints

Giovanni



Re: howto increase indexing speed?

2013-10-16 Thread Walter Underwood
You might consider local disks. I once ran Solr with the indexes on an 
NFS-mounted volume and the slowdown was severe.

wunder

On Oct 16, 2013, at 7:40 AM, primoz.sk...@policija.si wrote:

 I think DIH uses only one core per instance. IMHO 300 doc/sec is quite 
 good. If you would like to use more cores you need to use solrj. Or maybe 
 more than one DIH and more cores of course.
 
 Primoz
 
 
 
 From:   Giovanni Bricconi giovanni.bricc...@banzai.it
 To: solr-user solr-user@lucene.apache.org
 Date:   16.10.2013 16:25
 Subject:howto increase indexing speed?
 
 
 
 I have a small solr setup, not even on a physical machine but a vmware
 virtual machine with a single cpu that reads data using DIH from a
 database. The machine has no phisical disks attached but stores data on a
 netapp nas.
 
 Currently this machine indexes 320 documents/sec, not bad but we plan to
 double the index and we would like to keep nearly the same.
 
 Doing some basic checks during the indexing I have found with iostat that
 the usage of the disks is nearly 8% and the source database is running
 fine, instead the  virtual cpu is 95% running on solr.
 
 Now I can quite easily add another virtual cpu to the solr box, but as far
 as I know this won't help because DIH doesn't work in parallel. Am I 
 wrong?
 
 What would you do? Rewrite the feeding process quitting dih and using 
 solrj
 to feed data in parallel? Would you instead keep DIH and switch to a
 sharded configuration?
 
 Thank you for any hints
 
 Giovanni
 

--
Walter Underwood
wun...@wunderwood.org





Re: Storing/indexing speed drops quickly

2013-09-23 Thread Per Steffensen
Now running the tests on a slightly reduced setup (2 machines, quadcore, 
8GB ram ...), but that doesn't matter


We see that storing/indexing speed drops when using 
IndexWriter.updateDocument in DirectUpdateHandler2.addDoc. But it does 
not drop when just using IndexWriter.addDocument (update-requests with 
overwrite=false)
Using addDocument: 
https://dl.dropboxusercontent.com/u/25718039/AddDocument_2Solr8GB_DocCount.png
Using updateDocument: 
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png
We are not too happy about having to use addDocument, because that 
allows for duplicates, and we would really want to avoid that (on 
Solr/Lucene level)


We have confirmed that doubling amount of total RAM will double the 
amount of documents in the index where the indexing-speed starts 
dropping (when we use updateDocument)
On 
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png 
you can see that the speed drops at around 120M documents. Running the 
same test, but with Solr machine having 16GB RAM (instead of 8GB) the 
speed drops at around 240M documents.


Any comments on why indexing speed drops with IndexWriter.updateDocument 
but not with IndexWriter.addDocument?


Regards, Per Steffensen
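
For reference, a rough SolrJ sketch of sending updates with overwrite=false
(which is what puts Solr on the IndexWriter.addDocument path mentioned above);
the URL and fields are placeholders, and the client then has to guarantee
uniqueness itself, since existing documents are no longer overwritten:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class AddWithoutOverwrite {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "some-unique-id");   // uniqueness must be guaranteed by the client
            doc.addField("text", "example body");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // overwrite=false skips the delete-by-id handling of updateDocument,
            // so documents with the same id are NOT replaced (duplicates possible)
            req.setParam("overwrite", "false");
            req.process(server);

            server.commit();
            server.shutdown();
        }
    }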

On 9/12/13 10:14 AM, Per Steffensen wrote:

Seems like the attachments didnt make it through to this mailing list

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png


On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will 
end up with lots and lots of small files, and I guess this is not 
good for search response-time)


Regards, Per Steffensen







Re: Storing/indexing speed drops quickly

2013-09-16 Thread Toke Eskildsen
On Fri, 2013-09-13 at 17:32 +0200, Shawn Heisey wrote:
 Put your OS and Solr itself on regular disks in RAID1 and your Solr data 
 on the SSD.  Due to the eventual decay caused by writes, SSD will 
 eventually die, so be ready for SSD failures to take out shard replicas. 

One of the very useful properties of wear-levelling on SSD's is the wear
status of the drive can be queried. When the drive nears its EOL,
replace it.

As Lucene mainly uses bulk writes when updating the index, I will add
that wearing out an SSD by using it primarily for Lucene/Solr is pretty
hard to do, unless one constructs a pathological setup.

Your failure argument is thus really a claim that SSDs are not reliable
technology. That is a fair argument as there has been some really rotten
apples among the offerings. This is coupled with the fact that it is
still a very rapidly changing technology, which makes it hard to pick an
older proven drive that is not markedly surpassed by the bleeding edge.

 So far I'm not aware of any RAID solutions that offer TRIM support, 
 and without TRIM support, an SSD eventually has performance problems. 

Search speed is not affected as only write performance suffers without
trim, but index update speed will be affected. Also, while it is
possible to get TRIM in RAID, there is currently only a single hardware
option:

http://www.anandtech.com/show/6161/intel-brings-trim-to-raid0-ssd-arrays-on-7series-motherboards-we-test-it

Regards,
- Toke Eskildsen, State and University Library, Denmark




Re: Storing/indexing speed drops quickly

2013-09-13 Thread Per Steffensen

On 9/12/13 4:26 PM, Shawn Heisey wrote:

On 9/12/2013 2:14 AM, Per Steffensen wrote:

Starting from an empty collection. Things are fine wrt
storing/indexing speed for the first two-three hours (100M docs per
hour), then speed goes down dramatically, to an, for us, unacceptable
level (max 10M per hour). At the same time as speed goes down, we see
that I/O wait increases dramatically. I am not 100% sure, but quick
investigation has shown that this is due to almost constant merging.

While constant merging is contributing to the slowdown, I would guess
that your index is simply too big for the amount of RAM that you have.
Let's ignore for a minute that you're distributed and just concentrate
on one machine.

After three hours of indexing, you have nearly 300 million documents.
If you have a replicationFactor of 1, that's still 50 million documents
per machine.  If your replicationFactor is 2, you've got 100 million
documents per machine.  Let's focus on the smaller number for a minute.
replicationFactor is 1, so that is about 50 million docs per machine at 
this point


50 million documents in an index, even if they are small documents, is
probably going to result in an index size of at least 20GB, and quite
possibly larger.  In order to make Solr function with that many
documents, I would guess that you have a heap that's at least 4GB in size.
Currently I have a 2.5GB heap on the 8GB machine - to leave something for 
the OS cache


With only 8GB on the machine, this doesn't leave much RAM for the OS
disk cache.  If we assume that you have 4GB left for caching, then I
would expect to see problems about the time your per-machine indexes hit
15GB in size.  If you are making it beyond that with a total of 300
million documents, then I am impressed.

Two things are going to happen when you have enough documents:  1) You
are going to fill up your Java heap and Java will need to do frequent
collections to free up enough RAM for normal operation.  When this
problem gets bad enough, the frequent collections will be *full* GCs,
which are REALLY slow.
What is it that will fill my heap? I am trying to avoid the FieldCache. 
For now, I am actually not doing any searches - focus on indexing for 
now - and certainly not group/facet/sort searches that will use the 
FieldCache.

   2) The index will be so big that the OS disk
cache cannot effectively cache it.  I suspect that the latter is more of
the problem, but both might be happening at nearly the same time.




When dealing with an index of this size, you want as much RAM as you can
possibly afford.  I don't think I would try what you are doing without
at least 64GB per machine, and I would probably use at least an 8GB heap
on each one, quite possibly larger.  With a heap that large, extreme GC
tuning becomes a necessity.
More RAM will probably help, but only for a while. I want billions of 
documents in my collections - and also on each machine. Currently we are 
aiming 15 billion documents per month (500 million per day) and keep at 
least two years of data in the system. Currently we use one collection 
for each month, so when the system has been running for two years it 
will be 24 collections with 15 billion documents each. Indexing will 
only go on in the collection corresponding to the current month, but 
searching will (potentially) be across all 24 collections. The documents 
are very small. I know that 6 machines will not do in the long run - 
currently this is only testing - but number of machines should not be 
higher than about 20-40. In general it is a problem if Solr/Lucene will 
not perform fairly well if data does not fit RAM - then it cannot really 
be used for big data. I would have to buy hundreds or even thousands 
of machines with 64GB+ RAM. That is not realistic.


To cut down on the amount of merging, I go with a fairly large
mergeFactor, but mergeFactor is basically deprecated for
TieredMergePolicy, there's a new way to configure it now.  Here's the
indexConfig settings that I use on my dev server:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>

Thanks,
Shawn



Thanks!


Re: Storing/indexing speed drops quickly

2013-09-13 Thread Shawn Heisey

On 9/13/2013 12:03 AM, Per Steffensen wrote:

What is it that will fill my heap? I am trying to avoid the FieldCache.
For now, I am actually not doing any searches - focus on indexing for
now - and certainly not group/facet/sort searches that will use the
FieldCache.


I don't know what makes up the heap when you have lots of documents.  I 
am not really using any RAM hungry features and I wouldn't be able to 
get away with a 4GB heap on my Solr servers.  Uncollectable (and 
collectable) RAM usage is heaviest during indexing.  I sort on one or 
two fields and we don't use facets.


Here's a screenshot of my index status page showing how big my indexes 
are on each machine, it's a couple of months old now.  These machines 
have a 6GB heap, and I don't dare make it any smaller, or I'll get OOM 
errors during indexing.  They have 64GB total RAM.


https://dl.dropboxusercontent.com/u/97770508/statuspagescreenshot.png


More RAM will probably help, but only for a while. I want billions of
documents in my collections - and also on each machine. Currently we are
aiming 15 billion documents per month (500 million per day) and keep at
least two years of data in the system. Currently we use one collection
for each month, so when the system has been running for two years it
will be 24 collections with 15 billion documents each. Indexing will
only go on in the collection corresponding to the current month, but
searching will (potentially) be across all 24 collections. The documents
are very small. I know that 6 machines will not do in the long run -
currently this is only testing - but number of machines should not be
higher than about 20-40. In general it is a problem if Solr/Lucene will
not perform fairly well if data does not fit RAM - then it cannot really
be used for big data. I would have to buy hundreds or even thousands
of machines with 64GB+ RAM. That is not realistic.


To lower your overall RAM requirements, use SSD, and store as little 
data as possible - only the id used to retrieve data from another 
source, ideally.  That will lower your RAM requirements.  You'll 
probably still want 10-25% of your index size for the disk cache.  With 
regular disks, that's 50-100%.


Put your OS and Solr itself on regular disks in RAID1 and your Solr data 
on the SSD.  Due to the eventual decay caused by writes, SSD will 
eventually die, so be ready for SSD failures to take out shard replicas. 
 So far I'm not aware of any RAID solutions that offer TRIM support, 
and without TRIM support, an SSD eventually has performance problems. 
Without RAID, a failure will take out that replica.  That's one of the 
points of SolrCloud - having replicas so single failures don't bring 
down your index.


If you can't use SSD or get tons of RAM, you're going to have 
performance problems.  Solr (and any other Lucene-based search product) 
does really well with super-large indexes if you have the system 
resources available.  If you don't, it sucks.


Thanks,
Shawn



Storing/indexing speed drops quickly

2013-09-12 Thread Per Steffensen

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node on 
each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread one 
doc at the time, full speed (they always have a new doc to store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt storing/indexing 
speed for the first two-three hours (100M docs per hour), then speed 
goes down dramatically, to an, for us, unacceptable level (max 10M per 
hour). At the same time as speed goes down, we see that I/O wait 
increases dramatically. I am not 100% sure, but quick investigation has 
shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests show that this really does not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of small files, and I guess this is not good for 
search response-time)


Regards, Per Steffensen


Re: Storing/indexing speed drops quickly

2013-09-12 Thread Per Steffensen
Maybe the fact that we are never ever going to delete or update 
documents can be used for something. If we delete, we will delete entire 
collections.


Regards, Per Steffensen

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of small files, and I guess this is not good 
for search response-time)


Regards, Per Steffensen




Re: Storing/indexing speed drops quickly

2013-09-12 Thread Per Steffensen

Seems like the attachments didn't make it through to this mailing list

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png


On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of small files, and I guess this is not good 
for search response-time)


Regards, Per Steffensen




Re: Storing/indexing speed drops quickly

2013-09-12 Thread Shawn Heisey
On 9/12/2013 2:14 AM, Per Steffensen wrote:
 Starting from an empty collection. Things are fine wrt
 storing/indexing speed for the first two-three hours (100M docs per
 hour), then speed goes down dramatically, to an, for us, unacceptable
 level (max 10M per hour). At the same time as speed goes down, we see
 that I/O wait increases dramatically. I am not 100% sure, but quick
 investigation has shown that this is due to almost constant merging.

While constant merging is contributing to the slowdown, I would guess
that your index is simply too big for the amount of RAM that you have.
Let's ignore for a minute that you're distributed and just concentrate
on one machine.

After three hours of indexing, you have nearly 300 million documents.
If you have a replicationFactor of 1, that's still 50 million documents
per machine.  If your replicationFactor is 2, you've got 100 million
documents per machine.  Let's focus on the smaller number for a minute.

50 million documents in an index, even if they are small documents, is
probably going to result in an index size of at least 20GB, and quite
possibly larger.  In order to make Solr function with that many
documents, I would guess that you have a heap that's at least 4GB in size.

With only 8GB on the machine, this doesn't leave much RAM for the OS
disk cache.  If we assume that you have 4GB left for caching, then I
would expect to see problems about the time your per-machine indexes hit
15GB in size.  If you are making it beyond that with a total of 300
million documents, then I am impressed.

Two things are going to happen when you have enough documents:  1) You
are going to fill up your Java heap and Java will need to do frequent
collections to free up enough RAM for normal operation.  When this
problem gets bad enough, the frequent collections will be *full* GCs,
which are REALLY slow.  2) The index will be so big that the OS disk
cache cannot effectively cache it.  I suspect that the latter is more of
the problem, but both might be happening at nearly the same time.

When dealing with an index of this size, you want as much RAM as you can
possibly afford.  I don't think I would try what you are doing without
at least 64GB per machine, and I would probably use at least an 8GB heap
on each one, quite possibly larger.  With a heap that large, extreme GC
tuning becomes a necessity.

To cut down on the amount of merging, I go with a fairly large
mergeFactor, but mergeFactor is basically deprecated for
TieredMergePolicy, there's a new way to configure it now.  Here's the
indexConfig settings that I use on my dev server:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM-${solr.core.name}.txt">false</infoStream>
</indexConfig>

Thanks,
Shawn



Re: Indexing-speed issues (chart included)

2011-06-21 Thread Mathias Hodler
Sorry, here are some details:

requestHandler: XmlUpdateRequestHandler
protocol: http (10 concurrent threads)
document: 1kb size, 15 fields

cpu load: 20%
memory usage: 50%

But generally speaking, is that normal, or must there be something wrong with my
configuration, ...



2011/6/17 Erick Erickson erickerick...@gmail.com

 Well, it's kinda hard to say anything pertinent with so little
 information. How are you indexing things? What kind of documents?
 How are you feeding docs to Solr?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best
 Erick

 On Fri, Jun 17, 2011 at 8:10 AM, Mark Schoy hei...@gmx.de wrote:
  Hi,
 
  If I start indexing documents it getting slower the more documents were
  added without commiting and optimizing:
 
  http://imageshack.us/photo/my-images/695/solrchart.png/
 
  I've changed the mergeFactor from 10 to 30, changed maxDocs
 (100,1000,1)
  but it always getting slower the more documents were added.
  If I'm using elasticsearch which is also based on lucene I'm getting
  constant indexing rates (without commiting and optimizing too)
 
  Does anybody know whats wrong?
 



Re: Indexing-speed issues (chart included)

2011-06-17 Thread Erick Erickson
Well, it's kinda hard to say anything pertinent with so little
information. How are you indexing things? What kind of documents?
How are you feeding docs to Solr?

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Fri, Jun 17, 2011 at 8:10 AM, Mark Schoy hei...@gmx.de wrote:
 Hi,

 If I start indexing documents it getting slower the more documents were
 added without commiting and optimizing:

 http://imageshack.us/photo/my-images/695/solrchart.png/

 I've changed the mergeFactor from 10 to 30, changed maxDocs (100,1000,1)
 but it always getting slower the more documents were added.
 If I'm using elasticsearch which is also based on lucene I'm getting
 constant indexing rates (without commiting and optimizing too)

 Does anybody know whats wrong?



Re: Indexing-speed issues (chart included)

2011-06-17 Thread Mark Schoy
Sorry, here are some details:

requestHandler: XmlUpdateRequestHandler
protocol: http (10 concurrent threads)
document: 1kb size, 15 fields

cpu load: 20%
memory usage: 50%

But generally speaking, is that normal, or must there be something wrong with my
configuration, ...

2011/6/17 Erick Erickson erickerick...@gmail.com

 Well, it's kinda hard to say anything pertinent with so little
 information. How are you indexing things? What kind of documents?
 How are you feeding docs to Solr?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best
 Erick

 On Fri, Jun 17, 2011 at 8:10 AM, Mark Schoy hei...@gmx.de wrote:
  Hi,
 
  If I start indexing documents it getting slower the more documents were
  added without commiting and optimizing:
 
  http://imageshack.us/photo/my-images/695/solrchart.png/
 
  I've changed the mergeFactor from 10 to 30, changed maxDocs
 (100,1000,1)
  but it always getting slower the more documents were added.
  If I'm using elasticsearch which is also based on lucene I'm getting
  constant indexing rates (without commiting and optimizing too)
 
  Does anybody know whats wrong?
 



Re: Indexing-speed issues (chart included)

2011-06-17 Thread Erick Erickson
No, generally this isn't what I'd expect. There will be periodic
slowdowns when segments are flushed (I'm assuming
you're not using trunk, there have been speedups here, see:

http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/)

Does your config have any autocommit parameters set? You
might be committing without knowing you are.

Best
Erick

On Fri, Jun 17, 2011 at 8:34 AM, Mark Schoy hei...@gmx.de wrote:
 Sorry, here are some details:

 requestHandler: XmlUpdateRequesetHandler
 protocol: http (10 concurrend threads)
 document: 1kb size, 15 fields

 cpu load: 20%
 memory usage: 50%

 But generally speaking, is that normal or must be something wrong with my
 configuration, ...

 2011/6/17 Erick Erickson erickerick...@gmail.com

 Well, it's kinda hard to say anything pertinent with so little
 information. How are you indexing things? What kind of documents?
 How are you feeding docs to Solr?

 You might review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best
 Erick

 On Fri, Jun 17, 2011 at 8:10 AM, Mark Schoy hei...@gmx.de wrote:
  Hi,
 
  If I start indexing documents it getting slower the more documents were
  added without commiting and optimizing:
 
  http://imageshack.us/photo/my-images/695/solrchart.png/
 
  I've changed the mergeFactor from 10 to 30, changed maxDocs
 (100,1000,1)
  but it always getting slower the more documents were added.
  If I'm using elasticsearch which is also based on lucene I'm getting
  constant indexing rates (without commiting and optimizing too)
 
  Does anybody know whats wrong?
 




RE: Solr index - Size and indexing speed

2009-08-29 Thread engy.ali

Hi, 

Thanks for your reply.

I will work on your suggestion for using only one solr instance.

I tried to merge the 15 indexes again, and I found out that the new merged
index (without optimization) size was about 351 GB, but when I optimize it
the size returns to 411 GB. Why?

I thought that optimization would decrease the index size, or at least keep it
equal to the size before optimization.



Funtick wrote:
 
 Hi,
 
 Can you try to use single SOLR instance with heavy RAM (so that
 ramBufferSizeMB=8192 for instance) and mergeFactor=10? Single SOLR
 instance
 is fast enough ( 100 client threads of Tomcat; configurable) - I usually
 prefer single instance for single writable box with heavy RAM allocation
 and good I/O.
 
 Merging 15 indexes and 4-times larger size could happen, for instance,
 because of differences in SOLR Schema and Lucene; ensure that schema is
 the
 same (using Luke for instance). SOLR 1.4 has some new powerful features
 such
 as document-term cache stored somewhere (uninverted index) (Yonik), term
 vectors, stored=true, copyField, etc. 
 
 Do not do commit per 100; do it once at the end...
 
 
 
 -Original Message-
 From: engy.ali [mailto:omeshm...@hotmail.com] 
 Sent: August-25-09 3:31 PM
 To: solr-user@lucene.apache.org
 Subject: Solr index - Size and indexing speed
 
 
  Summary
 ===
 
 I had about 120,000 object of total size 71.2 GB, those objects are
 already
 indexed using Lucene. The index size is about 111 GB.
 
 I tried to use solr 1.4 nightly build to index the same collection. I
 divided collection on three servers, each server had 5 solr instances (not
 solr cores) up and running. 
 
 After collection had been indexed, i merge the 15 indexes.
 
 Problems
 ==
 
 1. The new merged index size is about 411 GB (i.e: 4 times larger than old
 index using lucene)
 
 I tried to index only on object using lucene and same object using solr to
 verify the size and the result was that the new index is about twice size
 of
 old index.
 
 DO you have any idea what might be the reason?
 
 
 2. the indexing speed is slow, 100 object on single solr instance were
 indexed in 1 hour so i estimated that 1000 on single instance can be done
 in
 10 hours, but that was not the case, the indexing time exceeds estimated
 time by about 12 hour.
 
 is that might be related to the growth of index?if not, so what might be
 the
 reason.
 
 Note: I do a commit/100 object and an optimize by the end of the whole
 operation. I also changed the mergeFactor from 10 to 15.
 
 
 3.  I google and found out that solr is using an inverted index, but I
 want
 to know what is the internal structure of solr index,for example if i have
 a
 word and its stems, how it will be store in the index 
 
 Thanks, 
 Engy
 -- 
 View this message in context:
 http://www.nabble.com/Solr-index---Size-and-indexing-speed-tp25140702p251407
 02.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Solr-index---Size-and-indexing-speed-tp25140702p25201981.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr index - Size and indexing speed

2009-08-29 Thread Yonik Seeley
On Sat, Aug 29, 2009 at 7:09 AM, engy.aliomeshm...@hotmail.com wrote:
 I thought that optimization would decrease or at least be equal to the same
 index size before optimization

Some index structures like norms are non-sparse.  Index one unique
field with norms and there is a byte allocated for every document in
the index.  Merge that with another index, and the size for the norms
goes to byte[maxDoc()]

-Yonik
http://www.lucidimagination.com


Re: Solr index - Size and indexing speed

2009-08-29 Thread Yonik Seeley
On Tue, Aug 25, 2009 at 3:30 PM, engy.aliomeshm...@hotmail.com wrote:

  Summary
 ===

 I had about 120,000 object of total size 71.2 GB, those objects are already
 indexed using Lucene. The index size is about 111 GB.

 I tried to use solr 1.4 nightly build to index the same collection. I
 divided collection on three servers, each server had 5 solr instances (not
 solr cores) up and running.

 After collection had been indexed, i merge the 15 indexes.

 Problems
 ==

 1. The new merged index size is about 411 GB (i.e: 4 times larger than old
 index using lucene)

 I tried to index only on object using lucene and same object using solr to
 verify the size and the result was that the new index is about twice size of
 old index.

 DO you have any idea what might be the reason?

Check out the schema you are using - it may contain copyFields, etc.
You should be able to get to exactly the same size of index as you had
with Lucene (Solr just uses Lucene for indexing after all).

-Yonik
http://www.lucidimagination.com


RE: Solr index - Size and indexing speed

2009-08-29 Thread Fuad Efendi
I tried to merge the 15 indexes again, and I found out that the new merged
index (without opitmization) size was about 351 GB , but when I optimize it
the size return back to 411 GB, Why?


Just as a sample, IOT in Oracle... 


Ok, in plain terms, what does 'optimization' mean? It means that the Map is
physically sorted by Key... For Lucene, the 'map' is 'term -> documentIDs'.

Ok, still no problem... but what if the KEY is compressed (or, for
instance, 'normalized' if you are still thinking in RDBMS terms) and we need to
decompress it to merge the 15 maps?

-Fuad





Solr index - Size and indexing speed

2009-08-25 Thread engy.ali

 Summary
===

I had about 120,000 objects of total size 71.2 GB; those objects are already
indexed using Lucene. The index size is about 111 GB.

I tried to use solr 1.4 nightly build to index the same collection. I
divided collection on three servers, each server had 5 solr instances (not
solr cores) up and running. 

After collection had been indexed, i merge the 15 indexes.

Problems
==

1. The new merged index size is about 411 GB (i.e: 4 times larger than old
index using lucene)

I tried to index only one object using Lucene and the same object using Solr to
verify the size, and the result was that the new index is about twice the size of
the old index.

Do you have any idea what might be the reason?


2. The indexing speed is slow: 100 objects on a single solr instance were
indexed in 1 hour, so I estimated that 1000 on a single instance could be done in
10 hours, but that was not the case; the indexing time exceeded the estimated
time by about 12 hours.

Might that be related to the growth of the index? If not, what might be the
reason?

Note: I do a commit/100 object and an optimize by the end of the whole
operation. I also changed the mergeFactor from 10 to 15.


3. I googled and found out that solr uses an inverted index, but I want
to know the internal structure of the solr index; for example, if I have a
word and its stems, how will they be stored in the index?

Thanks, 
Engy
-- 
View this message in context: 
http://www.nabble.com/Solr-index---Size-and-indexing-speed-tp25140702p25140702.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr index - Size and indexing speed

2009-08-25 Thread Fuad Efendi
Hi,

Can you try to use single SOLR instance with heavy RAM (so that
ramBufferSizeMB=8192 for instance) and mergeFactor=10? Single SOLR instance
is fast enough ( 100 client threads of Tomcat; configurable) - I usually
prefer single instance for single writable box with heavy RAM allocation
and good I/O.

Merging 15 indexes and 4-times larger size could happen, for instance,
because of differences in SOLR Schema and Lucene; ensure that schema is the
same (using Luke for instance). SOLR 1.4 has some new powerful features such
as document-term cache stored somewhere (uninverted index) (Yonik), term
vectors, stored=true, copyField, etc. 

Do not do commit per 100; do it once at the end...



-Original Message-
From: engy.ali [mailto:omeshm...@hotmail.com] 
Sent: August-25-09 3:31 PM
To: solr-user@lucene.apache.org
Subject: Solr index - Size and indexing speed


 Summary
===

I had about 120,000 object of total size 71.2 GB, those objects are already
indexed using Lucene. The index size is about 111 GB.

I tried to use solr 1.4 nightly build to index the same collection. I
divided collection on three servers, each server had 5 solr instances (not
solr cores) up and running. 

After collection had been indexed, i merge the 15 indexes.

Problems
==

1. The new merged index size is about 411 GB (i.e: 4 times larger than old
index using lucene)

I tried to index only on object using lucene and same object using solr to
verify the size and the result was that the new index is about twice size of
old index.

DO you have any idea what might be the reason?


2. the indexing speed is slow, 100 object on single solr instance were
indexed in 1 hour so i estimated that 1000 on single instance can be done in
10 hours, but that was not the case, the indexing time exceeds estimated
time by about 12 hour.

is that might be related to the growth of index?if not, so what might be the
reason.

Note: I do a commit/100 object and an optimize by the end of the whole
operation. I also changed the mergeFactor from 10 to 15.


3.  I google and found out that solr is using an inverted index, but I want
to know what is the internal structure of solr index,for example if i have a
word and its stems, how it will be store in the index 

Thanks, 
Engy
-- 
View this message in context:
http://www.nabble.com/Solr-index---Size-and-indexing-speed-tp25140702p251407
02.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: mergeFactor / indexing speed

2009-08-09 Thread Avlesh Singh

 And - indexing 160k documents now takes 5min instead of 1.5h!

Awesome! It works for all!

(Now I can go relaxed on vacation. :-D )

Take me along!

Cheers
Avlesh

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Juhu, great news, guys. I merged my child entity into the root entity, and
 changed the custom entityprocessor to handle the additional columns
 correctly.
 And - indexing 160k documents now takes 5min instead of 1.5h!

 (Now I can go relaxed on vacation. :-D )


 Conclusion:
 In my case performance was so bad because of constantly querying a database
 on a different machine (network traffic + db query per document).


 Thanks for all your help!
 Chantal


 Avlesh Singh schrieb:

 does DIH call commit periodically, or are things done in one big batch?

  AFAIK, one big batch.


 yes. There is no index available once the full-import started (and the
 searcher has no cache, other wise it still reads from that). There is no
 data (i.e. in the Admin/Luke frontend) visible until the import is finished
 correctly.



Re: mergeFactor / indexing speed

2009-08-07 Thread Chantal Ackermann
Juhu, great news, guys. I merged my child entity into the root entity, 
and changed the custom entityprocessor to handle the additional columns 
correctly.

And - indexing 160k documents now takes 5min instead of 1.5h!

(Now I can go relaxed on vacation. :-D )


Conclusion:
In my case performance was so bad because of constantly querying a 
database on a different machine (network traffic + db query per document).



Thanks for all your help!
Chantal


Avlesh Singh schrieb:

does DIH call commit periodically, or are things done in one big batch?


AFAIK, one big batch.


yes. There is no index available once the full-import started (and the 
searcher has no cache, other wise it still reads from that). There is no 
data (i.e. in the Admin/Luke frontend) visible until the import is 
finished correctly.


Re: mergeFactor / indexing speed

2009-08-07 Thread Shalin Shekhar Mangar
On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Juhu, great news, guys. I merged my child entity into the root entity, and
 changed the custom entityprocessor to handle the additional columns
 correctly.
 And - indexing 160k documents now takes 5min instead of 1.5h!


I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.

-- 
Regards,
Shalin Shekhar Mangar.


Re: mergeFactor / indexing speed

2009-08-07 Thread Chantal Ackermann
Thanks for the tip, Shalin. I'm happy with 6 indexes running in parallel 
and completing in less than 10 min right now, but I'll have a look anyway.



Shalin Shekhar Mangar schrieb:

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:


Juhu, great news, guys. I merged my child entity into the root entity, and
changed the custom entityprocessor to handle the additional columns
correctly.
And - indexing 160k documents now takes 5min instead of 1.5h!



I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.

--
Regards,
Shalin Shekhar Mangar.


Re: mergeFactor / indexing speed

2009-08-06 Thread Chantal Ackermann

Hi all,

to keep this thread up to date... ;-)


d) jdbc batch size
changed to 10. (Was default: 500, then 1000)

The problem with my dih setup is that the root entity query returns a 
huge set (all ids that shall be indexed). A larger fetchsize would be 
good for that query.
The nested entity, however, returns only up to 9 rows, ever. The 
constraints are so strict (by id) that there is no way that any 
additional data could be pre-fetched.
(Actually, anyone using DIH with nested entities should run into that 
problem?)


After changing to 10, I cannot see that this low batch size slowed the 
indexer down (significantly).


As I would like to stick with DIH (instead of dumping the data into CSV 
and importing it then), here is my question:


Do you think it's possible to return (in the nested entity) rows 
independent of the unique id, and let the processor decide when a 
document is complete?
The examples in the wiki always use an ID to get the data for the nested 
entity, so I'm not sure it was planned with that in mind. But as I'm 
already handling multiple db rows for one document, it might not be too 
difficult to change to handling the unique id correctly, as well?
Of course, I would need something like a look ahead to know whether the 
next row is already part of the next document.



Cheers,
Chantal



Concerning the other settings (just fyi):

a) mergeFactor 10 (and also tried 100)
I don't think that changed anything to the worse, rather to the better. 
So, I'll stick with 10 from now on.


b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not 
sure about 1024. I'll stick to 512.





Re: mergeFactor / indexing speed

2009-08-06 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 12:32 PM, Chantal
Ackermannchantal.ackerm...@btelligent.de wrote:
 avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.23    0.00    0.03    0.03   98.71

 Basically, it is doing very little? *scratch*

How often is commit being called?  (a  Lucene commit sync's all of the
index files so a crash won't result in a corrupted index... this can
be costly).

Guys - does DIH call commit periodically, or are things done in one big batch?
Chantal - is autocommit configured in solrconfig.xml?

-Yonik
http://www.lucidimagination.com


Re: mergeFactor / indexing speed

2009-08-06 Thread Avlesh Singh

 does DIH call commit periodically, or are things done in one big batch?

AFAIK, one big batch.

Cheers
Avlesh

On Thu, Aug 6, 2009 at 11:23 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Mon, Aug 3, 2009 at 12:32 PM, Chantal
 Ackermannchantal.ackerm...@btelligent.de wrote:
  avg-cpu:  %user   %nice%sys %iowait   %idle
1.230.000.030.03   98.71
 
  Basically, it is doing very little? *scratch*

 How often is commit being called?  (a  Lucene commit sync's all of the
 index files so a crash won't result in a corrupted index... this can
 be costly).

 Guys - does DIH call commit periodically, or are things done in one big
 batch?
 Chantal - is autocommit configured in solrconfig.xml?

 -Yonik
 http://www.lucidimagination.com



Re: mergeFactor / indexing speed

2009-08-04 Thread Chantal Ackermann

Hi Avlesh,
hi Otis,
hi Grant,
hi all,


(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.


b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.

e) DB server performance:
I agree, ping is definitely not much information. I also did queries 
against it from my own computer (while the indexer ran), which came 
back as fast as usual.
Currently, I don't have any login to ssh to that machine, but I'm going 
to try to get one.


f) Network:
I'll definitely need to have a look at that once I have access to the db 
machine.



g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) for one query. (Fetched rows is 
about 10 times the number of processed documents.)


g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (String 
concatenation),
- if a key already exists, it gets the existing value; if that value is 
a list, it adds the new value to that list; if not, it creates a list 
and adds both the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that can appear 
multiple times or values that must appear only once.


g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*" />
<field column="role" sourceColName="person"
       regex="[^\|]+\|\d+,\d+,\d+,(.*)" />


- to extract a number from an existing number (a bit calculation using 
the script transformer). As that one works on a field that is 
potentially multiValued, it needs to take care of creating and 
populating a list, as well.

<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
    var cat = row.get('cat');
    var mainCat;
    if (cat != null) {
        // check whether cat is an array
        if (cat instanceof java.util.List) {
            var arr = new java.util.ArrayList();
            for (var i = 0; i < cat.size(); i++) {
                // main category sits in the higher bits, so drop the low byte
                mainCat = new java.lang.Integer(cat.get(i) >> 8);
                if (!arr.contains(mainCat)) {
                    arr.add(mainCat);
                }
            }
            row.put('maincat', arr);
        } else { // it is a single value
            var mainCat = new java.lang.Integer(cat >> 8);
            row.put('maincat', mainCat);
        }
    }
    return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)
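
(Putting g.2 and g.3 together: the custom processor and the two transformers
would be wired onto the nested entity roughly like this in data-config.xml;
the entity name is invented and the query is left elided:)

  <entity name="epgValue"
          processor="EpgValueEntityProcessor"
          transformer="RegexTransformer,script:getMainCategory"
          query="...">
    ...
  </entity>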


g.4) fields
the DIH extracts 5 fields from the root entity, 11 fields from the 
nested entity, and the transformers might create an additional 3 (multiValued).
schema.xml defines 21 fields (two additional fields: the timestamp field 
(default=NOW) and a field collecting three other text fields for 
default search (using copy field)):

- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
            generateWordParts="0" generateNumberParts="0" catenateWords="0"
            catenateNumbers="0" catenateAll="0" />
  </analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LengthFilterFactory" min="2" max="5000" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_de.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
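
(The copy-field wiring mentioned in g.4 would look roughly like this in
schema.xml; the source field names are invented for illustration:)

  <field name="title"       type="text_de" indexed="true" stored="true" />
  <field name="subtitle"    type="text_de" indexed="true" stored="true" />
  <field name="description" type="text_de" indexed="true" stored="true" />
  <field name="text"        type="text_de" indexed="true" stored="false" multiValued="true" />

  <copyField source="title"       dest="text" />
  <copyField source="subtitle"    dest="text" />
  <copyField source="description" dest="text" />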


Thank you for taking your time!
Cheers,
Chantal





** EpgValueEntityProcessor.java ***

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class 

Re: mergeFactor / indexing speed

2009-08-03 Thread Chantal Ackermann

Hi all,

I'm still struggling with the index performance. I've moved the indexer
to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour, so far. 
Which means 1,5 hours at least for 200k - which is as fast/slow as 
before (on the less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
 iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice%sys %iowait   %idle
   1.230.000.030.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that 
from my own machine, and did only a ping from the linux box to the db 
server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

  It could very well be the case that you aren't seeing any merges with
  only 20K docs.  Ultimately, if you really want to, you can look in
  your data.dir and count the files.  If you have indexed a lot and have
  an MF of 100 and haven't done an optimize, you will see a lot more
  index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it would be faster to begin with, I could use a larger data set, of
course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is very
stable).

It feels kinda slow to me...
Out of your experience - what would you expect as duration for an index
with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

  Likely, but not guaranteed.  Typically, larger merge factors are good
  for batch indexing, but a lot of that has changed with Lucene's new
  background merger, such that I don't know if it matters as much anymore.

Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and the
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is update
every few hours. I want to put in place an incremental/partial update as
main process, but full indexing might have to be done at certain times
if data has changed completely, or the schema has to be changed/extended.

  No, those are separate things.  The ramBufferSizeMB (although, I like
  the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
  Lucene holds in memory before it has to flush.  MF controls how many
  segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the up-
to-date view on the file system. I tested that. But it's not
necessarily what the current SOLR core is using, isn't it?
Is there a way to check on the actually used mergeFactor (while the
index is running)?

It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the 

Re: mergeFactor / indexing speed

2009-08-03 Thread Avlesh Singh

 avg-cpu:  %user   %nice%sys %iowait   %idle
   1.230.000.030.03   98.71

I agree, real bad statistics, actually.

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.

To me the former appears to be too high and the latter too low (for your
machine configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -

   1. The stock solrconfig.xml comes with two sections, indexDefaults and
   mainIndex. Options in the latter override the former. Just make sure that
   you have the right values in the right place.
   2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database-level optimization (creating views, in-memory tables ...)
   might hold the answer.
   3. Tried playing around with JDBC parameters in the data source? Setting
   the batchSize property to a considerable value might help.

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:02 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Hi all,

 I'm still struggling with the index performance. I've moved the indexer
 to a different machine, now, which is faster and less occupied.

 The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
 running with those settings (and others):
 -server -Xms1G -Xmx7G

 Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
 It has been processing roughly 70k documents in half an hour, so far. Which
 means 1,5 hours at least for 200k - which is as fast/slow as before (on the
 less performant machine).

 The machine is not swapping. It is only using 13% of the memory.
 iostat gives me:
  iostat
 Linux 2.6.9-67.ELsmp  08/03/2009

 avg-cpu:  %user   %nice%sys %iowait   %idle
   1.230.000.030.03   98.71

 Basically, it is doing very little? *scratch*

 The sourcing database is responding as fast as ever. (I checked that from
 my own machine, and did only a ping from the linux box to the db server.)

 Any help, any hint on where to look would be greatly appreciated.


 Thanks!
 Chantal


 Chantal Ackermann schrieb:

 Hi again!

 Thanks for the answer, Grant.

   It could very well be the case that you aren't seeing any merges with
   only 20K docs.  Ultimately, if you really want to, you can look in
   your data.dir and count the files.  If you have indexed a lot and have
   an MF of 100 and haven't done an optimize, you will see a lot more
   index files.

 Do you mean that 20k is not representative enough to test those settings?
 I've chosen the smaller data set so that the index can run completely
 but doesn't take too long at the same time.
 If it would be faster to begin with, I could use a larger data set, of
 course. I still can't believe that 11 minutes is normal (I haven't
 managed to make it run faster or slower than that, that duration is very
 stable).

 It feels kinda slow to me...
 Out of your experience - what would you expect as duration for an index
 with:
 - 21 fields, some using a text type with 6 filters
 - database access using DataImportHandler with a query of (far) less
 than 20ms
 - 2 transformers

 If I knew that indexing time should be shorter than that, at least, I
 would know that something is definitely wrong with what I am doing or
 with the environment I am using.

   Likely, but not guaranteed.  Typically, larger merge factors are good
   for batch indexing, but a lot of that has changed with Lucene's new
   background merger, such that I don't know if it matters as much
 anymore.

 Ok. I also read some posting where it basically said that the default
 parameters are ok. And one shouldn't mess around with them.

 The thing is that our current search setup uses Lucene directly, and the
 indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
 fields are different, the complete setup is different. But it will be
 hard to advertise a new implementation/setup where indexing is three
 times slower - unless I can give some reasons why that is.

 The full index should be fairly fast because the backing data is update
 every few hours. I want to put in place an incremental/partial update as
 main process, but full indexing might have to be done at certain times
 if data has changed completely, or the schema has to be changed/extended.

   No, those are separate things.  The ramBufferSizeMB (although, I like
   the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
   Lucene holds in memory before it has to flush.  MF controls how many
   segments are on disk

 alas! the rum. I had that typo on the commandline before. that's my
 subconscious telling me what I should do when I get home, tonight...

 So, increasing ramBufferSize should lead to higher memory usage,
 shouldn't it? I'm not seeing that. :-(

 I'll try once more with MF 10 and a higher rum... well, you know... ;-)

 Cheers,
 Chantal

 Grant Ingersoll schrieb:

 On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:

  Dear all,

 I want to find 

Re: mergeFactor / indexing speed

2009-08-03 Thread Otis Gospodnetic
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is 
some initial feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  
I'd go back to default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to 
the machine and check its CPU load, memory usage, disk IO

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Chantal Ackermann chantal.ackerm...@btelligent.de
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Monday, August 3, 2009 12:32:12 PM
 Subject: Re: mergeFactor / indexing speed
 
 Hi all,
 
 I'm still struggling with the index performance. I've moved the indexer
 to a different machine, now, which is faster and less occupied.
 
 The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
 running with those settings (and others):
 -server -Xms1G -Xmx7G
 
 Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
 It has been processing roughly 70k documents in half an hour, so far. 
 Which means 1,5 hours at least for 200k - which is as fast/slow as 
 before (on the less performant machine).
 
 The machine is not swapping. It is only using 13% of the memory.
 iostat gives me:
   iostat
 Linux 2.6.9-67.ELsmp  08/03/2009
 
 avg-cpu:  %user   %nice%sys %iowait   %idle
 1.230.000.030.03   98.71
 
 Basically, it is doing very little? *scratch*
 
 The sourcing database is responding as fast as ever. (I checked that 
 from my own machine, and did only a ping from the linux box to the db 
 server.)
 
 Any help, any hint on where to look would be greatly appreciated.
 
 
 Thanks!
 Chantal
 
 
 Chantal Ackermann schrieb:
  Hi again!
 
  Thanks for the answer, Grant.
 
It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the files.  If you have indexed a lot and have
an MF of 100 and haven't done an optimize, you will see a lot more
index files.
 
  Do you mean that 20k is not representative enough to test those settings?
  I've chosen the smaller data set so that the index can run completely
  but doesn't take too long at the same time.
  If it would be faster to begin with, I could use a larger data set, of
  course. I still can't believe that 11 minutes is normal (I haven't
  managed to make it run faster or slower than that, that duration is very
  stable).
 
  It feels kinda slow to me...
  Out of your experience - what would you expect as duration for an index
  with:
  - 21 fields, some using a text type with 6 filters
  - database access using DataImportHandler with a query of (far) less
  than 20ms
  - 2 transformers
 
  If I knew that indexing time should be shorter than that, at least, I
  would know that something is definitely wrong with what I am doing or
  with the environment I am using.
 
Likely, but not guaranteed.  Typically, larger merge factors are good
for batch indexing, but a lot of that has changed with Lucene's new
background merger, such that I don't know if it matters as much anymore.
 
  Ok. I also read some posting where it basically said that the default
  parameters are ok. And one shouldn't mess around with them.
 
  The thing is that our current search setup uses Lucene directly, and the
  indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
  fields are different, the complete setup is different. But it will be
  hard to advertise a new implementation/setup where indexing is three
  times slower - unless I can give some reasons why that is.
 
  The full index should be fairly fast because the backing data is update
  every few hours. I want to put in place an incremental/partial update as
  main process, but full indexing might have to be done at certain times
  if data has changed completely, or the schema has to be changed/extended.
 
No, those are separate things.  The ramBufferSizeMB (although, I like
the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
Lucene holds in memory before it has to flush.  MF controls how many
segments are on disk
 
  alas! the rum. I had that typo on the commandline before. that's my
  subconscious telling me what I should do when I get home, tonight...
 
  So, increasing ramBufferSize should lead to higher memory usage,
  shouldn't it? I'm not seeing that. :-(
 
  I'll try once more with MF 10 and a higher rum... well, you know... ;-)
 
  Cheers,
  Chantal
 
  Grant Ingersoll schrieb:
  On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
 
  Dear all,
 
  I want to find out which settings give the best full index
  performance for my setup

Re: mergeFactor / indexing speed

2009-08-03 Thread Grant Ingersoll
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?


On Aug 3, 2009, at 12:32 PM, Chantal Ackermann wrote:


Hi all,

I'm still struggling with the index performance. I've moved the  
indexer

to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour, so  
far. Which means 1,5 hours at least for 200k - which is as fast/slow  
as before (on the less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice%sys %iowait   %idle
  1.230.000.030.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that  
from my own machine, and did only a ping from the linux box to the  
db server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

 It could very well be the case that you aren't seeing any merges  
with

 only 20K docs.  Ultimately, if you really want to, you can look in
 your data.dir and count the files.  If you have indexed a lot and  
have

 an MF of 100 and haven't done an optimize, you will see a lot more
 index files.

Do you mean that 20k is not representative enough to test those  
settings?

I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it would be faster to begin with, I could use a larger data set,  
of

course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is  
very

stable).

It feels kinda slow to me...
Out of your experience - what would you expect as duration for an  
index

with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

 Likely, but not guaranteed.  Typically, larger merge factors are  
good

 for batch indexing, but a lot of that has changed with Lucene's new
 background merger, such that I don't know if it matters as much  
anymore.


Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly,  
and the

indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is  
update
every few hours. I want to put in place an incremental/partial  
update as
main process, but full indexing might have to be done at certain  
times
if data has changed completely, or the schema has to be changed/ 
extended.


 No, those are separate things.  The ramBufferSizeMB (although, I  
like
 the thought of a rumBufferSizeMB too!  ;-)  ) controls how many  
docs
 Lucene holds in memory before it has to flush.  MF controls how  
many

 segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you  
know... ;-)


Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But  
it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM,  
old

ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the  
up-

to-date view on the file system. I tested that. But 
