Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Shawn,

Thanks for your reply.

The memory settings of my Solr box are:

12G physical memory.
4G for Java (-Xmx4096m).
The index size is around 4G in Solr 4.9; I think it was over 6G in Solr 4.0.

I do think the size of the Java heap is one of the reasons for this slowness.
I'm doing one big commit, and when the ingestion process is 50% finished, I can
see the Solr server has already used over 90% of the full memory.

I'll try to assign more RAM to the Solr JVM. But from your experience, does 4G
sound like a good Java heap size for my scenario? Is there any way to reduce
memory usage at index time? (One thing I know of is to do a few commits instead
of one commit.) My concern is that, given I have 12G in total, if I assign too
much to the Solr server, I may not have enough left for the OS to cache the
Solr index files.

I had a look at the Solr config file but couldn't find anything obviously
wrong. Just wondering, which parts of that config file would affect indexing
time?

Thanks,
Ryan







RE: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Erick,

As Ryan Ernst noticed, those big fields (e.g. majorTextSignalStem) are not
stored. There are a few stored fields in my schema, but they are very small
fields, basically the name or id of the document. I tried turning them off
(storing only the id field) and that didn't make any difference.

Thanks,
Ryan



Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Guys,

Just an update.

I've tried with Solr 4.10 (same code as for Solr 4.9), and it has the same
indexing speed as 4.0. The only problem left now is that Solr 4.10 takes more
memory than 4.0, so I'm trying to figure out the best Java heap size.

I think that proves there is some performance issue with Solr 4.9 when indexing
big documents (even just over 1 MB).

Thanks,
Ryan


Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Alexandre Rafalovitch
Why do one big commit? You could do hard commits along the way but keep
searcher open and not see the changes until the end.

Obviously a separate issue from the memory consumption discussion, but I
thought I'd add it anyway.

Regards,
 Alex



Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Mikhail Khludnev
On Fri, Sep 5, 2014 at 3:22 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

> Why do one big commit? You could do hard commits along the way but keep
> searcher open and not see the changes until the end.


Alexandre,
I don't think it can happen in solr-user list; the next search picks up the
new searcher.

Ryan,
Normally, when to commit is decided by application requirements, i.e. when
updates need to become visible. Memory consumption is governed by
ramBufferSizeMB and maxIndexingThreads. Exceeding the buffer causes a flush to
disk but doesn't trigger a commit.
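
For reference, a minimal sketch of where those two settings live, assuming the indexConfig section of a Solr 4.x solrconfig.xml. The values shown are illustrative placeholders rather than tuned recommendations for this workload:

```xml
<!-- solrconfig.xml (Solr 4.x); values are illustrative, not recommendations -->
<indexConfig>
  <!-- Buffered documents are flushed to disk as a new segment once
       roughly this much RAM is in use. A flush is not a commit. -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <!-- Upper bound on the number of threads indexing concurrently -->
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>
```

With very large documents, a smaller buffer or fewer indexing threads trades some throughput for lower peak heap use; the right balance would have to be measured.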






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Alexandre Rafalovitch
On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev
mkhlud...@griddynamics.com wrote:
>> Why do one big commit? You could do hard commits along the way but keep
>> searcher open and not see the changes until the end.
>
> Alexandre,
> I don't think it can happen in solr-user list; the next search picks up the
> new searcher.

Why not? Isn't that what the Solr example configuration is doing at:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386
?
Hard commit does not reopen the searcher. The soft commit does
(further down), but that can be disabled to get the effect I am
proposing.

What am I missing?

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Erick Erickson
Alexandre:

It Depends (tm) of course. It all hinges on the setting in autocommit,
whether openSearcher is true or false.

In the former case, you, well, open a new searcher. In the latter you don't.

I agree, though, this is all tangential to the memory consumption issue since
the RAM buffer will be flushed regardless of these settings.
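
A minimal sketch of the setup being discussed, assuming the updateHandler section of a Solr 4.x solrconfig.xml. The 15-second interval is an illustrative value, not a recommendation:

```xml
<!-- solrconfig.xml (Solr 4.x); interval is illustrative -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: persists changes to disk periodically, but with
       openSearcher=false the changes do not become visible to searches -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit disabled (-1), so no new searcher is opened automatically -->
  <autoSoftCommit>
    <maxTime>-1</maxTime>
  </autoSoftCommit>
</updateHandler>
```

Under this configuration the new documents would only become searchable once an explicit commit that opens a searcher is issued at the end of the ingestion run.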

FWIW,
Erick



Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-04 Thread Shawn Heisey

One possible source of problems with that particular upgrade is the fact
that stored field compression was added in 4.1, and termvector
compression was added in 4.2.  They are on by default and cannot be
turned off.  The compression is typically fast, but with very large
documents like yours, it might result in pretty major computational
overhead.  It can also require additional java heap, which ties into
what follows:

Another problem might be RAM-related.

If your java heap is very large, or just a little bit too small, there
can be major performance issues from garbage collection.  Based on the
fact that the earlier version performed well, a too-small heap is more
likely than a very large heap.

If your index size is such that it can't be effectively cached by the
amount of total RAM on the machine (minus the java heap assigned to
Solr), that can cause performance problems.  Your index size is likely
to be several gigabytes, and might even reach double-digit gigabytes.
Can you relate those numbers -- index size, java heap size, and total
system RAM?  If you can, it would also be a good idea to share your
solrconfig.xml.

Here's a wiki page that goes into more detail about possible performance
issues.  It doesn't mention the possible compression problem:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-04 Thread Erick Erickson
Ryan:

As it happens, there's a discussion on the dev list about this.

If at all possible, could you try a brief experiment? Turn off
all the storage, i.e. set stored=false on all fields. It's a lot
to ask, but it'd help the discussion.

Or join the discussion at https://issues.apache.org/jira/browse/LUCENE-5914.
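
The experiment above amounts to flipping the stored attribute on each field in schema.xml. A minimal sketch, assuming a hypothetical small field named docName (not taken from Ryan's schema); a full reindex is needed afterwards for the change to take effect:

```xml
<!-- schema.xml: docName is a hypothetical field name;
     only the stored="false" attribute matters for the experiment -->
<field name="docName" type="string" indexed="true" stored="false"/>
```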

Best,
Erick




Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-03 Thread Li, Ryan
I am indexing 2500 documents (up to 50 MB each, 3 MB on average) into a Solr
server. When running on Solr 4.0, I managed to finish indexing in 3 hours.

However, after we upgraded to Solr 4.9, the indexing needs 3 days to finish.

I've done some profiling; the numbers I get are:

size figure of document   time to add (Solr 4.0)   time to add (Solr 4.9)
1.18                      6 sec                    123 sec
2.26                      12 sec                   444 sec
3.35                      18 sec                   over 600 sec
9.65                      46 sec                   timeout

From what I can see, indexing time seems to be O(n) in document size for Solr
4.0 but grows much faster than linearly for Solr 4.9 (closer to O(n^2) between
the first two measurements). I also tried commenting out some copied fields to
narrow down the problem. It seems the size of the document after indexing (we
copy fields, and the more fields we copy, the bigger the index is) is the
dominating factor for indexing time.

Just wondering, has anyone experienced a similar problem? Does that sound like
a bug in Solr, or are we just using Solr 4.9 wrong?

Here is one example of a field definition in my schema file:

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="'+" replacement=""/> <!-- strip off all apostrophe (') characters -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true"
            ignoreCase="true" synonyms="../../resources/type-index-synonyms.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- Used to have  language=English - seems this param is gone in 4.9 -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="'+" replacement=""/> <!-- strip off all apostrophe (') characters -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true"
            ignoreCase="true" synonyms="../../resources/type-query-colloq-synonyms.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- Used to have  language=English - seems this param is gone in 4.9 -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Field:

<field name="majorTextSignalStem" type="text_stem" indexed="true"
       stored="false" multiValued="true" omitNorms="false"/>

Copy:

<copyField dest="majorTextSignalStem" source="majorTextSignalRaw"/>

Thanks,
Ryan