Re: mergeFactor / indexing speed

2009-08-09 Thread Avlesh Singh

 And - indexing 160k documents now takes 5min instead of 1.5h!

Awesome! It works for all!

(Now I can go relaxed on vacation. :-D )

Take me along!

Cheers
Avlesh

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Juhu, great news, guys. I merged my child entity into the root entity, and
 changed the custom entityprocessor to handle the additional columns
 correctly.
 And - indexing 160k documents now takes 5min instead of 1.5h!

 (Now I can go relaxed on vacation. :-D )


 Conclusion:
 In my case performance was so bad because of constantly querying a database
 on a different machine (network traffic + db query per document).


 Thanks for all your help!
 Chantal


 Avlesh Singh schrieb:

 does DIH call commit periodically, or are things done in one big batch?

  AFAIK, one big batch.


 yes. There is no index available once the full-import has started (unless
 the searcher still has a cache, in which case it keeps reading from that).
 No data is visible (e.g. in the Admin/Luke frontend) until the import has
 finished successfully.



Re: mergeFactor / indexing speed

2009-08-07 Thread Chantal Ackermann
Juhu, great news, guys. I merged my child entity into the root entity, 
and changed the custom entityprocessor to handle the additional columns 
correctly.

And - indexing 160k documents now takes 5min instead of 1.5h!

(Now I can go relaxed on vacation. :-D )


Conclusion:
In my case performance was so bad because of constantly querying a 
database on a different machine (network traffic + db query per document).



Thanks for all your help!
Chantal


Avlesh Singh schrieb:

does DIH call commit periodically, or are things done in one big batch?


AFAIK, one big batch.


yes. There is no index available once the full-import has started (unless 
the searcher still has a cache, in which case it keeps reading from that). 
No data is visible (e.g. in the Admin/Luke frontend) until the import has 
finished successfully.


Re: mergeFactor / indexing speed

2009-08-07 Thread Shalin Shekhar Mangar
On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Juhu, great news, guys. I merged my child entity into the root entity, and
 changed the custom entityprocessor to handle the additional columns
 correctly.
 And - indexing 160k documents now takes 5min instead of 1.5h!


I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.
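A hedged sketch of what Shalin's suggestion could look like in a DIH data-config (table and column names here are invented; the where/cache syntax follows the DIH wiki):

```xml
<document>
  <entity name="item" query="select ID, NAME from ITEM">
    <!-- child rows are fetched once and cached in memory, then looked up
         by ITEM_ID, instead of issuing one SQL query per parent document -->
    <entity name="detail"
            processor="CachedSqlEntityProcessor"
            query="select ITEM_ID, VALUE from DETAIL"
            where="ITEM_ID=item.ID"/>
  </entity>
</document>
```

This trades memory for round-trips, which is exactly the bottleneck identified earlier in the thread (one DB query per document over the network).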

-- 
Regards,
Shalin Shekhar Mangar.


Re: mergeFactor / indexing speed

2009-08-07 Thread Chantal Ackermann
Thanks for the tip, Shalin. I'm happy with 6 indexes running in parallel 
and completing in less than 10 min right now, but I'll have a look anyway.



Shalin Shekhar Mangar schrieb:

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:


Juhu, great news, guys. I merged my child entity into the root entity, and
changed the custom entityprocessor to handle the additional columns
correctly.
And - indexing 160k documents now takes 5min instead of 1.5h!



I'm a little late to the party but you may also want to look at
CachedSqlEntityProcessor.

--
Regards,
Shalin Shekhar Mangar.


Re: mergeFactor / indexing speed

2009-08-06 Thread Chantal Ackermann

Hi all,

to keep this thread up to date... ;-)


d) jdbc batch size
changed to 10. (Was default: 500, then 1000)

The problem with my dih setup is that the root entity query returns a 
huge set (all ids that shall be indexed). A larger fetchsize would be 
good for that query.
The nested entity, however, returns only up to 9 rows, ever. The 
constraints are so strict (by id) that there is no way that any 
additional data could be pre-fetched.
(Actually, anyone using DIH with nested entities should run into that 
problem?)


After changing to 10, I cannot see that this low batch size slowed the 
indexer down (significantly).
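For anyone following along, the batch size in question is set on the DIH data source; a sketch with placeholder driver and connection details (not the actual setup from this thread):

```xml
<dataSource type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/SID"
            user="solr" password="secret"
            batchSize="10"/>
```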


As I would like to stick with DIH (instead of dumping the data into CSV 
and importing it then), here is my question:


Do you think it's possible to return (in the nested entity) rows 
independent of the unique id, and let the processor decide when a 
document is complete?
The examples in the wiki always use an ID to get the data for the nested 
entity, so I'm not sure it was planned with that in mind. But as I'm 
already handling multiple db rows for one document, it might not be too 
difficult to change to handling the unique id correctly, as well?
Of course, I would need something like a look ahead to know whether the 
next row is already part of the next document.
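Outside of any Solr classes, the look-ahead idea boils down to buffering one row and cutting a document whenever the id changes; it assumes the query is ordered by id. A self-contained sketch (plain Java; the "id" column name and row-as-Map shape are assumptions):

```java
import java.util.*;

public class LookAheadGrouper {
    private final Iterator<Map<String, Object>> rows;
    private Map<String, Object> buffered; // the one-row look-ahead

    public LookAheadGrouper(Iterator<Map<String, Object>> rows) {
        this.rows = rows;
        this.buffered = rows.hasNext() ? rows.next() : null;
    }

    /** Returns all rows belonging to the next document, or null when the
        input is exhausted. Requires the result set to be ordered by "id". */
    public List<Map<String, Object>> nextDocumentRows() {
        if (buffered == null) return null;
        Object id = buffered.get("id");
        List<Map<String, Object>> docRows = new ArrayList<>();
        while (buffered != null && id.equals(buffered.get("id"))) {
            docRows.add(buffered);
            // advance; the next row stays buffered until its id matches
            buffered = rows.hasNext() ? rows.next() : null;
        }
        return docRows;
    }
}
```

A real EntityProcessor would do this buffering inside nextRow(), but the grouping logic is the same.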



Cheers,
Chantal



Concerning the other settings (just fyi):

a) mergeFactor 10 (and also tried 100)
I don't think that changed anything for the worse, rather for the better. 
So, I'll stick with 10 from now on.


b) ramBufferSizeMB
tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not 
sure about 1024. I'll stick to 512.





Re: mergeFactor / indexing speed

2009-08-06 Thread Yonik Seeley
On Mon, Aug 3, 2009 at 12:32 PM, Chantal
Ackermannchantal.ackerm...@btelligent.de wrote:
 avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.23    0.00    0.03    0.03   98.71

 Basically, it is doing very little? *scratch*

How often is commit being called?  (a  Lucene commit sync's all of the
index files so a crash won't result in a corrupted index... this can
be costly).

Guys - does DIH call commit periodically, or are things done in one big batch?
Chantal - is autocommit configured in solrconfig.xml?
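For reference, the autocommit Yonik asks about is the autoCommit block in solrconfig.xml; if present, something like this triggers periodic commits during indexing (the values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit after 10000 added docs or 60 seconds, whichever comes first -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```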

-Yonik
http://www.lucidimagination.com


Re: mergeFactor / indexing speed

2009-08-06 Thread Avlesh Singh

 does DIH call commit periodically, or are things done in one big batch?

AFAIK, one big batch.

Cheers
Avlesh

On Thu, Aug 6, 2009 at 11:23 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Mon, Aug 3, 2009 at 12:32 PM, Chantal
 Ackermannchantal.ackerm...@btelligent.de wrote:
  avg-cpu:  %user   %nice   %sys  %iowait   %idle
             1.23    0.00    0.03     0.03   98.71
 
  Basically, it is doing very little? *scratch*

 How often is commit being called?  (a  Lucene commit sync's all of the
 index files so a crash won't result in a corrupted index... this can
 be costly).

 Guys - does DIH call commit periodically, or are things done in one big
 batch?
 Chantal - is autocommit configured in solrconfig.xml?

 -Yonik
 http://www.lucidimagination.com



Re: mergeFactor / indexing speed

2009-08-04 Thread Chantal Ackermann

Hi Avlesh,
hi Otis,
hi Grant,
hi all,


(enumerating to keep track of all the input)

a) mergeFactor 1000 too high
I'll change that back to 10. I thought it would make Lucene use more RAM 
before starting IO.


b) ramBufferSize:
OK, or maybe more. I'll keep that in mind.

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size:
I haven't set it. I'll do that.

e) DB server performance:
I agree, ping is definitely not much information. I also did queries 
from my own computer towards it (while the indexer ran) which came back 
as fast as usual.
Currently, I don't have any login to ssh to that machine, but I'm going 
to try get one.


f) Network:
I'll definitely need to have a look at that once I have access to the db 
machine.



g) the data

g.1) nested entity in DIH conf
there is only the root and one nested entity. However, that nested 
entity returns multiple rows (about 10) for one query. (Fetched rows is 
about 10 times the number of processed documents.)


g.2) my custom EntityProcessor
( The code is pasted at the very end of this e-mail. )
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (String 
concatenation),
- if a key already exists, it gets the value, if that value is a list, 
it adds the new value to that list, if it's not a list, it creates one 
and adds the old and the new value to it.
I refrained from adding any business logic to that processor. It treats 
all rows alike, no matter whether they hold values that can appear 
multiple or values that must appear only once.
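In plain Java, the merge rule described in g.2 (first value stays a scalar, a repeated key promotes the entry to a list) could be sketched like this (method and variable names are mine, not from the actual EpgValueEntityProcessor):

```java
import java.util.*;

public class RowMerger {
    /** Adds value under key; a repeated key promotes the entry to a List. */
    @SuppressWarnings("unchecked")
    public static void putMultiValue(Map<String, Object> doc, String key, Object value) {
        Object old = doc.get(key);
        if (old == null) {
            doc.put(key, value);                   // first occurrence: plain value
        } else if (old instanceof List) {
            ((List<Object>) old).add(value);       // already a list: append
        } else {
            List<Object> vals = new ArrayList<>(); // second occurrence: promote
            vals.add(old);
            vals.add(value);
            doc.put(key, vals);
        }
    }
}
```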


g.3) the two transformers
- to split one value into two (regex)
<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*" />
<field column="role" sourceColName="person"
       regex="[^\|]+\|\d+,\d+,\d+,(.*)" />


- to extract a number from an existing number (bit calculation using the 
script transformer). As that one works on a field that is potentially 
multiValued, it needs to take care of creating and populating a list, 
as well.

<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
    var cat = row.get('cat');
    var mainCat;
    if (cat != null) {
        // check whether cat is an array
        if (cat instanceof java.util.List) {
            var arr = new java.util.ArrayList();
            for (var i = 0; i < cat.size(); i++) {
                mainCat = new java.lang.Integer(cat.get(i) >> 8);
                if (!arr.contains(mainCat)) {
                    arr.add(mainCat);
                }
            }
            row.put('maincat', arr);
        } else { // it is a single value
            var mainCat = new java.lang.Integer(cat >> 8);
            row.put('maincat', mainCat);
        }
    }
    return row;
}
]]></script>
(The EpgValueEntityProcessor decides on creating lists on a case by case 
basis: only if a value is specified multiple times for a certain data 
set does it create a list. This is because I didn't want to put any 
complex configuration or business logic into it.)


g.4) fields
the DIH extracts 5 fields from the root entity, 11 fields from the 
nested entity, and the transformers might create additional 3 (multiValued).
schema.xml defines 21 fields (two additional fields: the timestamp field 
(default=NOW) and a field collecting three other text fields for 
default search (using copy field)):

- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField", positionIncrementGap="100"):
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
          generateWordParts="0" generateNumberParts="0" catenateWords="0"
          catenateNumbers="0" catenateAll="0" />
</analyzer>
- 4 text_de (one is the field populated by copying from the 3 others):
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.LengthFilterFactory" min="2" max="5000" />
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords_de.txt" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SnowballPorterFilterFactory" language="German" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>


Thank you for taking your time!
Cheers,
Chantal





** EpgValueEntityProcessor.java ***

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class 

Re: mergeFactor / indexing speed

2009-08-03 Thread Chantal Ackermann

Hi all,

I'm still struggling with the index performance. I've moved the indexer
to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour, so far. 
Which means 1.5 hours at least for 200k - which is as fast/slow as 
before (on the less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
 iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice   %sys  %iowait   %idle
           1.23    0.00    0.03     0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that 
from my own machine, and did only a ping from the linux box to the db 
server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

  It could very well be the case that you aren't seeing any merges with
  only 20K docs.  Ultimately, if you really want to, you can look in
  your data.dir and count the files.  If you have indexed a lot and have
  an MF of 100 and haven't done an optimize, you will see a lot more
  index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it were faster to begin with, I could use a larger data set, of
course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is very
stable).

It feels kinda slow to me...
Out of your experience - what would you expect as duration for an index
with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

  Likely, but not guaranteed.  Typically, larger merge factors are good
  for batch indexing, but a lot of that has changed with Lucene's new
  background merger, such that I don't know if it matters as much anymore.

Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and the
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is updated
every few hours. I want to put in place an incremental/partial update as
main process, but full indexing might have to be done at certain times
if data has changed completely, or the schema has to be changed/extended.

  No, those are separate things.  The ramBufferSizeMB (although, I like
  the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
  Lucene holds in memory before it has to flush.  MF controls how many
  segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the up-
to-date view on the file system. I tested that. But it's not
necessarily what the current SOLR core is using, isn't it?
Is there a way to check on the actually used mergeFactor (while the
index is running)?

It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the 

Re: mergeFactor / indexing speed

2009-08-03 Thread Avlesh Singh

 avg-cpu:  %user   %nice   %sys  %iowait   %idle
            1.23    0.00    0.03     0.03   98.71

I agree, real bad statistics, actually.

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.

To me the former appears to be too high and the latter too low (for your
machine configuration). You can safely increase the ramBufferSize (or
maxBufferedDocs) to a higher value.

Couple of things -

   1. The stock solrconfig.xml comes with two sections, indexDefaults and
   mainIndex. Options in the latter override the former. Just make sure that
   you have the right values in the right place.
   2. Do you have too many nested entities inside the DIH's data-config? If
   yes, a database level optimization (creating views, in memory tables ...)
   might hold the answer.
   3. Tried playing around with JDBC parameters in the data source? Setting
   the batchSize property to a considerable value might help.
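To illustrate point 1: values under mainIndex override indexDefaults for the main index, so a tweak made only in indexDefaults can be silently shadowed. A sketch:

```xml
<indexDefaults>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>256</ramBufferSizeMB>
</indexDefaults>
<mainIndex>
  <!-- this value wins for the main index, whatever indexDefaults says -->
  <mergeFactor>10</mergeFactor>
</mainIndex>
```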

Cheers
Avlesh

On Mon, Aug 3, 2009 at 10:02 PM, Chantal Ackermann 
chantal.ackerm...@btelligent.de wrote:

 Hi all,

 I'm still struggling with the index performance. I've moved the indexer
 to a different machine, now, which is faster and less occupied.

 The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
 running with those settings (and others):
 -server -Xms1G -Xmx7G

 Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
 It has been processing roughly 70k documents in half an hour, so far. Which
 means 1,5 hours at least for 200k - which is as fast/slow as before (on the
 less performant machine).

 The machine is not swapping. It is only using 13% of the memory.
 iostat gives me:
  iostat
 Linux 2.6.9-67.ELsmp  08/03/2009

 avg-cpu:  %user   %nice   %sys  %iowait   %idle
            1.23    0.00    0.03     0.03   98.71

 Basically, it is doing very little? *scratch*

 The sourcing database is responding as fast as ever. (I checked that from
 my own machine, and did only a ping from the linux box to the db server.)

 Any help, any hint on where to look would be greatly appreciated.


 Thanks!
 Chantal


 Chantal Ackermann schrieb:

 Hi again!

 Thanks for the answer, Grant.

   It could very well be the case that you aren't seeing any merges with
   only 20K docs.  Ultimately, if you really want to, you can look in
   your data.dir and count the files.  If you have indexed a lot and have
   an MF of 100 and haven't done an optimize, you will see a lot more
   index files.

 Do you mean that 20k is not representative enough to test those settings?
 I've chosen the smaller data set so that the index can run completely
 but doesn't take too long at the same time.
 If it would be faster to begin with, I could use a larger data set, of
 course. I still can't believe that 11 minutes is normal (I haven't
 managed to make it run faster or slower than that, that duration is very
 stable).

 It feels kinda slow to me...
 Out of your experience - what would you expect as duration for an index
 with:
 - 21 fields, some using a text type with 6 filters
 - database access using DataImportHandler with a query of (far) less
 than 20ms
 - 2 transformers

 If I knew that indexing time should be shorter than that, at least, I
 would know that something is definitely wrong with what I am doing or
 with the environment I am using.

   Likely, but not guaranteed.  Typically, larger merge factors are good
   for batch indexing, but a lot of that has changed with Lucene's new
   background merger, such that I don't know if it matters as much
 anymore.

 Ok. I also read some posting where it basically said that the default
 parameters are ok. And one shouldn't mess around with them.

 The thing is that our current search setup uses Lucene directly, and the
 indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
 fields are different, the complete setup is different. But it will be
 hard to advertise a new implementation/setup where indexing is three
 times slower - unless I can give some reasons why that is.

 The full index should be fairly fast because the backing data is update
 every few hours. I want to put in place an incremental/partial update as
 main process, but full indexing might have to be done at certain times
 if data has changed completely, or the schema has to be changed/extended.

   No, those are separate things.  The ramBufferSizeMB (although, I like
   the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
   Lucene holds in memory before it has to flush.  MF controls how many
   segments are on disk

 alas! the rum. I had that typo on the commandline before. that's my
 subconscious telling me what I should do when I get home, tonight...

 So, increasing ramBufferSize should lead to higher memory usage,
 shouldn't it? I'm not seeing that. :-(

 I'll try once more with MF 10 and a higher rum... well, you know... ;-)

 Cheers,
 Chantal

 Grant Ingersoll schrieb:

 On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:

  Dear all,

 I want to find 

Re: mergeFactor / indexing speed

2009-08-03 Thread Otis Gospodnetic
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is 
some initial feedback:

- mergeFactor of 1000 seems crazy.  mergeFactor is probably not your problem.  
I'd go back to default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- pinging the DB won't tell you much about the DB server's performance - ssh to 
the machine and check its CPU load, memory usage, disk IO

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?


Otis 
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Chantal Ackermann chantal.ackerm...@btelligent.de
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Monday, August 3, 2009 12:32:12 PM
 Subject: Re: mergeFactor / indexing speed
 
 Hi all,
 
 I'm still struggling with the index performance. I've moved the indexer
 to a different machine, now, which is faster and less occupied.
 
 The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
 running with those settings (and others):
 -server -Xms1G -Xmx7G
 
 Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
 It has been processing roughly 70k documents in half an hour, so far. 
 Which means 1,5 hours at least for 200k - which is as fast/slow as 
 before (on the less performant machine).
 
 The machine is not swapping. It is only using 13% of the memory.
 iostat gives me:
   iostat
 Linux 2.6.9-67.ELsmp  08/03/2009
 
 avg-cpu:  %user   %nice   %sys  %iowait   %idle
            1.23    0.00    0.03     0.03   98.71
 
 Basically, it is doing very little? *scratch*
 
 The sourcing database is responding as fast as ever. (I checked that 
 from my own machine, and did only a ping from the linux box to the db 
 server.)
 
 Any help, any hint on where to look would be greatly appreciated.
 
 
 Thanks!
 Chantal
 
 
 Chantal Ackermann schrieb:
  Hi again!
 
  Thanks for the answer, Grant.
 
It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the files.  If you have indexed a lot and have
an MF of 100 and haven't done an optimize, you will see a lot more
index files.
 
  Do you mean that 20k is not representative enough to test those settings?
  I've chosen the smaller data set so that the index can run completely
  but doesn't take too long at the same time.
  If it would be faster to begin with, I could use a larger data set, of
  course. I still can't believe that 11 minutes is normal (I haven't
  managed to make it run faster or slower than that, that duration is very
  stable).
 
  It feels kinda slow to me...
  Out of your experience - what would you expect as duration for an index
  with:
  - 21 fields, some using a text type with 6 filters
  - database access using DataImportHandler with a query of (far) less
  than 20ms
  - 2 transformers
 
  If I knew that indexing time should be shorter than that, at least, I
  would know that something is definitely wrong with what I am doing or
  with the environment I am using.
 
Likely, but not guaranteed.  Typically, larger merge factors are good
for batch indexing, but a lot of that has changed with Lucene's new
background merger, such that I don't know if it matters as much anymore.
 
  Ok. I also read some posting where it basically said that the default
  parameters are ok. And one shouldn't mess around with them.
 
  The thing is that our current search setup uses Lucene directly, and the
  indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
  fields are different, the complete setup is different. But it will be
  hard to advertise a new implementation/setup where indexing is three
  times slower - unless I can give some reasons why that is.
 
  The full index should be fairly fast because the backing data is update
  every few hours. I want to put in place an incremental/partial update as
  main process, but full indexing might have to be done at certain times
  if data has changed completely, or the schema has to be changed/extended.
 
No, those are separate things.  The ramBufferSizeMB (although, I like
the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
Lucene holds in memory before it has to flush.  MF controls how many
segments are on disk
 
  alas! the rum. I had that typo on the commandline before. that's my
  subconscious telling me what I should do when I get home, tonight...
 
  So, increasing ramBufferSize should lead to higher memory usage,
  shouldn't it? I'm not seeing that. :-(
 
  I'll try once more with MF 10 and a higher rum... well, you know... ;-)
 
  Cheers,
  Chantal
 
  Grant Ingersoll schrieb:
  On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
 
  Dear all,
 
  I want to find out which settings give the best full index
  performance for my setup

Re: mergeFactor / indexing speed

2009-08-03 Thread Grant Ingersoll
How big are your documents?  I haven't benchmarked DIH, so I am not  
sure what to expect, but it does seem like something isn't right.  Can  
you fully describe how you are indexing?  Have you done any profiling?


On Aug 3, 2009, at 12:32 PM, Chantal Ackermann wrote:


Hi all,

I'm still struggling with the index performance. I've moved the  
indexer

to a different machine, now, which is faster and less occupied.

The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
running with those settings (and others):
-server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
It has been processing roughly 70k documents in half an hour, so  
far. Which means 1,5 hours at least for 200k - which is as fast/slow  
as before (on the less performant machine).


The machine is not swapping. It is only using 13% of the memory.
iostat gives me:
iostat
Linux 2.6.9-67.ELsmp  08/03/2009

avg-cpu:  %user   %nice   %sys  %iowait   %idle
           1.23    0.00    0.03     0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that  
from my own machine, and did only a ping from the linux box to the  
db server.)


Any help, any hint on where to look would be greatly appreciated.


Thanks!
Chantal


Chantal Ackermann schrieb:

Hi again!

Thanks for the answer, Grant.

  It could very well be the case that you aren't seeing any merges with
  only 20K docs.  Ultimately, if you really want to, you can look in
  your data.dir and count the files.  If you have indexed a lot and have
  an MF of 100 and haven't done an optimize, you will see a lot more
  index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the index can run completely
but doesn't take too long at the same time.
If it were faster to begin with, I could use a larger data set, of
course. I still can't believe that 11 minutes is normal (I haven't
managed to make it run faster or slower than that, that duration is very
stable).

It feels kinda slow to me...
Out of your experience - what would you expect as duration for an index
with:
- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less
than 20ms
- 2 transformers

If I knew that indexing time should be shorter than that, at least, I
would know that something is definitely wrong with what I am doing or
with the environment I am using.

  Likely, but not guaranteed.  Typically, larger merge factors are good
  for batch indexing, but a lot of that has changed with Lucene's new
  background merger, such that I don't know if it matters as much anymore.

Ok. I also read some posting where it basically said that the default
parameters are ok. And one shouldn't mess around with them.

The thing is that our current search setup uses Lucene directly, and the
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
fields are different, the complete setup is different. But it will be
hard to advertise a new implementation/setup where indexing is three
times slower - unless I can give some reasons why that is.

The full index should be fairly fast because the backing data is updated
every few hours. I want to put in place an incremental/partial update as
main process, but full indexing might have to be done at certain times
if data has changed completely, or the schema has to be changed/extended.

  No, those are separate things.  The ramBufferSizeMB (although, I like
  the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
  Lucene holds in memory before it has to flush.  MF controls how many
  segments are on disk

alas! the rum. I had that typo on the commandline before. that's my
subconscious telling me what I should do when I get home, tonight...

So, increasing ramBufferSize should lead to higher memory usage,
shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:

Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the
up-to-date view on the file system. I tested that. But 

mergeFactor / indexing speed

2009-07-31 Thread Chantal Ackermann

Dear all,

I want to find out which settings give the best full index performance 
for my setup.
Therefore, I have been running a small index (less than 20k documents) 
with a mergeFactor of 10 and 100.

In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it 
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old ATA 
disk).



Now, I have three questions:

1. How can I check which mergeFactor is really being used? The 
solrconfig.xml that is displayed in the admin application is the 
up-to-date view of the file system. I tested that. But it's not 
necessarily what the current Solr core is using, is it?
Is there a way to check the mergeFactor actually in use (while 
indexing is running)?
2. I changed the mergeFactor in both available settings (default and 
main index) in the solrconfig.xml file of the core I am reindexing. Is 
that the correct place? Should a change in performance be noticeable 
when increasing from 10 to 100? Or is the change imperceptible because 
the requests for data take far longer than the indexing itself?
3. Do I have to increase rumBufferSizeMB if I increase mergeFactor? (Or 
some other setting?)
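
(For reference, both settings live in the index sections of a Solr 1.x 
solrconfig.xml; this is only a sketch with illustrative values, not a 
recommendation:

```xml
<!-- solrconfig.xml (Solr 1.x layout); values are illustrative only -->
<indexDefaults>
  <mergeFactor>10</mergeFactor>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>
<mainIndex>
  <!-- settings here override indexDefaults for the main index -->
  <mergeFactor>10</mergeFactor>
</mainIndex>
```

)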


(I am still trying to get profiling information on how much application 
time is eaten up by db connection/requests/processing.
The root entity query takes about 20ms on average; the child entity 
query less than 10ms.
I have my custom entity processor running on the child entity that 
populates the map using a multi-row result set. I have also attached one 
regex and one script transformer.)
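
(For context, such a chain is wired up in DIH's data-config.xml roughly 
as below; the processor class, entity, and column names here are all 
made-up placeholders, not my actual config:

```xml
<!-- data-config.xml sketch; names and queries are hypothetical -->
<entity name="child"
        processor="com.example.MultiRowEntityProcessor"
        transformer="RegexTransformer,script:cleanUp"
        query="SELECT name, value FROM child WHERE parent_id = '${parent.id}'">
  <!-- RegexTransformer pulls the numeric part out of the value column -->
  <field column="code" sourceColName="value" regex="(\d+)"/>
</entity>
```

)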


Thank you for any tips!
Chantal



--
Chantal Ackermann


Re: mergeFactor / indexing speed

2009-07-31 Thread Grant Ingersoll


On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index  
performance for my setup.
Therefore, I have been running a small index (less than 20k  
documents) with a mergeFactor of 10 and 100.

In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it  
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old  
ATA disk).



Now, I have three questions:

1. How can I check which mergeFactor is really being used? The  
solrconfig.xml that is displayed in the admin application is the up- 
to-date view on the file system. I tested that. But it's not  
necessarily what the current SOLR core is using, isn't it?
Is there a way to check on the actually used mergeFactor (while the  
index is running)?


It could very well be the case that you aren't seeing any merges with  
only 20K docs.  Ultimately, if you really want to, you can look in  
your data.dir and count the files.  If you have indexed a lot and have  
an MF of 100 and haven't done an optimize, you will see a lot more  
index files.
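
That check is easy to script; a minimal sketch (the default path is just 
the current directory, so point it at your actual data.dir/index):

```shell
#!/bin/sh
# Count the files in a Lucene index directory. With mergeFactor 100 and
# no optimize, the count stays far higher than with mergeFactor 10,
# because small segments pile up before they are merged.
count_index_files() {
  ls -1 "$1" | wc -l
}

# The argument is an assumption: pass your solr data.dir/index path.
count_index_files "${1:-.}"
```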



2. I changed the mergeFactor in both available settings (default and  
main index) in the solrconfig.xml file of the core I am reindexing.  
That is the correct place? Should a change in performance be  
noticeable when increasing from 10 to 100? Or is the change not  
perceivable if the requests for data are taking far longer than all  
the indexing itself?


Likely, but not guaranteed.  Typically, larger merge factors are good  
for batch indexing, but a lot of that has changed with Lucene's new  
background merger, such that I don't know if it matters as much anymore.



3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?  
(Or some other setting?)


No, those are separate things.  The ramBufferSizeMB (although, I like  
the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs  
Lucene holds in memory before it has to flush.  MF controls how many  
segments are on disk




(I am still trying to get profiling information on how much  
application time is eaten up by db connection/requests/processing.
The root entity query is about (average) 20ms. The child entity  
query is less than 10ms.
I have my custom entity processor running on the child entity that  
populates the map using a multi-row result set. I have also attached  
one regex and one script transformer.)


Thank you for any tips!
Chantal



--
Chantal Ackermann


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: mergeFactor / indexing speed

2009-07-31 Thread Chantal Ackermann

Hi again!

Thanks for the answer, Grant.

 It could very well be the case that you aren't seeing any merges with
 only 20K docs.  Ultimately, if you really want to, you can look in
 your data.dir and count the files.  If you have indexed a lot and have
 an MF of 100 and haven't done an optimize, you will see a lot more
 index files.

Do you mean that 20k is not representative enough to test those settings?
I've chosen the smaller data set so that the indexing can run to 
completion without taking too long.
If it were faster to begin with, I could use a larger data set, of 
course. I still can't believe that 11 minutes is normal (I haven't 
managed to make it run faster or slower than that; the duration is very 
stable).


It feels kinda slow to me...
From your experience, what duration would you expect for an index 
with:

- 21 fields, some using a text type with 6 filters
- database access using DataImportHandler with a query of (far) less 
than 20ms

- 2 transformers

If I knew that indexing time should be shorter than that, at least, I 
would know that something is definitely wrong with what I am doing or 
with the environment I am using.


 Likely, but not guaranteed.  Typically, larger merge factors are good
 for batch indexing, but a lot of that has changed with Lucene's new
 background merger, such that I don't know if it matters as much anymore.

OK. I also read a posting that basically said the default parameters 
are fine and that one shouldn't mess around with them.


The thing is that our current search setup uses Lucene directly, and the 
indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The 
fields are different, the complete setup is different. But it will be 
hard to advertise a new implementation/setup where indexing is three 
times slower - unless I can give some reasons why that is.


The full index should be fairly fast because the backing data is updated 
every few hours. I want to put in place an incremental/partial update as 
main process, but full indexing might have to be done at certain times 
if data has changed completely, or the schema has to be changed/extended.
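
(For the incremental part, DIH's delta-import can be driven off a 
timestamp column; a sketch of the relevant data-config.xml attributes, 
where the table and column names are invented:

```xml
<!-- data-config.xml delta-import sketch; triggered via
     /dataimport?command=delta-import. Table/column names are invented. -->
<entity name="doc" pk="ID"
        query="SELECT * FROM docs"
        deltaQuery="SELECT ID FROM docs
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM docs WHERE ID = '${dataimporter.delta.ID}'"/>
```

)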


 No, those are separate things.  The ramBufferSizeMB (although, I like
 the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
 Lucene holds in memory before it has to flush.  MF controls how many
 segments are on disk

Alas! The rum. I had that typo on the command line before. That's my 
subconscious telling me what I should do when I get home tonight...


So, increasing ramBufferSize should lead to higher memory usage, 
shouldn't it? I'm not seeing that. :-(


I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal

Grant Ingersoll schrieb:

On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:


Dear all,

I want to find out which settings give the best full index
performance for my setup.
Therefore, I have been running a small index (less than 20k
documents) with a mergeFactor of 10 and 100.
In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>
mergeFactor: 100
/admin/cores?action=RELOAD
<str name="Time taken">0:11:44.441</str>
Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
ATA disk).


Now, I have three questions:

1. How can I check which mergeFactor is really being used? The
solrconfig.xml that is displayed in the admin application is the up-
to-date view on the file system. I tested that. But it's not
necessarily what the current SOLR core is using, isn't it?
Is there a way to check on the actually used mergeFactor (while the
index is running)?


It could very well be the case that you aren't seeing any merges with
only 20K docs.  Ultimately, if you really want to, you can look in
your data.dir and count the files.  If you have indexed a lot and have
an MF of 100 and haven't done an optimize, you will see a lot more
index files.



2. I changed the mergeFactor in both available settings (default and
main index) in the solrconfig.xml file of the core I am reindexing.
That is the correct place? Should a change in performance be
noticeable when increasing from 10 to 100? Or is the change not
perceivable if the requests for data are taking far longer than all
the indexing itself?


Likely, but not guaranteed.  Typically, larger merge factors are good
for batch indexing, but a lot of that has changed with Lucene's new
background merger, such that I don't know if it matters as much anymore.



3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
(Or some other setting?)


No, those are separate things.  The ramBufferSizeMB (although, I like
the thought of a rumBufferSizeMB too!  ;-)  ) controls how many docs
Lucene holds in memory before it has to flush.  MF controls how many
segments are on disk


(I am still trying to get