Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-11 Thread Sean Tong
Hi,

We plan to upgrade the Lucene library in our application from 2.4.1 to 3.5.0. I 
have been running the benchmark tests that come with Lucene. To my surprise, I 
found that indexing in 3.5.0 is significantly slower than in 2.4.1 for the 
Wikipedia data.

Attached is the algorithm for the tests. The tests used the default Lucene 
settings for flush memory size and merge factor, and 512M of memory was used for 
the tasks. The test machine is a 64-bit Windows 7 machine with an Intel Core i7.

The command:
%ant -Dtask.alg=conf/wikipedia-default.alg -Dtask.mem=512M run-task

Here are the test results:

Lucene 2.4.1

 [java] > Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
 [java] Operation    round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
 [java] MAddDocs_20      0  16.00   10       1          20   1,609.1      124.29   89,218,496  241,631,232
 [java] MAddDocs_20      1  16.00   10       1          20   1,746.4      114.52  102,365,864  241,762,304
 [java] MAddDocs_20      2  16.00   10       1          20   1,566.8      127.65   69,428,144  174,194,688

Lucene 2.9.4

 [java] > Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
 [java] Operation    round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
 [java] MAddDocs_20      0  16.00   10       1          20  1,046.49      191.12   82,676,152  139,657,216
 [java] MAddDocs_20      1  16.00   10       1          20  1,165.35      171.62  119,364,128  156,762,112
 [java] MAddDocs_20      2  16.00   10       1          20  1,245.86      160.53   50,361,760  137,625,600

Lucene 3.5.0

 [java] > Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
 [java] Operation    round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
 [java] MAddDocs_20      0  16.00   10       1          20    676.48      295.65   70,917,592  129,695,744
 [java] MAddDocs_20      1  16.00   10       1          20    626.13      319.42   50,329,552   94,240,768
 [java] MAddDocs_20      2  16.00   10       1          20    687.68      290.83   57,732,640   92,864,512


Indexing with 2.4.1 is about 2.3x as fast as with 3.5.0. Did I miss 
any settings or configurations?
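
As a quick check on that ratio, here is a small Python sketch that recomputes it from the rec/s values in the reports above. The dictionary layout and function names are illustrative only and are not part of the Lucene benchmark.

```python
# rec/s per round, copied from the MAddDocs reports above
rec_per_sec = {
    "2.4.1": [1609.1, 1746.4, 1566.8],
    "2.9.4": [1046.49, 1165.35, 1245.86],
    "3.5.0": [676.48, 626.13, 687.68],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def speedup(fast, slow):
    """Ratio of mean rec/s between two versions."""
    return mean(rec_per_sec[fast]) / mean(rec_per_sec[slow])


# Round-by-round ratio of 2.4.1 over 3.5.0
per_round = [a / b for a, b in zip(rec_per_sec["2.4.1"], rec_per_sec["3.5.0"])]

print("per round:", [round(r, 2) for r in per_round])
print("2.4.1 vs 3.5.0 (mean):", round(speedup("2.4.1", "3.5.0"), 2))
print("2.9.4 vs 3.5.0 (mean):", round(speedup("2.9.4", "3.5.0"), 2))
```

Depending on the round, the 2.4.1 to 3.5.0 ratio comes out between roughly 2.3 and 2.8, so the 2.3x figure sits at the low end of what these three rounds show.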

Thanks,

Sean




RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-12 Thread Sean Tong
It looks like the attachment with the algorithm is missing from my last email, 
so I have pasted the text here. Thanks in advance for any help.

#Start of the wikipedia-default.alg file

merge.factor=mrg:10:10:10
max.field.length=2147483647
#max.buffered=buf:10:10:100:100
ram.flush.mb=flush:16:16:16

compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
log.step=5000

docs.file=temp/enwiki-20070527-pages-articles.xml

content.source=org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource

query.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------

{ "Rounds"

    ResetSystemErase

    { "Populate"
        CreateIndex
        { "MAddDocs" AddDoc > : 20
        CloseIndex
    }

    NewRound

} : 3

RepSumByName
RepSumByPrefRound MAddDocs

#End of wikipedia-default.alg file
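
Values such as merge.factor=mrg:10:10:10 and ram.flush.mb=flush:16:16:16 appear to use the benchmark's per-round form: a report column name followed by one value per round, which would explain the "flush" and "mrg" columns in the reports. The Python sketch below only illustrates that reading. The function names are hypothetical, the cycling of values when rounds outnumber values is an assumption, and flush:16:64:128 is a made-up example rather than a setting used in these tests.

```python
def parse_property(value):
    """Split an .alg property value into (column_name, per_round_values).

    A plain value such as "true" has no column name and a single value.
    """
    if ":" not in value:
        return None, [value]
    name, *values = value.split(":")
    return name, values


def value_for_round(value, round_number):
    """Pick the value for a round (assumption: values cycle if rounds run out)."""
    _, values = parse_property(value)
    return values[round_number % len(values)]


# Settings from the file above: the same value in every round
print(parse_property("flush:16:16:16"))
print(parse_property("mrg:10:10:10"))

# Hypothetical variant that would try a different flush size in each round
for round_number in range(3):
    print(round_number, value_for_round("flush:16:64:128", round_number))
```

Under this reading, all three rounds in the file above ran with the same 16 MB flush size and a merge factor of 10, so the rounds are repeats of one configuration rather than a parameter sweep.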

Thanks,

Sean


RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-12 Thread Sean Tong
Thanks Simon for your response.

I just re-ran the 3.5 benchmark with the ClassicAnalyzer. Here are the results:

 [java] > Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
 [java] Operation    round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
 [java] MAddDocs_20      0  16.00   10       1          20    715.76      279.42   48,828,144  128,057,344
 [java] MAddDocs_20      1  16.00   10       1          20    679.04      294.53   68,321,424   85,721,088
 [java] MAddDocs_20      2  16.00   10       1          20    761.95      262.49   63,139,256   91,881,472

The performance is slightly better than with StandardAnalyzer, but it is 
still much worse than the performance with 2.4.1.
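
To put numbers on "slightly better" and "much worse", the Python sketch below compares the mean rec/s of the three runs. The figures are copied from the reports in this thread; the variable names are illustrative only.

```python
# rec/s per round, copied from the reports in this thread
standard_35 = [676.48, 626.13, 687.68]   # 3.5.0 with StandardAnalyzer
classic_35 = [715.76, 679.04, 761.95]    # 3.5.0 with ClassicAnalyzer
baseline_241 = [1609.1, 1746.4, 1566.8]  # 2.4.1


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Percentage gain from switching analyzers in 3.5.0
gain_pct = (mean(classic_35) / mean(standard_35) - 1) * 100

# How much faster 2.4.1 still is than 3.5.0 with ClassicAnalyzer
remaining_gap = mean(baseline_241) / mean(classic_35)

print(f"ClassicAnalyzer gain over StandardAnalyzer: {gain_pct:.1f}%")
print(f"2.4.1 vs 3.5.0 with ClassicAnalyzer: {remaining_gap:.2f}x")
```

On these numbers the analyzer switch recovers under a tenth of the throughput, while 2.4.1 remains more than twice as fast, so the analyzer alone does not seem to account for the gap.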

Sean

-Original Message-
From: Simon Willnauer [mailto:simon.willna...@googlemail.com] 
Sent: Monday, December 12, 2011 12:20 PM
To: java-user@lucene.apache.org
Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

hey,

can you try to use the ClassicAnalyzer instead of StandardAnalyzer in 3.5? 
In 3.5 the StandardAnalyzer is a different implementation than in 2.9 and 2.4. 
Or rerun the 2.4 benchmarks with a WhitespaceAnalyzer just for the comparison.

simon


RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-13 Thread Sean Tong
Simon,

I checked the indexes with Luke, and you were right that the benchmarks may 
not be comparable: the indexes have different numbers of fields and different 
index functionality. You can find summaries of the index statistics for 2.4.1, 
2.9.4, and 3.5.0 below.

I also ran the benchmarks on the standard Reuters data (20,000 documents) 
with the default settings (merge factor 10, flush memory 16m). The 2.4.1 and 
3.5.0 results were similar, even though those indexes also had different 
numbers of fields.

In your experience, is 3.5.0 indexing performance at least as good as that of 
2.4.1 or 2.9.4? Do you have any recommendations on indexing 
configurations/settings? In my experiments, larger flush memory settings 
(e.g. 64m or 128m) helped indexing performance for the Wikipedia data in 3.5.0 
but not so much in 2.4.1.

Thanks,

Sean

*
Here are the data for the Wikipedia indexes:

3.5.0

Number of fields: 7
Number of documents: 200,000
Number of terms: 4,849,195
Has deletions?/Optimized? No/No
Index format: -11 (Lucene 3.1)
Index functionality: lock-less, single norms, shared doc store, checksum, del 
count, omitTf, user data, diagnostics, hasVectors
TermInfos index divisor: N/A
Directory implementation: org.apache.lucene.store.MMapDirectory

Fields
Name          Term Count        %
body           3,391,277   69.93%
docdate            1,160    0.02%
docdatenum       872,060   17.98%
docid            200,000    4.12%
docname          200,000    4.12%
doctimesecn       82,231     1.7%
doctitle         102,467    2.11%


2.9.4
Number of fields: 5
Number of documents: 200,000
Number of terms: 4,760,747
Has deletions?/Optimized? No/No
Index format: -9 (Lucene 2.9)
Index functionality: lock-less, single norms, shared doc store, checksum, del 
count, omitTf, user data, diagnostics
TermInfos index divisor: N/A
Directory implementation: org.apache.lucene.store.MMapDirectory

Fields:
body           3,391,277   90.18%
docdate            1,160    0.03%
docid            200,000    5.32%
docname           65,843    1.75%
doctitle         102,467    2.77%

2.4.1

Number of fields: 4
Number of documents: 200,000
Number of terms: 3,694,904
Has deletions?/Optimized? No/No
Index format: -7 (Lucene 2.4)
Index functionality: lock-less, single norms, shared doc store, checksum, del 
count, omitTf
Directory implementation: org.apache.lucene.store.MMapDirectory

Fields
body           3,391,277   91.78%
docdate            1,160    0.03%
docid            200,000    5.41%
doctitle         102,467    2.77%
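
One way to cross-check the Luke summaries is to add up the per-field term counts and compare the sum with the reported "Number of terms". The Python sketch below does that with the figures listed above. The data layout and function name are illustrative, and field names are written as they appear in this message.

```python
# Term counts per field and the reported totals, copied from the Luke summaries
indexes = {
    "3.5.0": {
        "reported_terms": 4849195,
        "fields": {
            "body": 3391277, "docdate": 1160, "docdatenum": 872060,
            "docid": 200000, "docname": 200000, "doctimesecn": 82231,
            "doctitle": 102467,
        },
    },
    "2.9.4": {
        "reported_terms": 4760747,
        "fields": {
            "body": 3391277, "docdate": 1160, "docid": 200000,
            "docname": 65843, "doctitle": 102467,
        },
    },
    "2.4.1": {
        "reported_terms": 3694904,
        "fields": {
            "body": 3391277, "docdate": 1160, "docid": 200000,
            "doctitle": 102467,
        },
    },
}


def field_sum(version):
    """Sum of the per-field term counts for one index."""
    return sum(indexes[version]["fields"].values())


for version, info in indexes.items():
    total = field_sum(version)
    print(version, total, info["reported_terms"], total - info["reported_terms"])

# Terms that exist in the 3.5.0 index but not in the 2.4.1 index
extra_terms = field_sum("3.5.0") - field_sum("2.4.1")
print("extra terms in 3.5.0:", extra_terms)
```

The 3.5.0 and 2.4.1 sums match their reported totals, and the 3.5.0 index carries 1,154,291 more terms than 2.4.1, all from the three extra fields. For 2.9.4 the fields add up to 3,760,747, exactly 1,000,000 less than the reported 4,760,747. The 90.18% share for body agrees with the smaller figure, so the reported total may be a transcription slip.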


-Original Message-
From: Simon Willnauer [mailto:simon.willna...@googlemail.com] 
Sent: Tuesday, December 13, 2011 4:30 AM
To: java-user@lucene.apache.org
Subject: Re: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

hey,

so what I wonder in general is whether the benchmarks are comparable. What I 
mean is that the benchmark code has changed a lot since 2.4, so there might be 
additional fields and/or different settings for what to index and how.
Could you check with Luke whether the index has the same fields and whether the 
settings are the same or similar, and report back? I also wonder if it maybe 
now uses update instead of add, i.e. buffers and applies deletes etc.

simon


RE: Is indexing much slower in 3.5.0 than in 2.4.1 for Wikipedia data?

2011-12-13 Thread Sean Tong
Hi,

I modified the DocMaker in 3.5 so that it indexes the same 4 fields as 2.4.1 
does. Luke now shows very similar stats for the index. Indexing performance 
was slightly better than with 7 fields but is still not comparable with the 
2.4.1 performance:

 [java] > Report sum by Prefix (MAddDocs) and Round (3 about 3 out of 14)
 [java] Operation    round  flush  mrg  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem  avgTotalMem
 [java] MAddDocs_20      0  16.00   10       1          20    767.18      260.70  113,206,984  144,637,952
 [java] MAddDocs_20      1  16.00   10       1          20    801.61      249.50  117,778,992  144,637,952
 [java] MAddDocs_20      2  16.00   10       1          20    734.39      272.33  121,479,568  126,287,872

Maybe there are some other settings that make the benchmarks not comparable.
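
One thing that can be ruled out from the reports themselves is a difference in workload size. Multiplying rec/s by elapsedSec gives the number of records each round processed, and the Python sketch below does this for the 2.4.1 run and for the 3.5.0 run with the modified DocMaker. The numbers are copied from the reports in this thread; the names are illustrative only.

```python
# (rec/s, elapsedSec) per round, copied from the reports in this thread
runs = {
    "2.4.1": [(1609.1, 124.29), (1746.4, 114.52), (1566.8, 127.65)],
    "3.5.0 modified DocMaker": [(767.18, 260.70), (801.61, 249.50), (734.39, 272.33)],
}


def records_per_round(rounds):
    """Records processed in each round: throughput times elapsed time."""
    return [rate * seconds for rate, seconds in rounds]


for label, rounds in runs.items():
    print(label, [round(n) for n in records_per_round(rounds)])
```

Every round works out to about 200,000 records, which matches the "Number of documents: 200,000" that Luke reports. Both versions therefore indexed the same number of documents per round, and the gap is in throughput rather than in how much was indexed.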

Thanks,

Sean

3.5.0 Index Stats with modified DocMaker:

Number of fields: 4
Number of documents: 200,000
Number of terms: 3,694,904
Has deletions?/Optimized? No/No
Index format: -11 (Lucene 3.1)
Index functionality: lock-less, single norms, shared doc store, checksum, del 
count, omitTf, user data, diagnostics, hasVectors 
Directory implementation: org.apache.lucene.store.MMapDirectory

Fields
body           3,391,277   91.78%
docdate            1,160    0.03%
docid            200,000    5.41%
doctitle         102,467    2.77%

