Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Paul Elschot
Op Saturday 09 February 2008 02:00:02 schreef robert engels:
> Curious... on things like this, is it really worth adding (and  
> maintaining) Lucene's own sort, just to achieve a 1.5% performance
> increase. It is almost doubtful that you can even measure an
> improvement at that level, given all of the variables you can't control.
> 
> I see a LOT of code in Lucene that is very obtuse - mainly to gain  
> VERY small performance benefits.
> 
> Isn't there a compelling case to not worry about this stuff, and let
> the JVM people figure it out, and concentrate on writing clear, easy
> to understand code?

Well, what is a good way to allow the JVM people to figure it out?

Once they have figured it out, we can remove those little
optimizations.

But the trick is not to think in terms of 'we' and 'they'. There is
quite a bit of Apache licensed code in JVMs already.

> I think we are better off looking for data structure or algorithm  
> changes - these micro-improvements just lead to code bloat and
> maintenance headaches. I also think it is likely that future JVM
> generations will do them automatically anyway; any hand optimizing
> might actually reduce performance.

I don't like the bloat either, but I'll gladly admit to having copied
some code, adapted it a bit, and proposed to have that adapted
copy added back into the code base. I wish there was a better way.

Regards,
Paul Elschot




Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Michael Busch
robert engels wrote:
> Curious... on things like this, is it really worth adding (and
> maintaining) Lucene's own sort, just to achieve a 1.5% performance
> increase. It is almost doubtful that you can even measure an
> improvement at that level, given all of the variables you can't control.
> 

I somewhat agree with Robert here. DocumentsWriter is a quite
complicated class which already has two quicksort implementations,
and this patch adds a third one. Is it really so much more expensive
to, e.g., sort an Object[] array and pass in a Comparator?
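For reference, the JDK route being suggested here would look roughly
like this (a minimal sketch in the pre-generics Java of the time; the
Posting class is a stand-in, not the actual DocumentsWriter internals):

    import java.util.Arrays;
    import java.util.Comparator;

    // Stand-in for whatever DocumentsWriter sorts internally.
    class Posting {
      String term;
      Posting(String term) { this.term = term; }
    }

    class SortViaComparator {
      static void sortPostings(Posting[] postings) {
        // Arrays.sort on Object[] is a mergesort and allocates an
        // auxiliary array -- the cost Yonik points out later in this thread.
        Arrays.sort(postings, new Comparator() {
          public int compare(Object a, Object b) {
            return ((Posting) a).term.compareTo(((Posting) b).term);
          }
        });
      }
    }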

Don't get me wrong, I think this is very sophisticated code and it's
super fast, as the performance tests and also the user experiences
with 2.3 prove. However, I think especially in the Open Source world
one of our goals should be to write code that is easy to understand,
so that it's easier for new people to get on board. Finding a good
balance and trade-off between simplicity, functionality and
performance is not always easy. Of course, if a patch improves
performance by say 15%, I wouldn't hesitate to commit it. But if it's
just 1% and makes the code more complicated, I'm not so sure it's
worth it.

That being said, I wouldn't vote -1 against a patch like this one to
prevent someone from committing it, but I don't think I would
write/commit it myself. I'd just like to encourage everyone to also
think about code simplicity and readability before writing and
committing new code.

-Michael




[jira] Commented: (LUCENE-1169) Search with Filter does not work!

2008-02-09 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567306#action_12567306
 ] 

Eks Dev commented on LUCENE-1169:
-

Thanks for explaining it!

So we now have classes implementing DocIdSetIterator (OpenBitSetIterator,
SortedVIntList...) that are, strictly speaking, not conforming to the
specification for skipTo(). The side effects we had here are probably local to
this issue, but I somehow have a bad feeling about differently behaving
implementations of the same interface. Sounds paranoid, no :)

To make things better, new classes in core like e.g. OpenBitSet cover the case
you described, where the iterator is positioned one before the first document,
but they do not comply with the other side effects.

Mainly, invoking iterator.skipTo(anything <= iterator.doc()) should have the
same effect as next(), meaning the iterator gets moved not only on
iterator.skipTo(iterator.doc()) ...
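In other words, the stricter contract being argued for is something like this
(a hypothetical demo against OpenBitSet/OpenBitSetIterator; not actual Lucene
test code):

    import org.apache.lucene.util.OpenBitSet;
    import org.apache.lucene.util.OpenBitSetIterator;

    public class SkipToContractDemo {
      public static void main(String[] args) throws Exception {
        OpenBitSet bits = new OpenBitSet(100);
        bits.set(5); bits.set(17); bits.set(42);

        OpenBitSetIterator it = new OpenBitSetIterator(bits);
        it.next();                           // positioned on doc 5
        boolean more = it.skipTo(it.doc());  // target == doc()
        // Under the stricter contract, this must behave like next():
        // 'more' is true and it.doc() is 17, not still 5.
        System.out.println(more + " " + it.doc());
      }
    }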

To cut to the chase: should we attempt to fix all DocIdSetIterator
implementations to comply with these effects, or would it be enough to
document these differences as a "relaxed skipTo contract"? The current usage
of these classes is in Filter-related code and is practically a replacement
for BitSet iteration, therefore "under control". But if we move on to using
these classes tightly with Scorers, I am afraid we could expect off-by-one
and similar bugs.

Another option would be to change the specification and use this sentinel -1
approach, but honestly, this is way above my head to comment on...

  

> Search with Filter does not work!
> -
>
> Key: LUCENE-1169
> URL: https://issues.apache.org/jira/browse/LUCENE-1169
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Eks Dev
>Assignee: Michael Busch
>Priority: Blocker
> Attachments: lucene-1169.patch, TestFilteredSearch.java
>
>
> See attached JUnitTest, self-explanatory




Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Grant Ingersoll
I also agree w/ Robert and Michael, here.  While DocsWriter is really  
effective, it is very complicated to follow and it makes debugging and  
maintenance much harder.


-Grant

On Feb 9, 2008, at 5:03 AM, Michael Busch wrote:

> robert engels wrote:
>> Curious... on things like this, is it really worth adding (and
>> maintaining) Lucene's own sort, just to achieve a 1.5% performance
>> increase. It is almost doubtful that you can even measure an
>> improvement at that level, given all of the variables you can't control.
>
> I somewhat agree with Robert here. DocumentsWriter is a quite
> complicated class which already has two quicksort implementations,
> and this patch adds a third one. Is it really so much more expensive
> to, e.g., sort an Object[] array and pass in a Comparator?
>
> Don't get me wrong, I think this is very sophisticated code and it's
> super fast, as the performance tests and also the user experiences
> with 2.3 prove. However, I think especially in the Open Source world
> one of our goals should be to write code that is easy to understand,
> so that it's easier for new people to get on board. Finding a good
> balance and trade-off between simplicity, functionality and
> performance is not always easy. Of course, if a patch improves
> performance by say 15%, I wouldn't hesitate to commit it. But if it's
> just 1% and makes the code more complicated, I'm not so sure it's
> worth it.
>
> That being said, I wouldn't vote -1 against a patch like this one to
> prevent someone from committing it, but I don't think I would
> write/commit it myself. I'd just like to encourage everyone to also
> think about code simplicity and readability before writing and
> committing new code.
>
> -Michael



Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Yonik Seeley
On Feb 8, 2008 8:00 PM, robert engels <[EMAIL PROTECTED]> wrote:
> Curious... on things like this, is it really worth adding (and
> maintaining) Lucene's own sort,

Unfortunately, Java's sort on Object[] is a mergesort, and it
allocates an auxiliary array to support that.
Mike's latest tests show a 4% speedup on smaller documents, so I think
it's worth it.  While DocumentsWriter is certainly very complex, a
specific sort routine makes it no more complex IMO.

I wonder how well a single generic quickSort(Object[] arr, int low,
int high) would perform vs the type-specific ones?  I guess the main
overhead would be a cast from Object to the specific class to do the
compare?  Too bad Java doesn't have true generics/templates.
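For what it's worth, the single generic quickSort being wondered about might
look like this (a hypothetical sketch in pre-generics Java; the Comparator
call, with its cast inside compare(), is exactly the overhead anticipated
above):

    import java.util.Comparator;

    class GenericQuickSort {
      // Sorts arr[lo..hi] inclusive. The per-element Comparator call is
      // the cost relative to the type-specific versions in DocumentsWriter.
      static void quickSort(Object[] arr, int lo, int hi, Comparator c) {
        if (lo >= hi) return;
        Object pivot = arr[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
          while (c.compare(arr[i], pivot) < 0) i++;
          while (c.compare(arr[j], pivot) > 0) j--;
          if (i <= j) {
            Object tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
            i++; j--;
          }
        }
        quickSort(arr, lo, j, c);
        quickSort(arr, i, hi, c);
      }
    }

Unlike Arrays.sort, this allocates nothing and sorts in place; the open
question is whether the per-compare cast eats the savings.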

-Yonik




[jira] Updated: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-02-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-325:
--

Attachment: LUCENE-325.patch

Attached patch.  All tests pass.  I plan to commit in a day or two.

This adds two methods to IndexWriter:

  expungeDeletes() -- defaults to doWait=true
  expungeDeletes(boolean doWait)

If doWait is false, and you have a MergeScheduler that runs merges in
BG threads, then the call returns immediately.

I extended MergePolicy so it decides what "expunge deletes" really
means (findMergesToExpungeDeletes).  Then, in LogMergePolicy, I
implemented this policy: we merge all adjacent segments (up to
mergeFactor at once) that have deletes.  If only 1 segment has
deletes, it's a singular merge.
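Usage would presumably look like this (a sketch based on the description
above; the constructor shape approximates the 2.3-era API and the path is
illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class ExpungeDemo {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"), new StandardAnalyzer());

        writer.expungeDeletes();       // doWait=true: blocks until the
                                       // deletes-only merges complete
        writer.expungeDeletes(false);  // with a BG MergeScheduler, returns
                                       // immediately; merges run in BG threads
        writer.close();
      }
    }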


> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> LUCENE-325.patch, TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in
> segments to remove the "holes" created from doing deletes. The only way to
> do this is by calling IndexWriter.optimize(). This is a very heavy call; for
> the cases where the index is large but has a very small number of deleted
> docs, calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that
> have deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting
> a patch. I don't see where I can attach the patch. I will do according to
> the patch guideline and email the lucene mailing list.
> Thanks
> -John




Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Michael McCandless

I agree, there comes a point where the cost of added complexity is not
worth the gains, on balance.  Making that tradeoff is not easy.

I don't think the patch in LUCENE-1172 crosses that line: a 1.6% (4.1%
on small docs) top line gain is still a sizable gain.

The profiler points to many other, smaller things which I think are
below that line; I didn't pursue those.

I also agree that DocumentsWriter is complex now, and I'd definitely
like to simplify it with time, hopefully without losing too much
performance.

Believe it or not, earlier versions (on LUCENE-843) were more complex,
and I pared it down before committing it.  At one point I had a
specialized segment merger that would much more efficiently merge
"partial" segments flushed from RAM.  This was actually a fairly
sizable gain (maybe ~15% overall) when building large indices.  But it
also added sizable complexity, so I took it out.  I still think this
is eventually worthwhile (especially when autoCommit=false), but it
belongs with segment merging instead (this is why I opened
LUCENE-856).

Mike

Grant Ingersoll wrote:

> I also agree w/ Robert and Michael, here. While DocsWriter is really
> effective, it is very complicated to follow and it makes debugging
> and maintenance much harder.
>
> -Grant
>
> On Feb 9, 2008, at 5:03 AM, Michael Busch wrote:
>
>> robert engels wrote:
>>> Curious... on things like this, is it really worth adding (and
>>> maintaining) Lucene's own sort, just to achieve a 1.5% performance
>>> increase. It is almost doubtful that you can even measure an
>>> improvement at that level, given all of the variables you can't
>>> control.
>>
>> I somewhat agree with Robert here. DocumentsWriter is a quite
>> complicated class which already has two quicksort implementations,
>> and this patch adds a third one. Is it really so much more expensive
>> to, e.g., sort an Object[] array and pass in a Comparator?
>>
>> Don't get me wrong, I think this is very sophisticated code and it's
>> super fast, as the performance tests and also the user experiences
>> with 2.3 prove. However, I think especially in the Open Source world
>> one of our goals should be to write code that is easy to understand,
>> so that it's easier for new people to get on board. Finding a good
>> balance and trade-off between simplicity, functionality and
>> performance is not always easy. Of course, if a patch improves
>> performance by say 15%, I wouldn't hesitate to commit it. But if
>> it's just 1% and makes the code more complicated, I'm not so sure
>> it's worth it.
>>
>> That being said, I wouldn't vote -1 against a patch like this one to
>> prevent someone from committing it, but I don't think I would
>> write/commit it myself. I'd just like to encourage everyone to also
>> think about code simplicity and readability before writing and
>> committing new code.
>>
>> -Michael







[jira] Commented: (LUCENE-1172) Small speedups to DocumentsWriter

2008-02-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567326#action_12567326
 ] 

Michael McCandless commented on LUCENE-1172:


The above numbers were with the full docs from Wikipedia.  I expect
the optimization to be more effective with smaller docs, so I ran a
test indexing the first 10 million small (~100 character) Wikipedia docs.

Trunk took 257.5 sec (best of 3) and the patch took 246.9 sec (best of
3) = a 4.1% speedup.

Here's the alg I'm running:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  docs.file=/Volumes/External/lucene/wikifull100.txt
  doc.stored = true
  doc.term.vector = true
  doc.add.log.step=2

  directory=FSDirectory
  autocommit=false
  compound=false

  ram.flush.mb=64

  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      -CreateIndex
      { "AddDocs" AddDoc > : 1000
      -CloseIndex
    }
    NewRound
  } : 3

  RepSumByPrefRound BuildIndex
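(For anyone wanting to reproduce this: an .alg file like the above is run
through contrib/benchmark's byTask driver, roughly as below; the classpath
placeholder and file name are illustrative, not from the original mail.)

    java -cp <lucene-and-benchmark-classes> \
      org.apache.lucene.benchmark.byTask.Benchmark indexSmallDocs.alg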


> Small speedups to DocumentsWriter
> -
>
> Key: LUCENE-1172
> URL: https://issues.apache.org/jira/browse/LUCENE-1172
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1172.patch
>
>
> Some small fixes that I found while profiling indexing Wikipedia,
> mainly using our own quickSort instead of Arrays.sort.
> Testing first 200K docs of Wikipedia shows a speedup from 274.6
> seconds to 270.2 seconds.
> I'll commit in a day or two.




[jira] Commented: (LUCENE-1169) Search with Filter does not work!

2008-02-09 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567351#action_12567351
 ] 

Paul Elschot commented on LUCENE-1169:
--

Some of the bugs caused by this skipTo() behaviour are hard to catch:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35823

Basically the fix was to guard every invocation of skipTo() with a
target > doc() test when no advancing should be done.
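The guard in question looks something like this (a schematic sketch, not the
actual patch from that issue):

    import java.io.IOException;
    import org.apache.lucene.search.Scorer;

    class SkipToGuard {
      // Only call skipTo() when it would genuinely advance the scorer.
      static boolean guardedSkipTo(Scorer scorer, int target) throws IOException {
        if (target > scorer.doc()) {
          return scorer.skipTo(target);  // real forward skip
        }
        return true;  // already at or past target: treat as a no-op
      }
    }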

In the above case I still don't know what the exact cause was, as the last patch
added this guarding test in more than one place.

One way to 'fix' this is by adding to the javadoc of skipTo() that the
behaviour is undefined when target <= doc(), and that otherwise the old
behaviour applies. Implementations should then define the behaviour for
target <= doc(). This has the advantage that the only way to fix it is by
reviewing all the skipTo(targetDocId) code wherever the javadoc does not
completely define the behaviour of an implementation.

Another way to go about this is to consider target <= doc() on entry of
skipTo() a bug, and add something like this:

  assert (notInitialized && target >= 0) || target > doc();

at the entry of each skipTo() implementation in the trunk, and fix the bugs
as they show up.

For the moment I prefer the latter; it is a bit drastic, but it gets rid of
a lot of uncertainty. Anyway, when taking it that far, it's another issue.




> Search with Filter does not work!
> -
>
> Key: LUCENE-1169
> URL: https://issues.apache.org/jira/browse/LUCENE-1169
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Eks Dev
>Assignee: Michael Busch
>Priority: Blocker
> Attachments: lucene-1169.patch, TestFilteredSearch.java
>
>
> See attached JUnitTest, self-explanatory




Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-09 Thread Srikant Jakilinki

Hi Ning,

In continuation of our offline conversation, here is a public expression of
interest in your work and a description of ours. Sorry in advance for the
length, and I hope folks will be able to collaborate and/or share experiences
and/or give us some pointers...


1) We are trying to leverage Lucene on Hadoop for blog archiving and
searching, i.e. ever-increasing data (in terabytes) on commodity hardware in
a generic LAN. These machines are not high-spec, nor are they dedicated; they
are actually used within the lab by users for day-to-day tasks.
Unfortunately, Nutch and Solr are not applicable to our situation - at least
not directly. Think of us as an academically oriented Technorati.


2) There are 2 aspects. One is that we want to archive the blogposts that we
hit under a UUID/timestamp taxonomy. This archive can be used for many things
like cached copies, diffing, surf acceleration etc. The other aspect is to
archive the indexes. You see, the indexes have a lifecycle. For simplicity's
sake, an index consists of one day's worth of blogposts (roughly 15MM
documents) and follows the same taxonomy. Ideally, we want to store an
indefinite archive of blogposts and their indexes side-by-side, but 1 year or
365 days is a start.


3) We want to use the taxonomical name of the post as a specific ID field in
the Lucene index, and want to get away with not storing the content of the
post at all but only a file pointer/reference to it. This, we hope, will keep
the index sizes low, but the fact remains that this is a case of multiple
threads on multiple JVMs handling multiple indexes on multiple machines.
Further, the posts and indexes are mostly WORM, but there may be situations
where they have to be updated: for example, if some blog posts have edited
content, have to be removed for copyright, or are updated with metadata like
rank. There is some duplicate-detection work that has to be done here, but it
is out of scope for now. And oh, the lab is a mixed Linux-Windows environment.
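Concretely, a document built along these lines might look roughly like the
following (a sketch against the Lucene 2.3-era API; the field names, the UUID
and the HDFS path are illustrative assumptions, not settled design):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    String postText = "...";  // fetched post body; indexed, never stored
    Document doc = new Document();
    // Taxonomical UUID/timestamp name as the ID field: indexed, untokenized.
    doc.add(new Field("id", "550e8400-e29b/2008-02-09T12:00:00Z",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Only a pointer to the post in HDFS is stored, not the content.
    doc.add(new Field("ref", "hdfs://namenode/blogs/550e8400-e29b/post.html",
                      Field.Store.YES, Field.Index.NO));
    // Content is searchable but not stored, keeping the index small.
    doc.add(new Field("content", postText, Field.Store.NO, Field.Index.TOKENIZED));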


4) Our first port of call is to have Hadoop running on this group of machines
(without clustering or load balancing or grid or master/slave mumbo jumbo) in
the simplest way possible. The goal is to make applications see the bunch of
machines as a reliable, scalable, fault-tolerant, average-performing file
store with simple file CRUD operations. For example, the blog crawler should
be able to put the blogposts into this HDFS in live or in batch mode. With
about 20 machines, each installed with a 240GB drive for the experiment, we
have about 4.5 TB of storage available.


5) Next we want to handle Lucene and exploit the structure of its index and
the algorithms behind it. Since a Lucene index is a directory of files, we
intend to 'tag' the files as belonging to one index and store them on HDFS.
At any instant in time, an index can be regenerated and used. The regenerated
index is, however, not used directly from HDFS but copied into the local
filesystem of the indexer/searcher. This copy is subject to change, and every
once in a while the constituent files in HDFS are overwritten with the latest
files. Hence, naming is quite important to us. Even so, we feel that the
number of files that have to be updated is quite small, and that we can use
MD5 sums to make sure we only update the files whose content changed.
However, this means that out of the 4.5 TB available, we use half for
archival and the other half for searching. Even so, we should be able to
store a year's worth of posts and indexes. Disks are no problem.
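(The MD5 check described above needs nothing beyond the standard library; a
minimal sketch, with the HDFS-side bookkeeping left out:)

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    class IndexFileSync {
      // MD5 of one index file; copy the file only when the sums differ.
      static byte[] md5(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        FileInputStream in = new FileInputStream(path);
        try {
          byte[] buf = new byte[8192];
          int n;
          while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
          }
        } finally {
          in.close();
        }
        return md.digest();
      }
    }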


6) Right then. So we have (365*15MM) posts and (365*LFi) Lucene file segments
on the HDFS. Suppose there are N machines online; then each machine will have
to own 365/N indexes. N constantly keeps changing, but at any instant all 365
indexes should be live, and we are working on the best way to achieve this
kind of 'fair' autonomic computing cloud, where when a machine goes down the
other machines add some of its indexes to their kitty, and when a machine is
added it relieves the other machines of some indexes. The searcher runs on
each of these machines as a service (IP:port), and queries are served using a
ParallelMultiSearcher [across the machines] and a MultiSearcher [within a
machine], so that we need not have an unmanageable number of JVMs per
machine. At most: 1 for Hadoop, 1 for the cloud and 1 for search. We are
wondering if Solr can be used for the search side, if it supports multiple
indexes on the same machine.


As you can see, this is not a simple endeavour, and it is obvious, I suppose,
that we are still at the theory stage and only now getting to know the Lucene
projects better. There is a huge body of work here, albeit not acknowledged
in the scientific community as it should be, and I want to say kudos to all
who have been responsible for it.
I wish and hope to utilize the collective consciousness to mount our
challenge. Any pointers, code, help, collaboration et al. for any of the