Build failed in Hudson: Lucene-Nightly #144

2007-07-06 Thread hudson
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/144/changes Changes: [mikemccand] LUCENE-843: add missing 'synchronized' to allThreadsIdle() method [yonik] replace div with shift since idiv takes ~40 cycles and compiler can't do strength reduction w/o knowing ops are non-negat
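
The commit note above is cut off, but its remark about division and shifts can be illustrated in plain Java. This is a minimal sketch, not the committed Lucene change; the class and method names are hypothetical. It shows that `x / 2` and `x >> 1` agree for non-negative values and can differ for negative ones, so a shift is only a safe substitute when the operand is known to be non-negative.

```java
// Toy comparison of integer division and arithmetic shift; names are hypothetical.
class ShiftVsDivide {
    // Integer division truncates toward zero.
    static int divide(int x) {
        return x / 2;
    }

    // Arithmetic right shift rounds toward negative infinity.
    static int shift(int x) {
        return x >> 1;
    }

    public static void main(String[] args) {
        for (int x : new int[] {7, 8, -7, -8}) {
            System.out.println(x + ": x/2=" + divide(x) + " x>>1=" + shift(x));
        }
    }
}
```

For odd negative inputs the two results differ by one, which is why the replacement has to be made by hand where the author knows the sign of the operands.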

Scaling Lucene to 500 million documents - preferred architecture

2007-07-06 Thread muraalee
Hi Everybody, We are building a search infrastructure using lucene to scale up to 500 million documents with search < 500 ms. Here is my rough math on the size of content & index: Total Documents = 500 million documents Size / Document = 10k / document Index Size / Million = 2 GB / million docum
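
The figures quoted above can be multiplied out as a quick sanity check. This is a rough sketch only: it assumes "10k" means 10,000 bytes and that index size grows linearly with document count, neither of which is confirmed by the truncated message. The class and method names are hypothetical.

```java
// Back-of-the-envelope sizing from the numbers in the message; names are hypothetical.
class IndexSizing {
    // Total raw content in bytes, assuming a fixed size per document.
    static long totalContentBytes(long docs, long bytesPerDoc) {
        return docs * bytesPerDoc;
    }

    // Total index size in GB, assuming linear growth per million documents.
    static long totalIndexGb(long docs, long gbPerMillionDocs) {
        return (docs / 1_000_000L) * gbPerMillionDocs;
    }

    public static void main(String[] args) {
        long docs = 500_000_000L;
        System.out.println("content bytes: " + totalContentBytes(docs, 10_000L));
        System.out.println("index GB: " + totalIndexGb(docs, 2L));
    }
}
```

Under those assumptions the content comes to about 5 TB (decimal) and the index to about 1,000 GB.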

RAM Directory doesn't work for index size > 8 GB

2007-07-06 Thread muraalee
Hi, We are facing a strange problem with RAMDirectory for indices greater than 8 GB. We have indexed around 6.5 million lucene documents and the index size is around 8 GB. Below are the contents of the index directory. 2236964197 _1x.fdt 51811488 _1x.fdx 293 _1x.fnm 2234929832 _1x.f
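
One observation about the listing above, offered without any claim that it explains the reported problem: the largest file sizes exceed the range of a Java `int`. The sketch below checks the three fully listed sizes against that range; the class and method names are hypothetical, and nothing here reflects RAMDirectory's actual internals.

```java
// Checks whether file sizes from the listing fit in a 32-bit signed int; names are hypothetical.
class FileSizeRange {
    // True when the value can be stored in an int without loss.
    static boolean fitsInInt(long n) {
        return n >= Integer.MIN_VALUE && n <= Integer.MAX_VALUE;
    }

    // Narrowing cast; values outside the int range wrap around.
    static int narrow(long n) {
        return (int) n;
    }

    public static void main(String[] args) {
        long[] sizes = {2236964197L, 51811488L, 293L};
        for (long size : sizes) {
            System.out.println(size + " fits in int: " + fitsInInt(size)
                    + ", narrowed: " + narrow(size));
        }
    }
}
```

Any code that tracks offsets into such files would need `long` arithmetic throughout to avoid wrap-around.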

Re: Spliting index

2007-07-06 Thread Doug Cutting
You can implement a FilterIndexReader that returns only a subset of an index. Then use IndexWriter#addIndexes() to add this to a new, empty index. Do this for each range of terms. This is somewhat similar to Nutch's IndexSorter: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/ap
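
The idea of carving an index into term ranges can be pictured with a toy model. The sketch below does not use any Lucene classes: it stands in a `TreeMap` from term to document ids for the index, and copying a sub-range into a new map stands in for adding a filtered view to a new, empty index. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of splitting a term-sorted index by term range; not Lucene code.
class TermRangeSplit {
    // Copies the entries whose terms fall in [from, to) into a new map.
    static SortedMap<String, List<Integer>> slice(
            TreeMap<String, List<Integer>> index, String from, String to) {
        return new TreeMap<>(index.subMap(from, to));
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> index = new TreeMap<>();
        index.put("apple", new ArrayList<>(Arrays.asList(1, 4)));
        index.put("banana", new ArrayList<>(Arrays.asList(2)));
        index.put("mango", new ArrayList<>(Arrays.asList(3)));
        index.put("zebra", new ArrayList<>(Arrays.asList(1)));

        // One pass per term range, each producing its own smaller "index".
        System.out.println(slice(index, "a", "m").keySet());
        System.out.println(slice(index, "m", "zz").keySet());
    }
}
```

In the real approach the per-range filtering would live in the FilterIndexReader subclass, and the copy step would be the IndexWriter#addIndexes() call mentioned above.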

Re: Author Tags

2007-07-06 Thread Doug Cutting
DM Smith wrote: My question is whether contrib should have a separate policy? If the @author is removed from the file, should we make sure that there is a CREDITS.txt for the contrib with the info in it. Credit isn't file-by-file, it's commit-by-commit and recorded in both Jira and in CHANGES

Re: for a better spellchecker

2007-07-06 Thread J. Delgado
Instead of "overriding" the trigram approach you may want to do a combination. That is create trigrams out of the list of words from the dictionary and weigh the matches much higher than those coming from the index or even have a first dictionary exact lookup and then a trigram/index based lookup

for a better spellchecker

2007-07-06 Thread Mathieu Lecarme
Now, SpellChecker uses the trigram algorithm to find similar words. It works well for keyboard fumbles, but not well enough for short words and for languages like French where the same sound can be written differently. Spellchecking is a classical computer task, and aspell provides some nice and
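
To make the trigram idea concrete, here is a plain-Java sketch. It is not the SpellChecker implementation: the class and method names are hypothetical, and the similarity measure (Jaccard overlap of distinct trigrams) is chosen only for illustration. It also shows why very short words are a weak spot, since a two-letter word yields no trigrams at all.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative trigram extraction and overlap score; names are hypothetical.
class Trigrams {
    // Returns every 3-character window of the word, in order.
    static List<String> trigrams(String word) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            out.add(word.substring(i, i + 3));
        }
        return out;
    }

    // Jaccard overlap of distinct trigrams; 0.0 when neither word has any.
    static double similarity(String a, String b) {
        Set<String> shared = new HashSet<>(trigrams(a));
        Set<String> other = new HashSet<>(trigrams(b));
        Set<String> union = new HashSet<>(shared);
        union.addAll(other);
        if (union.isEmpty()) {
            return 0.0;
        }
        shared.retainAll(other);
        return (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(trigrams("lucene"));
        System.out.println(similarity("lucene", "lucine"));
        System.out.println(trigrams("of"));
    }
}
```

A single changed letter in a six-letter word already removes most shared trigrams, and words shorter than three letters cannot be matched this way at all.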

Re: Benchmarking Contrib

2007-07-06 Thread Chris Hostetter
: Interesting question... I guess we haven't had one contrib depend on : another yet, or at least, I haven't checked to see if we have. we do actually, a good example is xml-query-parser depending on the queries contrib. the current best practice for doing this is to set a property in your cont

Re: Author Tags

2007-07-06 Thread DM Smith
As a user of Lucene, I can go either way. With an active developer community their need is lessened. The greatest value I have found in them is being able to track down "duplicate" bugs. If I find a particular bug in one piece of code, I try to find other places where the same bug exists, on

[jira] Reopened: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

2007-07-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-843: --- Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Re-opening this

Re: Author Tags

2007-07-06 Thread Michael McCandless
+1 I think it makes sense to remove them at one fell swoop and also discourage adding them going forward? Mike "Tom White" <[EMAIL PROTECTED]> wrote: > Hadoop recently removed all @author tags: > https://issues.apache.org/jira/browse/HADOOP-1147. > > Tom > > On 05/07/07, Grant Ingersoll <[EMA

Re: Author Tags

2007-07-06 Thread Tom White
Hadoop recently removed all @author tags: https://issues.apache.org/jira/browse/HADOOP-1147. Tom On 05/07/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Solr just suggested (http://www.mail-archive.com/solr- [EMAIL PROTECTED]/msg04883.html) that they remove Author tags for a variety of good rea