Re: Retrieve Compressed Fields

2008-03-26 Thread Sebastin
Hi All, I am trying to store a String variable with Field.Store.COMPRESS. During search, is there any built-in method to uncompress these records, or should we use some other algorithm to retrieve them? -- View this message in context: http://www.nabble.com/Re%3ARetreive-Compressed-Fie

How to get a term weight (tf*idf)?

2008-03-26 Thread dillongeo
Hi all, Given a term (e.g. "apple") and a document in the index, how can I get the term's weight in this document? Is this weight equal to the tf*idf value of the term? Thanks! -- View this message in context: http://www.nabble.com/How-to-get-the-a-term-weight-%28tf*idf%29--tp16321424p16321424.html
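For readers wondering what a plain tf*idf weight looks like in code, here is a minimal sketch. It assumes the square-root tf and logarithmic idf shapes of Lucene's classic default similarity as I understand them; the class and method names (`TfIdfSketch`, `tf`, `idf`, `weight`) are hypothetical and not part of any Lucene API. A real Lucene score also folds in other factors such as length normalization and boosts, so it will generally not equal this number.

```java
// Hypothetical helper, not a Lucene class: a bare tf*idf weight.
final class TfIdfSketch {

    // Term-frequency factor: square root of the raw in-document count
    // (assumed shape of the classic default similarity).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse-document-frequency factor: rarer terms get a larger value.
    // The +1 in the denominator avoids division by zero for unseen terms.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // The combined weight of one term in one document.
    static double weight(int freq, int docFreq, int numDocs) {
        return tf(freq) * idf(docFreq, numDocs);
    }
}
```

Here `freq` is how often the term occurs in the document, `docFreq` is the number of documents containing the term, and `numDocs` is the index size.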

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Michael McCandless
[Lucas sent me a zip of the index - thanks!] I ran CheckIndex on the index and it said this on your _al1 segment: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1000 at org.apache.lucene.util.BitVector.get(BitVector.java:72) at org.apache.lucene.index.Segment

Re: AW: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-26 Thread Jay
Thanks, Uwe, for your clarification and for sharing your experience which is very helpful! Jay Uwe Goetzke wrote: Hi Jay, Sorry for the confusion, I wrote NgramStemFilter in an early stage of the project which is essentially the same as NGramTokenFilter from Otis with the addition that I add

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira
100% Impossible... My index has 1 xml, 3 number fields, 1 alphanumeric field. *always* :-) Lucas Michael McCandless wrote: OK. I would recommend upgrading to 2.3.1. There were some corruption issues with term vectors that could cause the wrong document's term vectors to come back. Tha

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Michael McCandless
OK. I would recommend upgrading to 2.3.1. There were some corruption issues with term vectors that could cause the wrong document's term vectors to come back. That screen shot is spooky! Is it possible that one of the documents you indexed had that content? (It could simply be a store

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira
LOL, I know Take a look, editing the cfs file: http://img296.imageshack.us/my.php?image=indexow4.jpg []s, Lucas Yonik Seeley wrote: On Wed, Mar 26, 2008 at 2:13 PM, Lucas F. A. Teixeira <[EMAIL PROTECTED]> wrote: one of the index files has these log messages from my application ser

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira
Thanks Michael! Lucene version: 2.3.0 Here is some screenshot of editing the cfs file: http://img296.imageshack.us/my.php?image=indexow4.jpg Take a look! []s, Lucas Michael McCandless wrote: OK I think I follow now. Which version of Lucene was this? If it's not too large, can you post

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Yonik Seeley
On Wed, Mar 26, 2008 at 2:13 PM, Lucas F. A. Teixeira <[EMAIL PROTECTED]> wrote: > one of the index files > has these log messages from my application server inside it, Wow! That's a new one... -Yonik

Is there a way to speed up boolean query if I don't care about score?

2008-03-26 Thread Wojtek H
Hi all, Suppose my query has a "normal" part for which I want scoring as usual, and another part which is a big disjunction (OR) query for which I just want documents to match and don't care about scoring. Is there a way to make it fast? As far as I understand if the 'no-score' part was the same in many querie

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Michael McCandless
OK I think I follow now. Which version of Lucene was this? If it's not too large, can you post the CFS file that got mixed up? Be sure to cc me directly on the mail because the mailing list software removes attachments. Mike Lucas F. A. Teixeira wrote: This is just one of the index fil

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira
This is just one of the index files. As I said, the local disk where the index is generated is not full; the full disk is the shared storage where my application server stores its logs. When this disk hit 100%, the whole indexing process stopped (of course, all the processing instances of th

Re: Improving Index Search Performance

2008-03-26 Thread Paul Elschot
Since you're using all the results for a query, and ignoring the score value, you might try and do the same thing with a relational database. But I would not expect that to be much faster, especially when using a field cache. Other than that, you could also go the other way, and try and add more

Re: Index "corruption" makes it return a different result

2008-03-26 Thread Michael McCandless
I couldn't quite follow the part about "_al1.cfs". It sounds like your indexing process hit a disk full event, that led to this error? Can you post the full exception(s) from the disk full? Which version of Lucene are you using? Mike Lucas F. A. Teixeira wrote: Hello all! I had a problem

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Ivan Vasilev

Re: The best way to iterate over document

2008-03-26 Thread Wojtek H
Thank you for the reply. What I did not mention before was that for iteration we don't care about scoring, so that's not the issue at all. Creating a Filter with a BitSet seems a much better idea than keeping a HitIterator in memory. Am I right that in such a case with MatchAllDocsQuery memory usage would be a

Re: random accessing term value

2008-03-26 Thread Erik Hatcher
You can store term vectors with positions too. Wouldn't that work for this case? Erik On Mar 25, 2008, at 11:59 PM, John Wang wrote: I am not sure how term vectors would help me. Term vectors are ordered by frequency, not in lex order. Since I know in the dictionary the terms are

Re: The best way to iterate over document

2008-03-26 Thread Erick Erickson
Why not keep a Filter in memory? It consists of a single bit per document and the ordinal position of that bit is the Lucene doc ID. You could create this reasonably quickly for the *first* query that came in via HitCollector. Then each time you wanted another chunk, use the filter to know which d
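The one-bit-per-document idea above can be sketched with nothing but `java.util.BitSet`. This is only an illustration under stated assumptions: `DocIdChunks` and `nextChunk` are hypothetical names, and the bit set is assumed to have been filled with matching doc IDs while collecting the first query's hits. It is not Lucene's Filter API.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper: walks a bit set of matching doc IDs in chunks.
final class DocIdChunks {

    // Returns up to chunkSize doc IDs whose bit is set, starting at fromDoc.
    // An empty array means there are no more matches at or after fromDoc.
    static int[] nextChunk(BitSet matches, int fromDoc, int chunkSize) {
        int[] buffer = new int[chunkSize];
        int count = 0;
        for (int doc = matches.nextSetBit(fromDoc);
             doc >= 0 && count < chunkSize;
             doc = matches.nextSetBit(doc + 1)) {
            buffer[count++] = doc;
        }
        return Arrays.copyOf(buffer, count);
    }
}
```

To fetch the following chunk, a caller would pass the last returned doc ID plus one as `fromDoc`. The memory cost stays at roughly one bit per document in the index, regardless of how many chunks are read.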

Re: Improving Index Search Performance

2008-03-26 Thread Ian Lea
Well, caching is designed to use memory. If you are saying that you haven't got enough memory to cache all your values then caching them all isn't going to work, at any level. If you implemented your own cache you could control memory usage with an LRU algorithm or whatever made sense for your app
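As an illustration of the "own cache with an LRU algorithm" suggestion, here is a minimal sketch built on the JDK's `LinkedHashMap` in access order. The class name `LruCache` and the `maxEntries` parameter are hypothetical; what gets cached (for example doc ID to field value) and a sensible size limit are left to the application.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: evicts the least recently used entry
// once more than maxEntries entries are stored.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most
        // recently accessed, which is what LRU eviction needs.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion.
        return size() > maxEntries;
    }
}
```

This class is not thread-safe; a searcher shared between threads would need to wrap it, for instance with `Collections.synchronizedMap`.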

Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira
Hello all! I had a problem this week, and I'd like to share it with you all. My WebLogic server that generates my index writes its logs to a shared storage. During my indexing process (SOLR+Lucene), this shared storage became 100% full, and everything collapsed (all servers that use this shared stor

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Mathieu Lecarme
Ivan Vasilev wrote: Thanks Mathieu, I tried to check out but without success. Anyway I can do it manually, but as the contribution is still not approved by Lucene our chiefs will not want it to be included in our project for now. It's the right decision. I hope the third patch will be good

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Ivan Vasilev
jets/revuedepresse/browser/trunk/src/java You can do a svn checkout. M.

Re: Improving Index Search Performance

2008-03-26 Thread Shailendra Mudgal
> The bottom line is that reading fields from docs is expensive. > FieldCache will, I believe, load fields for all documents but only > once - so the second and subsequent times it will be fast. Even > without using a cache it is likely that things will speed up because > of caching by the OS. A

setPositionIncrement questions

2008-03-26 Thread Itamar Syn-Hershko
Hi all, Breaking proximity data has been discussed several times before, and the conclusion was that setPositionIncrement is the way to go. Regarding that: 1. Where exactly should it be called to create the gap properly? 2. Is there a way to call it directly somehow while indexing (e.g. after adding

Re: Improving Index Search Performance

2008-03-26 Thread Toke Eskildsen
On Wed, 2008-03-26 at 10:45 +, Ian Lea wrote: > If you've got plenty of memory vs index size you could look at > RAMDirectory or MMapDirectory. Or how about some solid state disks? > Someone recently posted some very impressive performance stats. That was probably me. A (very) quick test for

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Mathieu Lecarme
Ivan Vasilev wrote: Thanks Mathieu for your help! The contribution that you have made to Lucene by this patch seems to be great, but the hunspell dictionary is under LGPL which the lawyer of our company does not like. It's the spell tool used by OpenOffice and Firefox. Data must be multi l

Re: Improving Index Search Performance

2008-03-26 Thread Ian Lea
Hi The bottom line is that reading fields from docs is expensive. FieldCache will, I believe, load fields for all documents but only once - so the second and subsequent times it will be fast. Even without using a cache it is likely that things will speed up because of caching by the OS. If you'

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Ivan Vasilev
Thanks Mathieu for your help! The contribution that you have made to Lucene by this patch seems to be great, but the hunspell dictionary is under LGPL which the lawyer of our company does not like. The WordNet dictionary seems to be more free and maybe could help together with your patch. In the

AW: AW: feedback: Indexing speed improvement lucene 2.2->2.3.1

2008-03-26 Thread Uwe Goetzke
Hi Jay, Sorry for the confusion, I wrote NgramStemFilter in an early stage of the project which is essentially the same as NGramTokenFilter from Otis with the addition that I add begin and end token markers (e.g. word becomes _word_ and so yields _w wo rd d_ ). As I modified a lot of our lucene co
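The boundary-marker idea can be sketched in a few lines of plain Java. This is not the NgramStemFilter or NGramTokenFilter code, which the message does not show; `BoundaryNGrams` and `ngrams` are hypothetical names, and `_` is assumed as the marker character from the example in the message. Note that a plain bigram pass over `_word_` also produces `or`, which the message's listing leaves out, possibly just a typo.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: character n-grams with begin/end markers.
final class BoundaryNGrams {

    // Wraps the word in '_' markers, then slides a window of size n over it.
    static List<String> ngrams(String word, int n) {
        String marked = "_" + word + "_";
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= marked.length(); i++) {
            grams.add(marked.substring(i, i + n));
        }
        return grams;
    }
}
```

The markers let a gram such as `_w` match only at the start of a word and `d_` only at the end, which plain n-grams cannot distinguish.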

The best way to iterate over document

2008-03-26 Thread Wojtek H
Hi all, our problem is to choose the best (the fastest) way to iterate over a huge set of documents (the basic and most important case is to iterate over all documents in the index). Some slow process accesses documents, and currently this is done by repeating a query (for instance MatchAllDocsQuery). It processe

Re: Improving Index Search Performance

2008-03-26 Thread Shailendra Mudgal
Hi All, Thanks for your reply. I would like to mention that the companyId is a multivalued field. I tried Paul's suggestions too, but there doesn't seem to be much gain. The searcher.doc() method still takes almost the same amount of time. > you can use the FieldCache to lookup the companyId for

Re: is it possible to change the way score from different field combine to give final lucene score

2008-03-26 Thread John Wang
Hi Grant: I don't see FunctionQuery in the javadocs. Can you post a link? Thanks -john On Mon, Mar 24, 2008 at 3:32 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > See the FunctionQuery and the org.apache.lucene.search.function > package. You can also implement your own query, as it's n