Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Michael Sokolov
On 1/15/15 11:23 AM, danield wrote: Hi Mike, Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem, and this is because the Term Frequency and therefore the results will still be wrong, as prepending or appending a string to the term will still make it a

Re: index corruption with Lucene 2.9.4

2015-01-15 Thread Michael McCandless
Is your index on a remote file system? You could be hitting https://issues.apache.org/jira/browse/LUCENE-5541 Best to upgrade Lucene (we stopped relying on File.exists a while back). Mike McCandless http://blog.mikemccandless.com On Thu, Jan 15, 2015 at 12:10 PM, Ian Koelliker

RE: index corruption with Lucene 2.9.4

2015-01-15 Thread Ian Koelliker
Thanks for the information. We are going to be upgrading to a newer version of Lucene, but we cannot upgrade to version 4.x yet due to the fact that 4.x cannot read older index formats. The best we can do currently is upgrade to the latest 3.x which appears to have the same problem.

index corruption with Lucene 2.9.4

2015-01-15 Thread Ian Koelliker
Hello, We are seeing some weird instances of index corruption periodically when using Lucene 2.9.4. There are two specific cases we are seeing. 1) We are using the compound format and have noticed that sometimes we get errors when searching noting that files are missing (i.e. .fnm, .fdt,

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that explanation more prominent, as I clearly missed it. Never mind, I am working on my own solution for this, through subclassing QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other classes. Cheers,

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
Hi Mike, Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem, and this is because the Term Frequency and therefore the results will still be wrong, as prepending or appending a string to the term will still make it a different term. Similarly, I could

Indexing

2015-01-15 Thread tomas.kalas
Hello, I have a question about how Lucene indexes. I have a sentence and tokenize it into tokens; does the index save only the tokens, or the original sentence too? When I want to see, for example, the sentence with id 1, does Lucene rebuild this sentence from the tokens that are saved in the index? Or is the sentence indexed too? And

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Jack Krupansky
File a Jira for this particular doc fix since it is significant and not just mere wordsmithing. Better yet, submit a patch since that's Javadoc, although the exact form of the doc fix might be debatable, so a general description of the problem should be sufficient, unless you feel motivated. --

Re: Indexing

2015-01-15 Thread Erick Erickson
Basically there is a stored fork and an indexed fork. If you specify the input should be stored, a verbatim copy is put in a special segment file with the extension .fdt. This is entirely orthogonal to indexing the tokens, which are what search operates on. So you can store and index, store but
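The stored/indexed split described in this reply can be pictured with a toy model. The sketch below is plain Java, not Lucene code: the class name `StoredVersusIndexedSketch`, its methods, and the whitespace-plus-lowercase tokenization are all assumptions made for illustration. It only shows that the verbatim copy and the searchable tokens live in two independent structures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only, not Lucene code: two independent in-memory maps stand in
// for the "stored fork" and the "indexed fork".
final class StoredVersusIndexedSketch {
    // Stored fork: verbatim copy of the input, keyed by document id.
    private final Map<Integer, String> stored = new HashMap<>();
    // Indexed fork: token -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text, boolean store, boolean index) {
        if (store) {
            stored.put(docId, text);
        }
        if (index) {
            // Assumed tokenization: lowercase, split on whitespace.
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    // Search only ever consults the indexed fork.
    Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    // Retrieval only ever consults the stored fork; null if nothing was stored.
    String storedValue(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        StoredVersusIndexedSketch idx = new StoredVersusIndexedSketch();
        idx.add(1, "Lucene stores and indexes", true, true);  // store and index
        idx.add(2, "Indexed only", false, true);              // index, do not store
        idx.add(3, "Stored only", true, false);               // store, do not index
        System.out.println(idx.search("only"));
        System.out.println(idx.storedValue(2));
        System.out.println(idx.storedValue(3));
    }
}
```

In this model a document that is indexed but not stored can be found by search, yet its original sentence cannot be returned, because the token map does not keep enough information to rebuild it.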

Re: Multi-valued field and numTerms

2015-01-15 Thread Michael McCandless
Normally Lucene will count your d1 as having length=2. However, if la was added as a synonym for los angeles, such that it overlaps its position, then the default similarity discounts that and will count it as length=1. But for that to work, the position of the 2nd token must be the same as the
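The length counting described in this reply can be sketched as a small toy calculation. This is plain Java rather than Lucene code: `FieldLengthSketch`, `Token`, and `fieldLength` are hypothetical names, and the sketch assumes each city value is emitted as a single token, which is how the length=2 and length=1 figures in the reply come out. A token with a position increment of 0 is treated as overlapping the previous token.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only, not Lucene code: counts field length with and without
// discounting tokens that overlap the previous token's position.
final class FieldLengthSketch {
    static final class Token {
        final String text;
        final int positionIncrement; // 0 means "same position as the previous token"

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    static int fieldLength(List<Token> tokens, boolean discountOverlaps) {
        int length = 0;
        int overlaps = 0;
        for (Token t : tokens) {
            length++;
            if (t.positionIncrement == 0) {
                overlaps++;
            }
        }
        return discountOverlaps ? length - overlaps : length;
    }

    public static void main(String[] args) {
        // Two separate values, each at its own position.
        List<Token> separate = Arrays.asList(
                new Token("la", 1),
                new Token("los angeles", 1));
        // "la" injected as a synonym at the same position as "los angeles".
        List<Token> synonym = Arrays.asList(
                new Token("los angeles", 1),
                new Token("la", 0));
        System.out.println(fieldLength(separate, true));
        System.out.println(fieldLength(synonym, true));
        System.out.println(fieldLength(synonym, false));
    }
}
```

With discounting enabled, the two separate values count as length 2 while the synonym arrangement counts as length 1; with discounting disabled, both count as 2.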

Multi-valued field and numTerms

2015-01-15 Thread rama44ster
Hi, I am using lucene to index documents that have a multivalued text field named ‘city’. Each document might have multiple values for this field, like la, los angeles etc. Assuming document d1 contains city = la ; city = los angeles document d2 contains city = la mirada document d3 contains city

Re: trouble with Collector and FieldCache

2015-01-15 Thread Ian Lea
How are you storing the id field? A wild guess might be that this error might be caused by having some documents with id stored, perhaps, as a StringField or TextField and some as an IntField. -- Ian. On Wed, Jan 14, 2015 at 2:07 PM, Sascha Janz sascha.j...@gmx.net wrote: hello, i am

Aw: Re: trouble with Collector and FieldCache

2015-01-15 Thread Sascha Janz
They are all stored like this: Field fid = new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO); fid.setStringValue(Integer.toString(id)); Because I reuse fid, I have to set the value this way. Sent: Thursday, 15 January 2015 at 13:12 From: Ian Lea

problem in setNextReader from FieldComparator

2015-01-15 Thread Victor Podberezski
I'm struggling with a migration from Lucene 2.4 to 2.9. I'm trying to migrate from SortComparatorSource to FieldComparator. I cannot make it work right, even after a lot of tests. I noted that inside the setNextReader method not all the stored field's terms are retrieved. For example I have one

Re: Multi-valued field and numTerms

2015-01-15 Thread Michael Sokolov
On 1/15/15 4:34 AM, rama44ster wrote: Hi, I am using lucene to index documents that have a multivalued text field named ‘city’. Each document might have multiple values for this field, like la, los angeles etc. Assuming document d1 contains city = la ; city = los angeles document d2 contains