Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Ben van Klinken
Ok, haven't been following the 2.0 thing very well :) But we at CLucene are trying to get this stream thing going, so would like to do something which will be compatible with Java Lucene. So if there's something I can do with the reference version so that what we are doing isn't incompatible, it wo

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Doug Cutting
Ben van Klinken wrote: What's the chance of this making it into Lucene 2.0? Let me know if there's anything i can do to get this into Lucene 2. Lucene 2.0 is all but out the door. We're talking about Lucene 2.x or Lucene 3 here. Doug

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Ben van Klinken
What we really need is the ability to add "leading zeroes" to a VInt. I really like this idea! A VInt can then be written with a static length. Then in clucene we can implement our stream optimisations without any changes to the code logic. What's the chance of this making it into Lucene 2.0? L

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Yonik Seeley
On 5/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: Maybe we should consider loading differing subclasses of IndexInput/ IndexOutput based on the detected file format version? If this were C, I'd use function pointers. What's the best way to approximate that in Java? Nothing but subclassin
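
The subclass-per-format idea raised above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchInput`, `CharCountInput`, `ByteCountInput`, the leading version int, and the 4-byte length prefixes are all hypothetical names and layouts, not Lucene's `IndexInput` API or its actual file format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader base class; the format version picks the subclass. */
abstract class SketchInput {
    protected final ByteBuffer buf;

    SketchInput(ByteBuffer buf) {
        this.buf = buf;
    }

    /** Each format version decodes strings its own way. */
    abstract String readString();

    /** Assumed layout: a 4-byte version number precedes the data. */
    static SketchInput open(ByteBuffer buf) {
        int version = buf.getInt();
        return version >= 1 ? new ByteCountInput(buf) : new CharCountInput(buf);
    }
}

/** Assumed "old" format: prefix counts chars (stored here as raw UTF-16 for simplicity). */
final class CharCountInput extends SketchInput {
    CharCountInput(ByteBuffer buf) {
        super(buf);
    }

    @Override
    String readString() {
        int chars = buf.getInt();
        StringBuilder sb = new StringBuilder(chars);
        for (int i = 0; i < chars; i++) {
            sb.append(buf.getChar());
        }
        return sb.toString();
    }
}

/** Assumed "new" format: prefix counts bytes of standard UTF-8. */
final class ByteCountInput extends SketchInput {
    ByteCountInput(ByteBuffer buf) {
        super(buf);
    }

    @Override
    String readString() {
        int n = buf.getInt();
        byte[] bytes = new byte[n];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The caller only ever sees `SketchInput`, so the version check happens once at open time rather than on every read, which is roughly what function pointers would give in C.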

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Marvin Humphrey
On May 11, 2006, at 8:02 AM, Yonik Seeley wrote: Of course there is that *little* detail of backward compatibility ;-) There is that. :) Between using bytecounts as String prefixes, transitioning from modified UTF-8 to standard UTF-8, and potentially changing the definition of VInt, the

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Yonik Seeley
On 5/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: I believe that this is possible if we change the definition of VInt so that the high bytes are written first, rather than the low bytes. The "BER compressed integer" Great idea Marvin! The decoding could be slightly faster with reverse-byt
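
The high-bytes-first proposal quoted above can be illustrated with a small sketch. It assumes 7 data bits per byte with the top bit set on every byte except the last, so that padding bytes of 0x80 act as "leading zeroes" and a value can be written at a fixed width. The class and method names are hypothetical, and this is not the encoding Lucene actually shipped.

```java
final class PaddedVIntSketch {
    /** Minimum number of 7-bit groups needed for a non-negative value. */
    static int minLength(int value) {
        int n = 1;
        while ((value >>>= 7) != 0) {
            n++;
        }
        return n;
    }

    /** High-order group first; left-padded with 0x80 bytes up to {@code width}. */
    static byte[] encode(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value");
        }
        if (width < minLength(value)) {
            throw new IllegalArgumentException("width too small for value");
        }
        byte[] out = new byte[width];
        // Fill from the last byte backwards, 7 bits at a time.
        for (int i = width - 1; i >= 0; i--) {
            int group = value & 0x7F;
            value >>>= 7;
            // Every byte except the final one carries the continuation bit.
            out[i] = (byte) (i == width - 1 ? group : group | 0x80);
        }
        return out;
    }

    /** Reads groups until a byte without the continuation bit is seen. */
    static int decode(byte[] buf) {
        int value = 0;
        int i = 0;
        byte b;
        do {
            b = buf[i++];
            value = (value << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

For example, `encode(300, 2)` yields the bytes 0x82 0x2C, while `encode(300, 4)` yields 0x80 0x80 0x82 0x2C; both decode to 300 with the same loop, so a writer can reserve a fixed-width slot and fill in the length later.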

Re: VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Marvin Humphrey
On May 11, 2006, at 3:24 AM, Ben van Klinken wrote: Here is where the problem is, though: this is not possible currently because we are using a VInt for the field data length. What we really need is the ability to add "leading zeroes" to a VInt. I believe that this is possible if we change t

VInt's as prefix. Was: bytecount as prefix

2006-05-11 Thread Ben van Klinken
int? fieldsStream.readInt(): fieldsStream.readVInt()]; << CHANGE HERE ... <> string value; if ( dontUseVint ){ << I'm not completely sure about this section, since changes relating to 'by

Re: bytecount as prefix

2006-05-07 Thread Marvin Humphrey
Got it. This was the problem, in TermInfosWriter.writeTerm(): -lastTerm = term; +lastBytes = bytes; } Without lastTerm being updated, the auxiliary term dictionary got screwed up. This problem only manifested on large tests because small tests never moved past the first entry, which
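
Why a stale "last term" corrupts later entries can be seen in a generic prefix-delta sketch: each term is stored as the number of leading characters shared with the previous term plus the remaining suffix. The class below is hypothetical and simplified (text records instead of a binary file); it is not the `TermInfosWriter` code from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixDeltaSketch {
    private String last = "";
    private final List<String> records = new ArrayList<>();

    /** Stores "sharedPrefixLength:suffix" relative to the previous term. */
    void writeTerm(String term) {
        int shared = 0;
        int limit = Math.min(last.length(), term.length());
        while (shared < limit && last.charAt(shared) == term.charAt(shared)) {
            shared++;
        }
        records.add(shared + ":" + term.substring(shared));
        // If this update is skipped, every later record is computed against
        // the wrong predecessor and decodes to the wrong term.
        last = term;
    }

    List<String> records() {
        return records;
    }

    /** Rebuilds the terms by re-applying each suffix to the previous term. */
    static List<String> decode(List<String> records) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String r : records) {
            int colon = r.indexOf(':');
            int shared = Integer.parseInt(r.substring(0, colon));
            String term = prev.substring(0, shared) + r.substring(colon + 1);
            out.add(term);
            prev = term;
        }
        return out;
    }
}
```

With only one entry there is no predecessor to get wrong, which is consistent with the observation that small tests did not expose the problem.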

Re: bytecount as prefix

2006-05-06 Thread Marvin Humphrey
No progress yet. I think my next move is to do what I did when trying to get KinoSearch to write Lucene-compatible indexes: 1) Generate an optimized split-file format Lucene index from a pathological test corpus. 2) Hack KinoSearch so that it ought to produce an index which is identical

Re: bytecount as prefix

2006-05-06 Thread Marvin Humphrey
On Sat, May 06, 2006 at 05:11:02PM +0900, David Balmain wrote: > Hi Marvin, > > Where are you with this? I also have a vested interest in seeing > Lucene move to using byte counts. I was wondering if I could help out. > Is the patch you pasted here the latest you have? All I've added since then i

Re: bytecount as prefix

2006-05-06 Thread David Balmain
Hi Marvin, Where are you with this? I also have a vested interest in seeing Lucene move to using byte counts. I was wondering if I could help out. Is the patch you pasted here the latest you have? Cheers, Dave On 4/12/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: Greets, I'm back working on

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-05 Thread Doug Cutting
Marvin Humphrey wrote: More problematic than the "Modified UTF-8" actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context,
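
The ambiguity of "length" is easy to demonstrate: the same Java String has three different lengths depending on whether one counts chars, standard UTF-8 bytes, or modified UTF-8 bytes. The sketch below uses `DataOutputStream.writeUTF` as a stand-in for modified UTF-8; that is an assumption for illustration, since Lucene's own writer code is not shown in this thread. The class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StringLengthSketch {
    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of bytes in standard UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes in Java's modified UTF-8, as produced by writeUTF. */
    static int modifiedUtf8ByteCount(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            // writeUTF prepends its own 2-byte length, which is not payload.
            return buf.size() - 2;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the string `"a\u0000\uD834\uDD1E"` (the letter a, U+0000, and the supplementary character U+1D11E) the three methods return 4, 6, and 9 respectively: modified UTF-8 spends two bytes on U+0000 and encodes each surrogate separately as three bytes.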

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-03 Thread Marvin Humphrey
On May 1, 2006, at 7:33 PM, Chuck Williams wrote: > Could someone summarize succinctly why it is considered a > major issue that Lucene uses the Java modified UTF-8 > encoding within its index rather than the standard UTF-8 > encoding. Is the only concern compatibility with index > formats in ot

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Tatu Saloranta
--- jian chen <[EMAIL PROTECTED]> wrote: > Plus, as open source and open standard advocates, we > don't want to be like > Micros$ft, who claims to use industrial "standard" > XML as the next > generation word file format. However, it is very > hard to write your own Word > reader, because their wo

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
The benefits to a byte count are substantial, including: 1. Lazy fields can skip strings without reading them, as they do for all other value types. 2. The file format could be changed to standard UTF-8 without any significant performance cost 3. Any other index operation that
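
The first benefit listed above, skipping a string without reading it, can be sketched as follows. The example assumes a fixed 4-byte byte-count prefix for simplicity (the thread discusses a VInt prefix) and uses plain `java.io` streams rather than Lucene's classes; all names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class SkipStringSketch {
    /** Writes a byte count followed by standard UTF-8 bytes. */
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Skips a string using only its prefix; no character decoding happens. */
    static void skipString(DataInputStream in) throws IOException {
        int remaining = in.readInt();
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped <= 0) {
                throw new EOFException("stream ended inside a string");
            }
            remaining -= skipped;
        }
    }

    /** Reads a string written by writeString. */
    static String readString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Demo: writes two strings, skips the first, returns the second. */
    static String skipFirstReadSecond(String first, String second) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeString(out, first);
            writeString(out, second);
            out.flush();
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            skipString(in);
            return readString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a char-count prefix the reader cannot know how many bytes to skip when the text contains multi-byte characters, so it has to decode every character just to find where the string ends.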

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread jian chen
Hi, Doug, I totally agree with what you said. Yeah, I think it is more of a file format issue, less of an API issue. It seems that we just need to add an extra constructor to Term.java to take in a UTF-8 byte array. Lucene 2.0 is going to break backward compatibility anyway, right? So, maybe this

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Doug Cutting
Chuck Williams wrote: For lazy fields, there would be a substantial benefit to having the count on a String be an encoded byte count rather than a Java char count, but this has the same problem. If there is a way to beat this problem, then I'd start arguing for a byte count. I think the way to

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
Hi Jian, I agree with you about Microsoft. It's a standard ploy to put window dressing on stuff to combat competition, in this case from the open document standard. So the UTF-8 concern is interoperability with other programs at the index level. An interesting question here is whether the Lucen

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industrial "standard" XML as the next generation word file format. However, it is very hard to write your own Word reader, because their word file format is proprietary and hard to write program

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Chuck, Using standard UTF-8 is very important for Lucene index so any program could read the Lucene index easily, be it written in perl, c/c++ or any new future programming languages. It is like storing data in a database for web application. You want to store it in such a way that other pro

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Chuck Williams
Could someone summarize succinctly why it is considered a major issue that Lucene uses the Java modified UTF-8 encoding within its index rather than the standard UTF-8 encoding. Is the only concern compatibility with index formats in other Lucene variants? The API to the values is a String, which

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Marvin, Thanks for your quick response. I am in the camp of fearless refactoring, even at the expense of breaking compatibility with previous releases. ;-) Compatibility aside, I am trying to identify if changing the implementation of Term is the right way to go for this problem. If it is,

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Marvin Humphrey
On May 1, 2006, at 6:27 PM, jian chen wrote: This way, for indexing new documents, the new Term(String text) is called and utf8bytes will be obtained from the input term text. For segment term info merge, the utf8bytes will be loaded from the Lucene index, which already stores the term text

storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, All, Recently I have been following the whole discussion on storing text/strings as standard UTF-8 and how to achieve that in Lucene. If we are storing the term text and the field strings as UTF-8 bytes, I now understand that it is a tricky issue because of the performance problem we

Re: bytecount as prefix

2006-04-12 Thread Doug Cutting
Marvin Humphrey wrote: A phantom blank Term shows up out of nowhere in the middle of the merge process. When you stick a System.err.println into TermInfosWriter's writeTerm... Did you try putting a print statement in SegmentMergeInfo.next(), to see where this blank term comes from? Doug

Re: bytecount as prefix

2006-04-11 Thread Chris Hostetter
On Apr 11, 2006, at 12:05 PM, Marvin Humphrey wrote: > TestRangeFilter. A phantom blank Term shows up out of nowhere in the middle of the merge process. When you stick a System.err.println into TermInfosW

Re: bytecount as prefix

2006-04-11 Thread Marvin Humphrey
On Apr 11, 2006, at 12:05 PM, Marvin Humphrey wrote: TestRangeFilter. A phantom blank Term shows up out of nowhere in the middle of the merge process. When you stick a System.err.println into TermInfosWriter's writeTerm, you ordinarily see it adding Terms in proper sort order: [j

Re: bytecount as prefix

2006-04-11 Thread Marvin Humphrey
On Apr 11, 2006, at 2:27 PM, Marvin Humphrey wrote: "all but last", "all but first" and "all but ends" pass! Scratch that, it's totally untrue. I'd forgotten that these compound test cases bail as soon as there's a single failure. "all but last" also fails to return any docs at all. M

Re: bytecount as prefix

2006-04-11 Thread Marvin Humphrey
On Apr 11, 2006, at 2:08 PM, Yonik Seeley wrote: On 4/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: What do the failing tests have in common? On TestIndexModifier, only a small portion of the deletions fail, and they're all for fairly high values of delId -- sometimes the highest, but not

Re: bytecount as prefix

2006-04-11 Thread Yonik Seeley
On 4/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > What do the failing tests have in common? > > On TestIndexModifier, only a small portion of the deletions fail, and > they're all for fairly high values of delId -- sometimes the highest, > but not always. For RangeFilter and ConstantScoreRa

Re: bytecount as prefix

2006-04-11 Thread Marvin Humphrey
On Apr 11, 2006, at 12:18 PM, Doug Cutting wrote: Marvin Humphrey wrote: I'm back working on converting Lucene to using a byte count instead of a char count as a prefix at the head of each String. Three tests are failing: TestIndexModifier, TestConstantScoreRangeQuery, and TestRang

Re: bytecount as prefix

2006-04-11 Thread Doug Cutting
Marvin Humphrey wrote: I'm back working on converting Lucene to using a byte count instead of a char count as a prefix at the head of each String. Three tests are failing: TestIndexModifier, TestConstantScoreRangeQuery, and TestRangeFilter. Why those and not others? - private static f

bytecount as prefix

2006-04-11 Thread Marvin Humphrey
Greets, I'm back working on converting Lucene to using a byte count instead of a char count as a prefix at the head of each String. Three tests are failing: TestIndexModifier, TestConstantScoreRangeQuery, and TestRangeFilter. Why those and not others? Marvin Humphrey Rectangular Rese