Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-05 Thread Doug Cutting
Marvin Humphrey wrote: More problematic than the "Modified UTF-8" actually, is the definition of a Lucene String. According to the File Formats document, "Lucene writes strings as a VInt representing the length, followed by the character data." The word "length" is ambiguous in that context,
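To make the ambiguity concrete, here is a minimal sketch (not taken from the thread) of three quantities that "length" could plausibly mean for the same string: Java chars (UTF-16 code units), Unicode code points, and standard UTF-8 bytes. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: three candidate meanings of a string's "length".
class LengthAmbiguity {
    /** Number of Java chars, i.e. UTF-16 code units. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes when encoded as standard UTF-8. */
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII only, one accented letter, one supplementary character (U+1D11E).
        String[] samples = {"lucene", "caf\u00e9", "\uD834\uDD1E"};
        for (String s : samples) {
            System.out.println("chars=" + charCount(s)
                    + " codePoints=" + codePointCount(s)
                    + " utf8Bytes=" + utf8ByteCount(s));
        }
    }
}
```

The three counts agree for pure ASCII text and diverge as soon as accented or supplementary characters appear, which is why a reader of the file format needs to know which one the prefix holds.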

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-03 Thread Marvin Humphrey
On May 1, 2006, at 7:33 PM, Chuck Williams wrote: > Could someone summarize succinctly why it is considered a major issue that Lucene uses the Java modified UTF-8 encoding within its index rather than the standard UTF-8 encoding. Is the only concern compatibility with index formats in ot

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Tatu Saloranta
--- jian chen <[EMAIL PROTECTED]> wrote: > Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industrial "standard" XML as the next generation word file format. However, it is very hard to write your own Word reader, because their wo

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
The benefits to a byte count are substantial, including:
1. Lazy fields can skip strings without reading them, as they do for all other value types.
2. The file format could be changed to standard UTF-8 without any significant performance cost.
3. Any other index operation that
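The first benefit can be sketched as follows. This is an illustration under stated assumptions, not Lucene code: it assumes a hypothetical string format consisting of a VInt byte count followed by standard UTF-8 bytes, with the VInt written seven bits per byte, low-order bits first, and the high bit marking continuation. All class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a byte-count-prefixed string format; not Lucene's I/O classes.
class SkipStringSketch {
    /** Writes 7 bits per byte, low-order first; the high bit means "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Assumed format: VInt byte count, then the standard UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    static String readString(ByteBuffer in) {
        byte[] bytes = new byte[readVInt(in)];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Skips a string without decoding it: read the count, then jump ahead. */
    static void skipString(ByteBuffer in) {
        int byteCount = readVInt(in);
        in.position(in.position() + byteCount);
    }

    /** Writes two strings, skips the first, and returns the second. */
    static String secondOf(String first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, first);
        writeString(out, second);
        ByteBuffer in = ByteBuffer.wrap(out.toByteArray());
        skipString(in);
        return readString(in);
    }
}
```

With a char-count prefix, `skipString` could not jump directly: it would have to inspect each encoded character to learn how many bytes it occupies, which amounts to reading the string.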

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread jian chen
Hi, Doug, I totally agree with what you said. Yeah, I think it is more of a file format issue, less of an API issue. It seems that we just need to add an extra constructor to Term.java to take in a UTF-8 byte array. Lucene 2.0 is going to break backward compatibility anyway, right? So, maybe this

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Doug Cutting
Chuck Williams wrote: For lazy fields, there would be a substantial benefit to having the count on a String be an encoded byte count rather than a Java char count, but this has the same problem. If there is a way to beat this problem, then I'd start arguing for a byte count. I think the way to

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-02 Thread Chuck Williams
Hi Jian, I agree with you about Microsoft. It's a standard ploy to put window dressing on stuff to combat competition, in this case from the open document standard. So the UTF-8 concern is interoperability with other programs at the index level. An interesting question here is whether the Lucen

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Plus, as open source and open standard advocates, we don't want to be like Micros$ft, who claims to use industrial "standard" XML as the next generation word file format. However, it is very hard to write your own Word reader, because their word file format is proprietary and hard to write program

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Chuck, Using standard UTF-8 is very important for the Lucene index so that any program can read the index easily, be it written in Perl, C/C++ or any future programming language. It is like storing data in a database for a web application. You want to store it in such a way that other pro

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Chuck Williams
Could someone summarize succinctly why it is considered a major issue that Lucene uses the Java modified UTF-8 encoding within its index rather than the standard UTF-8 encoding? Is the only concern compatibility with index formats in other Lucene variants? The API to the values is a String, which
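For readers unfamiliar with the difference, the sketch below compares the two encodings byte for byte. It uses the JDK's `DataOutputStream.writeUTF` purely as a convenient stand-in for Java's modified UTF-8; that is an assumption of this example and not a claim about how Lucene's own writers are implemented. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison of Java's modified UTF-8 with standard UTF-8.
class ModifiedUtf8Demo {
    /** Bytes written by DataOutputStream.writeUTF, without its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes of the standard UTF-8 encoding. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Plain ASCII, U+0000, and a supplementary character (U+1D11E).
        String[] samples = {"abc", "\u0000", "\uD834\uDD1E"};
        for (String s : samples) {
            System.out.println("modified=" + Arrays.toString(modifiedUtf8(s))
                    + " standard=" + Arrays.toString(standardUtf8(s)));
        }
    }
}
```

The two encodings agree except in two cases: U+0000 becomes the two bytes C0 80 in the modified form, and a supplementary character is written as two three-byte surrogates (six bytes) instead of a single four-byte sequence. A strict standard UTF-8 decoder in another language would reject both.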

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread jian chen
Hi, Marvin, Thanks for your quick response. I am in the camp of fearless refactoring, even at the expense of breaking compatibility with previous releases. ;-) Compatibility aside, I am trying to identify if changing the implementation of Term is the right way to go for this problem. If it is,

Re: storing term text internally as byte array and bytecount as prefix, etc.

2006-05-01 Thread Marvin Humphrey
On May 1, 2006, at 6:27 PM, jian chen wrote: This way, for indexing new documents, the new Term(String text) is called and utf8bytes will be obtained from the input term text. For segment term info merge, the utf8bytes will be loaded from the Lucene index, which already stores the term text
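The two-constructor idea quoted above could look roughly like the sketch below. The class `TermSketch`, its fields, and its accessors are hypothetical and are not Lucene's `Term`; the sketch also assumes standard UTF-8, which is what the proposal argues for rather than what the index stored at the time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a term that keeps its text as UTF-8 bytes; not Lucene's Term class.
final class TermSketch {
    private final String field;
    private final byte[] utf8bytes;
    private String text; // decoded lazily when built from bytes

    /** Indexing path: the caller has text, so encode it once. */
    TermSketch(String field, String text) {
        this.field = field;
        this.text = text;
        this.utf8bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    /** Merge path: bytes come straight from the index, so no decoding is needed up front. */
    TermSketch(String field, byte[] utf8bytes) {
        this.field = field;
        this.utf8bytes = utf8bytes.clone();
    }

    String field() {
        return field;
    }

    /** Returns a copy so callers cannot mutate the term's bytes. */
    byte[] utf8bytes() {
        return utf8bytes.clone();
    }

    /** Decodes only on first request. */
    String text() {
        if (text == null) {
            text = new String(utf8bytes, StandardCharsets.UTF_8);
        }
        return text;
    }
}
```

A term created during a merge with `new TermSketch("body", bytesFromIndex)` would never pay for decoding unless something calls `text()`, which is the saving the quoted message is after.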