Marvin Humphrey wrote:

More problematic than the "Modified UTF-8", actually, is the definition
of a Lucene String. According to the File Formats document, "Lucene
writes strings as a VInt representing the length, followed by the
character data." The word "length" is ambiguous in that context: it
could mean either a count of Java chars or a count of encoded bytes. [...]
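
To make the ambiguity concrete, here is a minimal Java demo (class name
hypothetical) showing that a char count and a UTF-8 byte count diverge
as soon as text leaves ASCII:

    import java.io.UnsupportedEncodingException;

    public class LengthDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // One Java char, but two bytes in UTF-8.
            String accented = "\u00e9";
            // A character outside the BMP: two Java chars (a surrogate
            // pair), but four bytes in UTF-8.
            String clef = new String(Character.toChars(0x1D11E));

            System.out.println(accented.length());                  // 1
            System.out.println(accented.getBytes("UTF-8").length);  // 2
            System.out.println(clef.length());                      // 2
            System.out.println(clef.getBytes("UTF-8").length);      // 4
        }
    }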

On May 1, 2006, at 7:33 PM, Chuck Williams wrote:

> Could someone summarize succinctly why it is considered a major issue
> that Lucene uses the Java modified UTF-8 encoding within its index
> rather than the standard UTF-8 encoding. Is the only concern
> compatibility with index formats in other Lucene variants? [...]

--- jian chen <[EMAIL PROTECTED]> wrote:

> Plus, as open source and open standard advocates, we don't want to be
> like Micros$ft, who claims to use industry "standard" XML as the next
> generation Word file format. However, it is very hard to write your
> own Word reader, because their Word file format is proprietary [...]

The benefits to a byte count are substantial, including:

1. Lazy fields can skip strings without reading them, as they do for
   all other value types (see the sketch after this list).
2. The file format could be changed to standard UTF-8 without any
   significant performance cost.
3. Any other index operation that [...]
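
As a sketch of point 1, assuming Lucene's IndexInput API (readVInt,
readByte, seek, getFilePointer) and a hypothetical helper class: a byte
count makes the skip a single seek, while a char count forces a scan
that decodes enough of each character to learn its width:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    class StringSkipper {
        // With a byte count: constant-time skip, no decoding.
        static void skipByByteCount(IndexInput in) throws IOException {
            int numBytes = in.readVInt();
            in.seek(in.getFilePointer() + numBytes);
        }

        // With a char count: every character must be decoded far enough
        // to learn its width (modified UTF-8 uses 1-3 bytes per Java
        // char, since surrogates are stored as separate chars).
        static void skipByCharCount(IndexInput in) throws IOException {
            int numChars = in.readVInt();
            for (int i = 0; i < numChars; i++) {
                byte b = in.readByte();
                if ((b & 0x80) != 0) {      // multi-byte sequence
                    in.readByte();          // second byte
                    if ((b & 0x20) != 0) {  // three-byte sequence
                        in.readByte();
                    }
                }
            }
        }
    }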

Hi, Doug,

I totally agree with what you said. Yeah, I think it is more of a file
format issue, less of an API issue. It seems that we just need to add
an extra constructor to Term.java that takes in a UTF-8 byte array.

Lucene 2.0 is going to break backward compatibility anyway, right? So,
maybe this [...]

Chuck Williams wrote:

For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem. If there is a way to beat this
problem, then I'd start arguing for a byte count. I think the way to [...]
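
For the write path, a byte count means the string has to be encoded
before its count is known. A minimal sketch of that ordering (class
name hypothetical; writeVInt follows Lucene's documented VInt format
of 7 bits per byte, with the high bit as a continuation flag):

    import java.io.IOException;
    import java.io.OutputStream;

    class ByteCountWriter {
        private final OutputStream out;

        ByteCountWriter(OutputStream out) { this.out = out; }

        void writeString(String s) throws IOException {
            byte[] utf8 = s.getBytes("UTF-8");  // encode first ...
            writeVInt(utf8.length);             // ... so the byte count is known
            out.write(utf8);
        }

        private void writeVInt(int i) throws IOException {
            while ((i & ~0x7F) != 0) {
                out.write((i & 0x7F) | 0x80);   // more bytes follow
                i >>>= 7;
            }
            out.write(i);                       // last byte, high bit clear
        }
    }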

Hi Jian,

I agree with you about Microsoft. It's a standard ploy to put window
dressing on stuff to combat competition, in this case from the
OpenDocument standard.

So the UTF-8 concern is interoperability with other programs at the
index level. An interesting question here is whether the Lucene [...]

Plus, as open source and open standard advocates, we don't want to be
like Micros$ft, who claims to use industry "standard" XML as the next
generation Word file format. However, it is very hard to write your own
Word reader, because their Word file format is proprietary and hard to
write program [...]

Hi, Chuck,

Using standard UTF-8 is very important for the Lucene index, so any
program could read the index easily, be it written in Perl, C/C++, or
any future programming language.

It is like storing data in a database for a web application. You want
to store it in such a way that other programs [...]

Could someone summarize succinctly why it is considered a major issue
that Lucene uses the Java modified UTF-8 encoding within its index
rather than the standard UTF-8 encoding. Is the only concern
compatibility with index formats in other Lucene variants? The API to
the values is a String, which [...]
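
For reference, Java's modified UTF-8 differs from standard UTF-8 in
exactly two cases: U+0000 and characters above the BMP. A small demo
using DataOutputStream.writeUTF, which emits modified UTF-8 behind a
two-byte length prefix:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // A null char followed by one supplementary character.
            String s = "\u0000" + new String(Character.toChars(0x10400));

            byte[] standard = s.getBytes("UTF-8");
            // Standard UTF-8: 0x00, then one 4-byte sequence.

            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] modified = buf.toByteArray();
            // Modified UTF-8: C0 80 for U+0000, then two 3-byte
            // sequences, one per surrogate char.

            System.out.println(standard.length);      // 5
            System.out.println(modified.length - 2);  // 8 (minus length prefix)
        }
    }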

Hi, Marvin,

Thanks for your quick response. I am in the camp of fearless
refactoring, even at the expense of breaking compatibility with
previous releases. ;-)

Compatibility aside, I am trying to identify whether changing the
implementation of Term is the right way to go for this problem. If it
is, [...]

On May 1, 2006, at 6:27 PM, jian chen wrote:

> This way, for indexing new documents, the new Term(String text) is
> called and utf8bytes will be obtained from the input term text. For
> segment term info merge, the utf8bytes will be loaded from the Lucene
> index, which already stores the term text [...]
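
A hedged sketch of that two-constructor idea (the class and names below
are illustrative, not Lucene's actual Term):

    import java.io.UnsupportedEncodingException;

    class Utf8Term {
        private final String field;
        private final byte[] utf8Bytes;

        // Indexing path: encode the incoming term text.
        Utf8Term(String field, String text) throws UnsupportedEncodingException {
            this(field, text.getBytes("UTF-8"));
        }

        // Merge path: the bytes were already stored in the index,
        // so no re-encoding is needed.
        Utf8Term(String field, byte[] utf8Bytes) {
            this.field = field;
            this.utf8Bytes = utf8Bytes;
        }

        String text() throws UnsupportedEncodingException {
            return new String(utf8Bytes, "UTF-8");  // decode on demand
        }
    }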