Ok, haven't been following the 2.0 thing very well :)
But we at clucene are trying to get this stream thing going, so would
like to do something which will be compatible with java lucene.
So if there's something I can do with the reference version so that what
we are doing isn't incompatible, it wo
Ben van Klinken wrote:
What's the chance of this making it into Lucene 2.0? Let me know if
there's anything I can do to get this into Lucene 2.
Lucene 2.0 is all but out the door. We're talking about Lucene 2.x or
Lucene 3 here.
Doug
What we really need is the ability to add "leading zeroes" to a VInt.
I really like this idea! A VInt can then be written with a static length.
Then in clucene we can implement our stream optimisations without any
changes to the code logic.
What's the chance of this making it into Lucene 2.0? Let me know if
there's anything I can do to get this into Lucene 2.
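For illustration, here is a minimal sketch of what a fixed-length ("leading zeroes") VInt writer might look like, assuming Lucene's existing low-byte-first VInt encoding; the helper name writeFixedLengthVInt is hypothetical, not existing API. Padding bytes carry a zero payload with the continuation bit set, so a standard readVInt() still decodes the same value:

import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

public class FixedLengthVInt {
  // Hypothetical helper: write i as a VInt padded to exactly `width` bytes.
  public static void writeFixedLengthVInt(IndexOutput out, int i, int width)
      throws IOException {
    for (int n = 0; n < width - 1; n++) {
      out.writeByte((byte) ((i & 0x7F) | 0x80)); // 7 payload bits, continuation bit set
      i >>>= 7;
    }
    if ((i & ~0x7F) != 0) {
      throw new IllegalArgumentException("value does not fit in " + width + " bytes");
    }
    out.writeByte((byte) i); // final byte: continuation bit clear ends the VInt
  }
}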
On 5/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
Maybe we should consider loading differing subclasses of IndexInput/
IndexOutput based on the detected file format version? If this were
C, I'd use function pointers. What's the best way to approximate
that in Java?
Nothing but subclassing.
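To sketch the subclassing approach (hypothetical class names, not real Lucene API): detect the file-format version once, then hand out an implementation whose readString() matches that format, e.g. a char-count VInt for old segments and a byte-count VInt plus standard UTF-8 for new ones:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

interface StringDecoder {
  String readString(IndexInput in) throws IOException;
}

class CharCountStringDecoder implements StringDecoder {   // legacy: VInt = char count
  public String readString(IndexInput in) throws IOException {
    return in.readString();                                // existing modified-UTF-8 reader
  }
}

class ByteCountStringDecoder implements StringDecoder {   // proposed: VInt = UTF-8 byte count
  public String readString(IndexInput in) throws IOException {
    int numBytes = in.readVInt();
    byte[] bytes = new byte[numBytes];
    in.readBytes(bytes, 0, numBytes);
    return new String(bytes, "UTF-8");
  }
}

class StringDecoders {
  // placeholder: in practice the cutover would key off the version in the segments file
  static StringDecoder forFormat(boolean isByteCountFormat) {
    return isByteCountFormat ? new ByteCountStringDecoder() : new CharCountStringDecoder();
  }
}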
On May 11, 2006, at 8:02 AM, Yonik Seeley wrote:
Of course there is that *little* detail of backward compatibility ;-)
There is that. :)
Between using bytecounts as String prefixes, transitioning from
modified UTF-8 to standard UTF-8, and potentially changing the
definition of VInt, the
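A quick self-contained illustration of the modified-UTF-8 vs. standard-UTF-8 gap (DataOutputStream.writeUTF is used here only to produce Java's modified UTF-8; it is not how Lucene writes strings): U+0000 becomes the two bytes C0 80, and a supplementary character becomes two 3-byte surrogate encodings instead of one 4-byte sequence:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

public class Utf8Compare {
  public static void main(String[] args) throws Exception {
    // U+0000 plus a supplementary character (U+1D11E, outside the BMP)
    String s = "\u0000" + new String(Character.toChars(0x1D11E));

    byte[] standard = s.getBytes("UTF-8");        // 1 + 4 = 5 bytes

    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    new DataOutputStream(buf).writeUTF(s);        // 2-byte length prefix + modified UTF-8
    int modified = buf.size() - 2;                // 2 + 6 = 8 bytes

    System.out.println("standard UTF-8: " + standard.length + " bytes");
    System.out.println("modified UTF-8: " + modified + " bytes");
  }
}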
On 5/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
I believe that this is possible if we change the definition of VInt
so that the high bytes are written first, rather than the low bytes.
The "BER compressed integer"
Great idea Marvin! The decoding could be slightly faster with
reverse-byt
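Roughly, a high-byte-first ("BER compressed integer") VInt would look like the sketch below; the method names are made up for illustration. Writing the high 7-bit groups first means a leading pad byte of 0x80 contributes nothing to the value, and the decode loop is a simple shift-and-or:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public class BerVInt {
  static void writeBerVInt(IndexOutput out, int i) throws IOException {
    int shift = 28;                                  // highest 7-bit group of a 32-bit int
    while (shift > 0 && (i >>> shift) == 0) {
      shift -= 7;                                    // skip leading zero groups
    }
    for (; shift > 0; shift -= 7) {
      out.writeByte((byte) (((i >>> shift) & 0x7F) | 0x80)); // high groups first
    }
    out.writeByte((byte) (i & 0x7F));                // lowest group last, no continuation bit
  }

  static int readBerVInt(IndexInput in) throws IOException {
    int value = 0;
    byte b = in.readByte();
    while ((b & 0x80) != 0) {                        // continuation bit set: more to come
      value = (value << 7) | (b & 0x7F);
      b = in.readByte();
    }
    return (value << 7) | b;
  }
}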
On May 11, 2006, at 3:24 AM, Ben van Klinken wrote:
Here is where the problem is, though: this
is not possible currently because we are using a VInt for the field
data length.
What we really need is the ability to add "leading zeroes" to a VInt.
I believe that this is possible if we change the definition of VInt
so that the high bytes are written first, rather than the low bytes.
int?
fieldsStream.readInt():
fieldsStream.readVInt()]; << CHANGE HERE
...
string value;
if ( dontUseVint ){
<< I'm not completely sure about this section,
since changes relating to 'by
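The fragment above is cut off, but the idea appears to be branching on a flag when reading the stored-field length. Below is a hedged reconstruction, not the actual patch: dontUseVint is taken from the fragment, the rest is assumed, and the length is treated as a byte count, which is what this thread is arguing for:

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

public class FieldLengthRead {
  // read one stored string: a fixed 4-byte length could later be rewritten in
  // place, while the existing VInt length cannot
  static String readFieldString(IndexInput fieldsStream, boolean dontUseVint)
      throws IOException {
    int length = dontUseVint
        ? fieldsStream.readInt()     // fixed-width length
        : fieldsStream.readVInt();   // current variable-width length
    byte[] bytes = new byte[length];
    fieldsStream.readBytes(bytes, 0, length);
    return new String(bytes, "UTF-8");
  }
}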
Got it.
This was the problem, in TermInfosWriter.writeTerm():
-lastTerm = term;
+lastBytes = bytes;
}
Without lastTerm being updated, the auxiliary term dictionary got
screwed up. This problem only manifested on large tests because small
tests never moved past the first entry, which
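For context, this is why the missing assignment matters (illustration only, not the real TermInfosWriter code): each term is stored as a delta against the previous entry, so whatever holds that previous entry has to be refreshed after every write or the shared-prefix lengths come out wrong.

class PrefixDelta {
  // shared-prefix computation over the raw term bytes
  static int sharedPrefixLength(byte[] last, byte[] current) {
    int limit = Math.min(last.length, current.length);
    int prefix = 0;
    while (prefix < limit && last[prefix] == current[prefix]) {
      prefix++;
    }
    return prefix;
  }

  // in a writeTerm-like method (pseudocode):
  //   int prefix = sharedPrefixLength(lastBytes, bytes);
  //   writeVInt(prefix);                  // bytes shared with the previous term
  //   writeVInt(bytes.length - prefix);   // suffix length
  //   writeBytes(bytes, prefix, ...);     // suffix bytes
  //   lastBytes = bytes;                  // the assignment that was missing
}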
No progress yet.
I think my next move is to do what I did when trying to get KinoSearch
to write Lucene-compatible indexes:
1) Generate an optimized split-file format Lucene index from a
pathological test corpus.
2) Hack KinoSearch so that it ought to produce an index which is
identical
On Sat, May 06, 2006 at 05:11:02PM +0900, David Balmain wrote:
> Hi Marvin,
>
> Where are you with this? I also have a vested interest in seeing
> Lucene move to using byte counts. I was wondering if I could help out.
> Is the patch you pasted here the latest you have?
All I've added since then i
Hi Marvin,
Where are you with this? I also have a vested interest in seeing
Lucene move to using byte counts. I was wondering if I could help out.
Is the patch you pasted here the latest you have?
Cheers,
Dave
On 4/12/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
Greets,
I'm back working on converting Lucene to using a byte count instead
of a char count as a prefix at the head of each String.
Marvin Humphrey wrote:
More problematic than the "Modified UTF-8" actually, is the definition
of a Lucene String. According to the File Formats document, "Lucene
writes strings as a VInt representing the length, followed by the
character data." The word "length" is ambiguous in that context,
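A small example of the ambiguity: for non-ASCII text, the Java char count, the code point count, and the UTF-8 byte count can all differ, so "length" has to say which one it means.

public class LengthAmbiguity {
  public static void main(String[] args) throws Exception {
    // "café" plus one supplementary character (U+1D11E)
    String s = "caf\u00E9" + new String(Character.toChars(0x1D11E));
    System.out.println(s.length());                      // 6 Java chars (surrogate pair = 2)
    System.out.println(s.codePointCount(0, s.length())); // 5 code points
    System.out.println(s.getBytes("UTF-8").length);      // 9 UTF-8 bytes
  }
}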
On May 1, 2006, at 7:33 PM, Chuck Williams wrote:
> Could someone summarize succinctly why it is considered a
> major issue that Lucene uses the Java modified UTF-8
> encoding within its index rather than the standard UTF-8
> encoding. Is the only concern compatibility with index
> formats in ot
--- jian chen <[EMAIL PROTECTED]> wrote:
> Plus, as open source and open standard advocates, we
> don't want to be like
> Micros$ft, who claims to use industrial "standard"
> XML as the next
> generation word file format. However, it is very
> hard to write your own Word
> reader, because their wo
The benefits of a byte count are substantial, including:
1. Lazy fields can skip strings without reading them, as they do for
all other value types (see the sketch after this list).
2. The file format could be changed to standard UTF-8 without any
significant performance cost
3. Any other index operation that
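The sketch referenced in item 1 (hypothetical, not actual Lucene code): with a byte-count prefix a lazy field can simply seek past the value, whereas with a char count it would have to decode every character just to find out where the string ends.

import java.io.IOException;
import org.apache.lucene.store.IndexInput;

public class LazySkip {
  static void skipStringField(IndexInput in) throws IOException {
    int numBytes = in.readVInt();                 // byte-count prefix (proposed format)
    in.seek(in.getFilePointer() + numBytes);      // jump over the value without reading it
  }
}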
Hi, Doug,
I totally agree with what you said. Yeah, I think it is more of a file
format issue, less of an API issue. It seems that we just need to add an
extra constructor to Term.java to take in a utf8 byte array.
Lucene 2.0 is going to break the backward compatibility anyway, right? So,
maybe this
Chuck Williams wrote:
For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem. If there is a way to beat this
problem, then I'd start arguing for a byte count.
I think the way to
Hi Jian,
I agree with you about Microsoft. It's a standard ploy to put window
dressing on stuff to combat competition, in this case from the open
document standard.
So the UTF-8 concern is interoperability with other programs at the
index level. An interesting question here is whether the Lucen
Plus, as open source and open standard advocates, we don't want to be like
Micros$ft, who claims to use industrial "standard" XML as the next
generation word file format. However, it is very hard to write your own Word
reader, because their word file format is proprietary and hard to write
program
Hi, Chuck,
Using standard UTF-8 is very important for Lucene index so any program could
read the Lucene index easily, be it written in perl, c/c++ or any new future
programming languages.
It is like storing data in a database for a web application. You want to store
it in such a way that other pro
Could someone summarize succinctly why it is considered a major issue
that Lucene uses the Java modified UTF-8 encoding within its index
rather than the standard UTF-8 encoding. Is the only concern
compatibility with index formats in other Lucene variants? The API to
the values is a String, which
Hi, Marvin,
Thanks for your quick response. I am in the camp of fearless refactoring,
even at the expense of breaking compatibility with previous releases. ;-)
Compatibility aside, I am trying to identify if changing the implementation
of Term is the right way to go for this problem.
If it is,
On May 1, 2006, at 6:27 PM, jian chen wrote:
This way, for indexing new documents, the new Term(String text) is
called
and utf8bytes will be obtained from the input term text. For
segment term
info merge, the utf8bytes will be loaded from the Lucene index, which
already stores the term text
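A hedged sketch of that idea (hypothetical class, not the real Term.java): one constructor encodes the text at indexing time, the other reuses the bytes already read from the index during a merge, so the UTF-8 conversion happens at most once per term.

public class UTF8Term {
  final String field;
  final String text;
  final byte[] utf8Bytes;

  // indexing path: derive the bytes from the caller's text
  public UTF8Term(String field, String text) throws java.io.UnsupportedEncodingException {
    this(field, text, text.getBytes("UTF-8"));
  }

  // merge path: reuse the bytes already stored in the index
  public UTF8Term(String field, String text, byte[] utf8Bytes) {
    this.field = field;
    this.text = text;
    this.utf8Bytes = utf8Bytes;
  }
}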
Hi, All,
Recently I have been following through the whole discussion on storing
text/string as standard UTF-8 and how to achieve that in Lucene.
If we are storing the term text and the field strings as UTF-8 bytes, I now
understand that it is a tricky issue because of the performance problem we
Marvin Humphrey wrote:
A phantom blank Term shows up out of nowhere in the middle of the merge
process.
When you stick a System.err.println into TermInfosWriter's writeTerm...
Did you try putting a print statement in SegmentMergeInfo.next(), to see
where this blank term comes from?
Doug
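Something like the following is probably what Doug has in mind; the body is only an approximation of SegmentMergeInfo.next(), with a temporary trace line added to show which segment each term comes from:

// temporary debugging only, not for commit
final boolean next() throws java.io.IOException {
  if (termEnum.next()) {
    term = termEnum.term();
    System.err.println("segment base=" + base + " term=" + term.field() + ":" + term.text());
    return true;
  } else {
    term = null;
    return false;
  }
}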
: To: java-dev@lucene.apache.org
: Subject: Re: bytecount as prefix
:
:
: On Apr 11, 2006, at 12:05 PM, Marvin Humphrey wrote:
:
: > TestRangeFilter.
:
: A phantom blank Term shows up out of nowhere in the middle of the
: merge process.
:
: When you stick a System.err.println into TermInfosW
On Apr 11, 2006, at 12:05 PM, Marvin Humphrey wrote:
TestRangeFilter.
A phantom blank Term shows up out of nowhere in the middle of the
merge process.
When you stick a System.err.println into TermInfosWriter's writeTerm,
you ordinarily see it adding Terms in proper sort order:
[j
On Apr 11, 2006, at 2:27 PM, Marvin Humphrey wrote:
"all but last", "all but first" and "all but ends" pass!
Scratch that, it's totally untrue. I'd forgotten that these compound
test cases bail as soon as there's a single failure. "all but last"
also fails to return any docs at all.
M
On Apr 11, 2006, at 2:08 PM, Yonik Seeley wrote:
On 4/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
What do the failing tests have in common?
On TestIndexModifier, only a small portion of the deletions fail, and
they're all for fairly high values of delId -- sometimes the highest,
but not
On 4/11/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> What do the failing tests have in common?
>
> On TestIndexModifier, only a small portion of the deletions fail, and
> they're all for fairly high values of delId -- sometimes the highest,
> but not always. For RangeFilter and ConstantScoreRa
On Apr 11, 2006, at 12:18 PM, Doug Cutting wrote:
Marvin Humphrey wrote:
I'm back working on converting Lucene to using a byte count
instead of a char count as a prefix at the head of each
String. Three tests are failing: TestIndexModifier,
TestConstantScoreRangeQuery, and TestRangeFilter.
Marvin Humphrey wrote:
I'm back working on converting Lucene to using a byte count instead of
a char count as a prefix at the head of each String. Three tests
are failing: TestIndexModifier, TestConstantScoreRangeQuery, and
TestRangeFilter.
Why those and not others?
- private static f
Greets,
I'm back working on converting Lucene to using a byte count instead
of a char count as a prefix at the head of each String. Three
tests are failing: TestIndexModifier, TestConstantScoreRangeQuery,
and TestRangeFilter.
Why those and not others?
Marvin Humphrey
Rectangular Research