Re: Flexible indexing design

2008-04-29 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > Container is only aware of the single inStream, while codec can still > > think it's operating on 3 even if it's really 1 or 2. > > > > I don't understand. If you have three streams, all of them are going to > have to get skipped, right? For the "all

Re: Flexible indexing design

2008-04-28 Thread Marvin Humphrey
On Apr 27, 2008, at 3:28 AM, Michael McCandless wrote: Actually, I was picturing that the container does the seeking itself (using skip data), to get "close" to the right point, and then it uses the codec to step through single docs at a time until it's at or beyond the right one. I believe i
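
The split described in this message (the container seeks with skip data to get "close", then the codec steps through single docs until it is at or beyond the target) might be sketched roughly as follows. This is only an illustration under assumptions: `Container`, `DocCodec`, `read_doc`, `seek`, and `skip_interval` are hypothetical names, not Lucene or KinoSearch API, and the postings are modeled as a plain delta-encoded list rather than a real stream.

```python
from bisect import bisect_left


class DocCodec:
    """Hypothetical codec: decodes one delta-encoded doc id at a time."""

    def __init__(self, deltas):
        self.deltas = deltas

    def read_doc(self, pos, prev_doc):
        # The doc id at `pos` is the previous doc id plus the stored delta.
        return prev_doc + self.deltas[pos]


class Container:
    """Hypothetical container: owns the skip data and the seek logic."""

    def __init__(self, doc_ids, skip_interval=4):
        # Delta-encode the sorted doc ids (first delta is relative to 0).
        self.deltas = [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]
        self.codec = DocCodec(self.deltas)
        # Each skip entry records the doc id just before a position,
        # so decoding can resume at that position.
        self.skip_docs = []
        self.skip_pos = []
        for pos in range(skip_interval, len(doc_ids), skip_interval):
            self.skip_docs.append(doc_ids[pos - 1])
            self.skip_pos.append(pos)

    def seek(self, target):
        """Return the first doc id >= target, or None if there is none."""
        # Step 1: use skip data to get "close" (last entry with doc < target).
        i = bisect_left(self.skip_docs, target) - 1
        pos, doc = (self.skip_pos[i], self.skip_docs[i]) if i >= 0 else (0, 0)
        # Step 2: let the codec step one doc at a time until at or beyond target.
        while pos < len(self.deltas):
            doc = self.codec.read_doc(pos, doc)
            pos += 1
            if doc >= target:
                return doc
        return None


container = Container([2, 5, 9, 14, 20, 21, 30, 41, 55])
print(container.seek(22))
```

In this sketch the codec never sees the skip data; it only knows how to decode the next doc given the previous one, which is the separation the message seems to be aiming for.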

Re: Flexible indexing design

2008-04-27 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > > Seeking might get a little weird, I suppose. > > > > Maybe not?: if the container is only aware of the single InStream, and > > say it's "indexed" with a multi-skip index, then when you ask > > container to seek, it forwards the request to multi-ski

Re: Flexible indexing design

2008-04-24 Thread Marvin Humphrey
On Apr 24, 2008, at 4:47 AM, Michael McCandless wrote: Seeking might get a little weird, I suppose. Maybe not?: if the container is only aware of the single InStream, and say it's "indexed" with a multi-skip index, then when you ask container to seek, it forwards the request to multi-skip whic

Re: Flexible indexing design

2008-04-24 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote: > > > > If I have a pluggable indexer, > > then on the querying side I need something (I'm not sure what/how) > > that knows how to create the right demuxer (container) and codec > > (decoder) to

Re: Flexible indexing design

2008-04-18 Thread Marvin Humphrey
On Apr 17, 2008, at 11:57 AM, Michael McCandless wrote: If I have a pluggable indexer, then on the querying side I need something (I'm not sure what/how) that knows how to create the right demuxer (container) and codec (decoder) to interact with whatever my indexing plugins wrote. So I don't t

Re: Flexible indexing design

2008-04-17 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote: > > > > I think the major difference is locality? In a compound file, you > > have to seek "far away" to reach the prx & skip data (if they are > > separate). > > There's another item worth mentio

Re: Flexible indexing design

2008-04-15 Thread Marvin Humphrey
On Apr 13, 2008, at 2:35 AM, Michael McCandless wrote: I think the major difference is locality? In a compound file, you have to seek "far away" to reach the prx & skip data (if they are separate). There's another item worth mentioning, something that Doug, Grant and I discussed when this

Re: Flexible indexing design

2008-04-13 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote: > > > > Can't you compartmentalize while still serializing skip data into the > > single frq/prx file? > > > > Yes, that's possible. > > The way KS is set up right now, PostingList objects maintai

Re: Flexible indexing design

2008-04-12 Thread Marvin Humphrey
On Apr 10, 2008, at 3:10 AM, Michael McCandless wrote: Can't you compartmentalize while still serializing skip data into the single frq/prx file? Yes, that's possible. The way KS is set up right now, PostingList objects maintain i/o state, and Posting's Read_Record() method just deals with

Re: Flexible indexing design

2008-04-10 Thread Michael McCandless
Marvin Humphrey <[EMAIL PROTECTED]> wrote: > On Apr 9, 2008, at 6:35 AM, Michael Busch wrote: > > > > We also need to come up with a good solution for the dictionary, because a > term with frq/prx postings needs to store two (or three for skiplist) file > pointers in the dictionary, whereas e.g. a

Re: Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-10 Thread Michael McCandless
Michael Busch <[EMAIL PROTECTED]> wrote: > > I agree we would have an abstract base Posting class that just tracks > > the term text. > > > > Then, DocumentsWriter manages inverting each field, maintaining the > > per-field hash of term Text -> abstract Posting instances, exposing > > the methods

Re: Flexible indexing design

2008-04-09 Thread Marvin Humphrey
On Apr 9, 2008, at 6:35 AM, Michael Busch wrote: We also need to come up with a good solution for the dictionary, because a term with frq/prx postings needs to store two (or three for skiplist) file pointers in the dictionary, whereas e.g. a "binary" posting list only needs one pointer.

Re: Flexible indexing

2007-03-13 Thread Marvin Humphrey
On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote: At present KS allows you to attach both a Similarity and an Analyzer to a field name via a FieldSpec subclass. I haven't quite figured out how to attach a posting format. Should it return an object, like FieldSpec's similarity() method does?

Re: Flexible indexing

2007-03-13 Thread Marvin Humphrey
On Mar 12, 2007, at 5:08 PM, Grant Ingersoll wrote: I can see having storage at: Index Document/Field //already exists Token I hadn't thought of it that way, as a logical extension outwards at all levels. If I understand you correctly, it's a clever point, but the thing is, it's cake f

Re: Flexible indexing

2007-03-13 Thread Marvin Humphrey
On Mar 13, 2007, at 2:38 AM, Michael Busch wrote: Global field semantics make our life with FI much easier in a single index. But even with global field semantics we would have the same problem with the IndexWriter.addIndexes() method, no? I'm curious about how you solved that conflict in

Re: Flexible indexing

2007-03-13 Thread Nicolas Lalevée
Le Dimanche 11 Mars 2007 22:41, Michael Busch a écrit : > Hi Grant, > > I certainly agree that it would be great if we could make some progress > and commit the payloads patch soon. I think it is quite independent from > FI. FI will introduce different posting formats (see Wiki: > http://wiki.apach

Re: Flexible indexing

2007-03-13 Thread Michael Busch
Marvin Humphrey wrote: It uses global field semantics, which Hoss won't be happy about. ;) However, I'm grateful to Hoss for past critiques, as they've helped me to refine and improve how Schema works. For instance, as of KS 0.20_02 you can introduce new field_name => FieldSpec association

Re: Flexible indexing

2007-03-13 Thread Nicolas Lalevée
Le Lundi 12 Mars 2007 21:34, Marvin Humphrey a écrit : > On Mar 10, 2007, at 3:27 PM, Michael Busch wrote: > > - Introduce index format. Nicolas has already written a lot of code > > in this regard! > > I worry that going the interface route is going to be too > restrictive. When I looked at Nicho

Re: Flexible indexing

2007-03-12 Thread Grant Ingersoll
On Mar 12, 2007, at 6:54 PM, Michael Busch wrote: Marvin Humphrey wrote: On Mar 12, 2007, at 2:11 PM, Michael Busch wrote: I think our best option here is to have a closed XML file for the index format/configuration (something like you sent in your other mail) plus a binary file for cust

Re: Flexible indexing

2007-03-12 Thread Marvin Humphrey
On Mar 12, 2007, at 3:54 PM, Michael Busch wrote: Sounds interesting! I will take a closer look at it... Here's an introduction courtesy of JYaml, a YAML library for Java: http://jyaml.sourceforge.net/tutorial.html For an example of how YAML is well suited to the task of serializing ind

Re: Flexible indexing

2007-03-12 Thread Michael Busch
Marvin Humphrey wrote: On Mar 12, 2007, at 2:11 PM, Michael Busch wrote: I think our best option here is to have a closed XML file for the index format/configuration (something like you sent in your other mail) plus a binary file for custom index-level metadata like Grant suggested. Why th

Re: Flexible indexing

2007-03-12 Thread Marvin Humphrey
On Mar 12, 2007, at 2:11 PM, Michael Busch wrote: I think our best option here is to have a closed XML file for the index format/configuration (something like you sent in your other mail) plus a binary file for custom index-level metadata like Grant suggested. Why the binary file? Btw,

Re: Flexible indexing

2007-03-12 Thread Michael Busch
Marvin Humphrey wrote: On Mar 10, 2007, at 3:27 PM, Michael Busch wrote: I'm going to respond to this over several mails (: and possibly days :) because there's an awful lot here, and I've already implemented a lot of it in KS. We should also make this public, so that users can store their

Re: Flexible indexing

2007-03-12 Thread Marvin Humphrey
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote: - Introduce index-level metadata. Preferably in XML format, so it will be human readable. Later on, we can store information about the index format in this file, like the codecs that are used to store the data. To provoke thought about wh

Re: Flexible indexing

2007-03-12 Thread Marvin Humphrey
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote: - Introduce index format. Nicolas has already written a lot of code in this regard! I worry that going the interface route is going to be too restrictive. When I looked at Nicolas's index format spec, I immediately wanted to add an Anal

Re: Flexible indexing

2007-03-12 Thread Marvin Humphrey
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote: I'm going to respond to this over several mails (: and possibly days :) because there's an awful lot here, and I've already implemented a lot of it in KS. We should also make this public, so that users can store their own index metadata.

Re: Flexible indexing

2007-03-11 Thread Michael Busch
Grant Ingersoll wrote: With regard to FI and 662, however, I really believe we should split it up and plan ahead (in a way I mentioned already), so that we have more isolated patches. It is really great that we have 662 already (Nicolas, thank you so much for your hard work, I hope you'll keep w

Re: Flexible indexing

2007-03-11 Thread Grant Ingersoll
On Mar 11, 2007, at 5:41 PM, Michael Busch wrote: Hi Grant, I certainly agree that it would be great if we could make some progress and commit the payloads patch soon. I think it is quite independent from FI. FI will introduce different posting formats (see Wiki: http://wiki.apache.org/l

Re: Flexible indexing

2007-03-11 Thread Michael Busch
Hi Grant, I certainly agree that it would be great if we could make some progress and commit the payloads patch soon. I think it is quite independent from FI. FI will introduce different posting formats (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be part of

Re: Flexible indexing (was: Re: [jira] Commented: (LUCENE-755) Payloads)

2007-03-10 Thread Grant Ingersoll
Hi Michael, This is very good. I know 662 is different, just wasn't sure if Nicolas's patch was meant to be applied after 662, b/c I know we had discussed this before. I do agree with you about planning this out, but I also know that patches seem to motivate people the best and provide a c

Re: Flexible Indexing (was Re: Lucene Planning)

2006-06-02 Thread Marvin Humphrey
On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote: I thought it was you, but wasn't sure. I'm always looking for ways to minimize Term Vectors, because I consider excerpting/highlighting a core feature rather than an add-on, and they seem like such overkill. It bothers me that they dupl

Re: Flexible Indexing (was Re: Lucene Planning)

2006-06-02 Thread Grant Ingersoll
I thought it was you, but wasn't sure. I would also like a way to store the frequency of the term in the overall collection (probably should go in the Term dictionary, but not sure, at the cost of an additional VInt per term, but I am open to other places to store it). Right now, in order to
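
To give a feel for what "an additional VInt per term" costs, here is a minimal sketch of variable-length integer encoding as Lucene's file format documentation describes it: seven data bits per byte, low-order bits first, with the high bit set when more bytes follow. The function names `write_vint` and `read_vint` are illustrative, not Lucene API.

```python
def write_vint(value):
    """Encode a non-negative integer as a VInt (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("VInt encodes non-negative integers only")
    out = bytearray()
    while value >= 0x80:
        # Emit the low 7 bits with the continuation flag set.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # Final byte has the high bit clear.
    return bytes(out)


def read_vint(data, offset=0):
    """Decode one VInt from `data` at `offset`; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        b = data[offset]
        offset += 1
        result |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return result, offset
        shift += 7


# A collection frequency below 128 costs one byte; below 16384, two bytes.
for freq in (5, 127, 128, 16383, 16384):
    print(freq, len(write_vint(freq)))
```

So for most terms, whose collection frequency is small, the extra field would add one or two bytes to the dictionary entry.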

Re: Flexible Indexing (was Re: Lucene Planning)

2006-06-01 Thread Marvin Humphrey
On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote: Someone on the list a while ago suggested moving Term Vectors out of the postings and storing them separately, as then they don't have to be merged (but the doc ids would have to be kept up to date) Yes, that was me. :) I suggested stor

Re: Flexible Indexing (was Re: Lucene Planning)

2006-06-01 Thread Grant Ingersoll
Marvin Humphrey wrote: * Term Vectors (optional) Someone on the list a while ago suggested moving Term Vectors out of the postings and storing them separately, as then they don't have to be merged (but the doc ids would have to be kept up to date) -- Grant Ingersoll Sr. Software Engi

Re: Flexible Indexing (was Re: Lucene Planning)

2006-05-31 Thread Marvin Humphrey
[wild brainstorming...] Another reason to consolidate the freqs, positions, and boosts/norms into one file: we can isolate and distill the code that encodes/decodes that file into a plugin, weakening the current tight coupling between Lucene and its file format. Changing that index format