Re: Future projects

2009-04-22 Thread Jason Rutherglen
Hey Michael, You're in San Jose? Feel free to come by one of these days on our pizza days. Also, can you post what you have of LUCENE-1231? I got a lot more familiar with IndexWriter internals with LUCENE-1516 and could take a good whack at getting LUCENE-1231 integrated. Cheers! Jason On Sun,

Re: Future projects

2009-04-12 Thread Michael Busch
On 4/4/09 4:42 AM, Michael McCandless wrote: As I recently mentioned on 1231 I'm looking into changing the Document and Field APIs. I have a rough prototype. I think we should also try to get it in before 2.9? On the other hand I don't want to block the 2.9 release with too much stuff. T

Re: Future projects

2009-04-08 Thread Michael McCandless
On Tue, Apr 7, 2009 at 7:05 PM, Jason Rutherglen wrote: >  >  I think we should keep it simple, unless we discover real perf problems > with the current approach. > > Simple is good, however the indexing performance will lag because we're back > to the indexing speed of pre ram buffer? (i.e. mergi

Re: Future projects

2009-04-07 Thread Jason Rutherglen
> I think we should keep it simple, unless we discover real perf problems with the current approach. Simple is good, however the indexing performance will lag because we're back to the indexing speed of pre ram buffer? (i.e. merging segments using a ramdirectory). > need to do a merge sort (acr

Re: Future projects

2009-04-07 Thread Michael McCandless
On Mon, Apr 6, 2009 at 6:43 PM, Jason Rutherglen wrote: >> The realtime reader would have to have sub-readers per thread, > and an aggregate reader that "joins" them by interleaving the > docIDs > > Nice (i.e. nice and complex)! Right, this is why I like the current [simple] near real-time approa

Re: Future projects

2009-04-06 Thread Jason Rutherglen
> The realtime reader would have to have sub-readers per thread, and an aggregate reader that "joins" them by interleaving the docIDs Nice (i.e. nice and complex)! Not knowing too much about the internals, how would the interleaving work? Does each subreader have a "start" ala Multi*Reader? Or are
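One way to picture the "interleaving" asked about here is a k-way merge. The following is only a minimal sketch under stated assumptions, not Lucene code: it assumes each indexing thread holds its documents' global docIDs in ascending order, and the function name `interleave_doc_ids` is hypothetical.

```python
import heapq

def interleave_doc_ids(per_thread_doc_ids):
    """Merge per-thread ascending docID lists into one ascending list.

    Assumption: each inner list is already sorted and docIDs are
    globally unique across threads.
    """
    # heapq.merge lazily performs a k-way merge of sorted iterables.
    return list(heapq.merge(*per_thread_doc_ids))

# Three hypothetical indexing threads, each owning some global docIDs.
thread_a = [0, 3, 4]
thread_b = [1, 2, 6]
thread_c = [5]
merged = interleave_doc_ids([thread_a, thread_b, thread_c])
```

Under this assumption there is no per-sub-reader "start" offset as in a Multi*Reader; the aggregate view simply walks all per-thread lists in docID order.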

Re: Future projects

2009-04-04 Thread Michael McCandless
On Fri, Apr 3, 2009 at 8:01 PM, Jason Rutherglen wrote: > I looked at the IndexWriter code with regard to creating a realtime reader. > With the many flexible indexing classes, I'm unsure of how one would get a > frozenish IndexInput of the byte slices, given the byte slices are attached > to differ

Re: Future projects

2009-04-04 Thread Michael McCandless
On Fri, Apr 3, 2009 at 7:11 PM, Michael Busch wrote: > Yeah me too. I think eventually we want this to be a Codec, but we probably > don't want to wait until all the flexible indexing work is done. > So maybe we should just not worry too much about a perfectly integrated API > yet and release it

Re: Future projects

2009-04-04 Thread Michael McCandless
On Fri, Apr 3, 2009 at 5:42 PM, Jason Rutherglen wrote: >> I think the realtime reader'd just store the maxDocID it's allowed to >> search, and we would likely keep using the RAM format now used. > > Sounds pretty good.  Are there any other gotchas in the design? Yes: the flushing process becomes

Re: Future projects

2009-04-04 Thread Michael McCandless
On Fri, Apr 3, 2009 at 5:32 PM, Jason Rutherglen wrote: >> meaning in Bobo you'd like to manage your own memory resident > field caches, and merge them whenever IW has merged a segment? > Seems like you don't need genealogy for that. > > Agreed, there is no need for full genealogy. OK >> CSF isn

Re: Future projects

2009-04-04 Thread Michael McCandless
On Fri, Apr 3, 2009 at 3:16 PM, John Wang wrote: > By default bobo DOES use a flavor of the field cache data structure with > some additional information for performance (e.g. minDocid, maxDocid, freq per > term). > Bobo is architected as a platform where clients can write their own > "FacetHandlers"

Re: Future projects

2009-04-03 Thread Jason Rutherglen
I looked at the IndexWriter code with regard to creating a realtime reader. With the many flexible indexing classes, I'm unsure of how one would get a frozenish IndexInput of the byte slices, given the byte slices are attached to different threads? On Fri, Apr 3, 2009 at 2:42 PM, Jason Rutherglen w

Re: Future projects

2009-04-03 Thread Michael Busch
On 4/3/09 3:35 AM, Michael McCandless wrote: It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it? I think Michael is looking into it? I'd really like to get it into 2.9. We should do it in co

Re: Future projects

2009-04-03 Thread Jason Rutherglen
> I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used. Sounds pretty good. Are there any other gotchas in the design? On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed
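The maxDocID idea quoted above can be illustrated with a toy append-only buffer. This is a minimal sketch under stated assumptions, not IndexWriter's actual RAM format: the classes `RamBuffer` and `BoundedReader` are hypothetical, documents are plain strings, and the bound is stored as an exclusive upper limit.

```python
class BoundedReader:
    """Point-in-time view: only sees docIDs below the bound it was opened with."""

    def __init__(self, docs, max_doc):
        self._docs = docs        # shared, append-only list (assumption)
        self._max_doc = max_doc  # exclusive upper bound on searchable docIDs

    def search(self, term):
        # Scan only the documents that existed when the reader was opened.
        return [doc_id for doc_id in range(self._max_doc)
                if term in self._docs[doc_id].split()]


class RamBuffer:
    """Toy in-memory buffer that assigns docIDs in arrival order."""

    def __init__(self):
        self._docs = []

    def add(self, text):
        self._docs.append(text)
        return len(self._docs) - 1

    def open_reader(self):
        return BoundedReader(self._docs, len(self._docs))


buffer = RamBuffer()
buffer.add("a b")
buffer.add("b c")
old_reader = buffer.open_reader()
buffer.add("b d")                  # added after old_reader was opened
new_reader = buffer.open_reader()
old_hits = old_reader.search("b")  # does not see the third document
new_hits = new_reader.search("b")
```

Because the buffer only ever appends, an already-open reader needs nothing copied; the stored bound alone keeps its view stable.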

Re: Future projects

2009-04-03 Thread Jason Rutherglen
> meaning in Bobo you'd like to manage your own memory resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that. Agreed, there is no need for full genealogy. > CSF isn't really designed yet. How come it can't be used with Bobo's field ca

Re: Future projects

2009-04-03 Thread John Wang
By default bobo DOES use a flavor of the field cache data structure with some additional information for performance (e.g. minDocid, maxDocid, freq per term). Bobo is architected as a platform where clients can write their own "FacetHandlers" in which each FacetHandler manages its own view of memory st

Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 6:55 PM, John Wang wrote: > Just to clarify, Approach 1 and approach 2 are both currently performing ok > for us. OK that's very good to know. Mike

Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen wrote: >> I think I need to understand better why delete by Query isn't > viable in your situation... > > The delete by query is a separate problem which I haven't fully > explored yet. Oh, I had thought we were tugging on this thread in order to e

Re: Future projects

2009-04-02 Thread John Wang
Just to clarify, Approach 1 and approach 2 are both currently performing ok for us. -John On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen > wrote: > >> What does Bobo use the cached bitsets for? >

Re: Future projects

2009-04-02 Thread Jason Rutherglen
> I think I need to understand better why delete by Query isn't viable in your situation... The delete by query is a separate problem which I haven't fully explored yet. Tracking the segment genealogy is really an interim step for merging field caches before column stride fields get implemented.

Re: Future projects

2009-04-02 Thread Michael McCandless
On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen wrote: >> What does Bobo use the cached bitsets for? > > Bobo is a faceting engine that uses custom field caches and sometimes cached > bitsets rather than relying exclusively on bitsets to calculate facets.  It > is useful where many facets (50+) n

Re: Future projects

2009-04-02 Thread Jason Rutherglen
> What does Bobo use the cached bitsets for? Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection
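To make the field-cache approach to faceting concrete, here is a minimal sketch under stated assumptions. It is not Bobo's code: it assumes a field cache is simply an array mapping docID to that document's value for one field, and the function name `count_facets` is hypothetical.

```python
def count_facets(hit_doc_ids, field_cache):
    """Count facet values over the hits using a docID -> value array.

    Assumption: field_cache[doc_id] holds the single field value of doc_id.
    """
    counts = {}
    for doc_id in hit_doc_ids:
        value = field_cache[doc_id]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Hypothetical "color" field cache for four documents, and a query's hits.
color_cache = ["red", "blue", "red", "green"]
hits = [0, 2, 3]
facet_counts = count_facets(hits, color_cache)
```

The cost is one array lookup per hit, regardless of how many distinct values the field has, whereas a bitset-only approach intersects the hit set with one cached bitset per facet value.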

Re: Future projects

2009-04-02 Thread Michael McCandless
On Thu, Apr 2, 2009 at 2:29 PM, Jason Rutherglen wrote: >> What is "passing filters to the SegmentReader level"? EG as of > LUCENE-1483, we now ask a Filter for its DocIdSet once per > SegmentReader. > > The patch I was thinking of is LUCENE-1536. I wasn't sure what > the next steps are for it, i

Re: Future projects

2009-04-02 Thread Michael McCandless
I'm not sure how big a win this'd be, since the OS will cache those in RAM and the CPU cost there (to pull from OS's cache and reprocess) is maybe not high. Optimizing search is interesting, because it's the wicked slow queries that you need to make faster even when it's at the expense of wicked f

Re: Future projects

2009-04-02 Thread Michael McCandless
On Thu, Apr 2, 2009 at 2:07 PM, Jason Rutherglen wrote: > I'm interested in merging cached bitsets and field caches.  While this may > be something related to LUCENE-831, in Bobo there are custom field caches > which we want to merge in RAM (rather than reload from the reader using > termenum + te

Re: Future projects

2009-04-02 Thread Jason Rutherglen
> What is "passing filters to the SegmentReader level"? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader. The patch I was thinking of is LUCENE-1536. I wasn't sure what the next steps are for it, i.e. the JumpScorer, Scorer.skipToButNotNext, or simply implementing

Re: Future projects

2009-04-02 Thread Jason Rutherglen
I'm interested in merging cached bitsets and field caches. While this may be something related to LUCENE-831, in Bobo there are custom field caches which we want to merge in RAM (rather than reload from the reader using termenum + termdocs). This could somehow lead to delete by doc id. Tracking

Re: Future projects

2009-04-02 Thread Jason Rutherglen
4) An additional, possibly contrib, module is caching the results of TermQueries. Looking at the TermQuery code, would we need to cache the entire docs and freqs as arrays, which would be a memory hog? On Wed, Apr 1, 2009 at 4:05 PM, Jason Rutherglen wrote: > Now that LUCENE-1516 is close to bei

Re: Future projects

2009-04-02 Thread John Wang
Michael: I love your suggestion on 3)! This really opens doors for flexible indexing. -John On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen > wrote: > > Now that LUCENE-1516 is close to being commit

Re: Future projects

2009-04-02 Thread Michael McCandless
On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen wrote: > Now that LUCENE-1516 is close to being committed perhaps we can > figure out the priority of other issues: > > 1. Searchable IndexWriter RAM buffer I think first priority is to get a good assessment of the performance of the current implem

Future projects

2009-04-01 Thread Jason Rutherglen
Now that LUCENE-1516 is close to being committed perhaps we can figure out the priority of other issues: 1. Searchable IndexWriter RAM buffer 2. Finish up benchmarking and perhaps implement passing filters to the SegmentReader level 3. Deleting by doc id using IndexWriter With 1) I'm interested