Re: Deterministic index construction

2020-12-19 Thread Haoyu Zhai
Hi Adrien,
I think Mike's comment is correct: we already have the index sorted, but we
want to reconstruct an index with exactly the same number of segments, where
each segment contains exactly the same documents.

Mike,
addIndexes can take CodecReader as input [1], which I think allows us to pass
in a customized FilterCodecReader? Then it knows which docs to take.
Suppose the original index has N segments: we could open N IndexWriters
concurrently and rebuild those N segments, and at last somehow merge them
back into a whole index. (I am not quite sure whether we could achieve
the last step easily, but that sounds not so hard?)

[1]
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-
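
The "one writer per original segment" idea above can be sketched in plain Java. This is only an illustration under assumptions, not a Lucene implementation: ParallelRebuild, rebuildAll and the rebuildOne callback are hypothetical names, and the callback stands in for the real work (opening an IndexWriter and calling addIndexes with a filtered CodecReader). The final step of combining the rebuilt segments into one index is left open, as it is in the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

/** Hypothetical helper: rebuilds every target segment on its own worker thread. */
final class ParallelRebuild {
    private ParallelRebuild() {}

    /**
     * Runs rebuildOne(segmentName, docKeys) for each entry of the plan concurrently.
     * Results come back in the plan's sorted segment order, not in completion order,
     * so the output is the same on every run.
     */
    static <R> List<R> rebuildAll(
            SortedMap<String, List<String>> plan,
            BiFunction<String, List<String>, R> rebuildOne,
            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : plan.entrySet()) {
                // In a real setup this task would open one IndexWriter for the segment.
                Callable<R> task = () -> rebuildOne.apply(e.getKey(), e.getValue());
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pending) {
                results.add(f.get()); // waits in submission order
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while rebuilding", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("segment rebuild failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SortedMap<String, List<String>> plan = new TreeMap<>();
        plan.put("_1", List.of("doc-2", "doc-3"));
        plan.put("_0", List.of("doc-0", "doc-1"));
        // Placeholder rebuild step: just report how many docs each segment receives.
        System.out.println(rebuildAll(plan, (seg, docs) -> seg + ":" + docs.size(), 2));
    }
}
```

Collecting the futures in submission order keeps the result order independent of which worker finishes first, which is the property a deterministic rebuild would need.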

On Sat, Dec 19, 2020 at 9:13 AM Michael Sokolov wrote:

> I don't know about addIndexes. Does that let you say which document goes
> where somehow? Wouldn't you have to select a subset of documents from each
> originally indexed segment?
>
> On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov wrote:
>
>> I think the idea is to exert control over the distribution of documents
>> among the segments, in a deterministic reproducible way.
>>
>> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand wrote:
>>
>>> Have you considered leveraging Lucene's built-in index sorting? It
>>> supports concurrent indexing and is quite fast.
>>>
>>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai wrote:
>>>
>>>> Hi
>>>> Our team is seeking a way of construct (or rebuild) a deterministic
>>>> sorted index concurrently (I know lucene could achieve that in a sequential
>>>> manner but that might be too slow for us sometimes)
>>>> Currently we have roughly 2 ideas, all assuming there's a pre-built
>>>> index and have dumped a doc-segment map so that IndexWriter would be able
>>>> to be aware of which doc belong to which segment:
>>>> 1. First build index in the normal way (concurrently), after the index
>>>> is built, using "addIndexes" functionality to merge documents into the
>>>> correct segment.
>>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>>> segment created (before merge) has only the documents that belong to one of
>>>> the segments in the pre-built index. And create a dedicated MergePolicy to
>>>> only merge segments belonging to one pre-built segment.
>>>>
>>>> Basically we think first one is easier to implement and second one is
>>>> faster. Want to seek some ideas & suggestions & feedback here.
>>>>
>>>> Thanks
>>>> Patrick Zhai
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>


Re: 8.8 Release

2020-12-19 Thread Bruno Roustant
+1 Thanks for volunteering

On Fri, Dec 18, 2020 at 01:41, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Sure, Houston. I'll wait another week. Have a good new year and merry
> Christmas!
>
> On Fri, 18 Dec, 2020, 5:58 am Timothy Potter wrote:
>
>> Great point Houston! +1 on waiting until a week into January
>>
>> On Thu, Dec 17, 2020 at 4:46 PM Houston Putman wrote:
>>
>>> Thanks for volunteering Ishan.
>>>
>>> I think it might be a good idea to wait to cut and release 8.8 at least
>>> a week into January. Many people are going to be away during the holiday
>>> season, and particularly the last week of the year. Pushing into January
>>> just gives more people a chance to look at the release and be involved.
>>>
>>> - Houston
>>>
>>> On Fri, Dec 11, 2020 at 3:26 PM Noble Paul wrote:
>>>
>>>> Thanks Ishan for volunteering
>>>>
>>>> On Fri, Dec 11, 2020 at 5:07 AM Christine Poerschke (BLOOMBERG/ LONDON) wrote:
>>>> >
>>>> > With a view towards including it in the release, I'd appreciate code
>>>> > review input on
>>>> >
>>>> > https://github.com/apache/lucene-solr/pull/1992 for
>>>> >
>>>> > https://issues.apache.org/jira/browse/SOLR-14939 (JSON facets: range
>>>> > faceting to support cache=false parameter)
>>>> >
>>>> > if anyone has some time next week perhaps?
>>>> >
>>>> > Thanks in advance!
>>>> >
>>>> > Christine
>>>> >
>>>> > From: dev@lucene.apache.org At: 12/10/20 18:01:58
>>>> > To: dev@lucene.apache.org
>>>> > Subject: Re: 8.8 Release
>>>> >
>>>> > +1
>>>> >
>>>> > Joel Bernstein
>>>> > http://joelsolr.blogspot.com/
>>>> >
>>>> >
>>>> > On Thu, Dec 10, 2020 at 11:23 AM David Smiley wrote:
>>>> >>
>>>> >> Thanks for volunteering!
>>>> >>
>>>> >> On Thu, Dec 10, 2020 at 11:11 AM Ishan Chattopadhyaya <
>>>> >> ichattopadhy...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Devs,
>>>> >>> There are lots of changes accumulated and some underway. I wish to
>>>> >>> volunteer for a 8.8 release, if there are no objections. I'm planning to
>>>> >>> build the RC in three weeks, i.e. 31 December (and cut the branch about 3-4
>>>> >>> days before that). Please let me know if someone has any concerns.
>>>> >>> Thanks and regards,
>>>> >>> Ishan
>>>> >>>
>>>> >> --
>>>> >> Sent from Gmail Mobile
>>>> >
>>>> >
>>>>
>>>>
>>>> --
>>>> Noble Paul
>>>>




Re: Deterministic index construction

2020-12-19 Thread Michael Sokolov
I don't know about addIndexes. Does that let you say which document goes
where somehow? Wouldn't you have to select a subset of documents from each
originally indexed segment?

On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov wrote:

> I think the idea is to exert control over the distribution of documents
> among the segments, in a deterministic reproducible way.
>
> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand wrote:
>
>> Have you considered leveraging Lucene's built-in index sorting? It
>> supports concurrent indexing and is quite fast.
>>
>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai wrote:
>>
>>> Hi
>>> Our team is seeking a way of construct (or rebuild) a deterministic
>>> sorted index concurrently (I know lucene could achieve that in a sequential
>>> manner but that might be too slow for us sometimes)
>>> Currently we have roughly 2 ideas, all assuming there's a pre-built
>>> index and have dumped a doc-segment map so that IndexWriter would be able
>>> to be aware of which doc belong to which segment:
>>> 1. First build index in the normal way (concurrently), after the index
>>> is built, using "addIndexes" functionality to merge documents into the
>>> correct segment.
>>> 2. By controlling FlushPolicy and other related classes, make sure each
>>> segment created (before merge) has only the documents that belong to one of
>>> the segments in the pre-built index. And create a dedicated MergePolicy to
>>> only merge segments belonging to one pre-built segment.
>>>
>>> Basically we think first one is easier to implement and second one is
>>> faster. Want to seek some ideas & suggestions & feedback here.
>>>
>>> Thanks
>>> Patrick Zhai
>>>
>>
>>
>> --
>> Adrien
>>
>


Re: Deterministic index construction

2020-12-19 Thread Michael Sokolov
I think the idea is to exert control over the distribution of documents
among the segments, in a deterministic reproducible way.
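
As a rough illustration of what "control over the distribution of documents" could mean in practice, the dumped doc-segment map mentioned in the original post can be turned into a per-segment plan. The sketch below is plain Java and makes assumptions that are not in the thread: SegmentPlan and groupBySegment are hypothetical names, and documents and segments are identified by made-up string keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: turns a dumped doc-to-segment map into per-segment doc lists. */
final class SegmentPlan {
    private SegmentPlan() {}

    /**
     * Groups document keys by their target segment name.
     * Segments and the docs within each segment are both in sorted order,
     * so the same input map always yields the same plan.
     */
    static TreeMap<String, List<String>> groupBySegment(Map<String, String> docToSegment) {
        TreeMap<String, List<String>> plan = new TreeMap<>();
        // Copy into a TreeMap first so docs are visited in sorted key order.
        for (Map.Entry<String, String> e : new TreeMap<>(docToSegment).entrySet()) {
            plan.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return plan;
    }

    public static void main(String[] args) {
        // Made-up doc keys and segment names, in no particular order.
        Map<String, String> dumped = Map.of(
                "doc-3", "_1",
                "doc-1", "_0",
                "doc-2", "_1",
                "doc-0", "_0");
        System.out.println(groupBySegment(dumped));
        // prints {_0=[doc-0, doc-1], _1=[doc-2, doc-3]}
    }
}
```

Sorting both levels removes any dependence on the iteration order of the input map, which is one simple way to make the plan reproducible.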

On Sat, Dec 19, 2020, 11:39 AM Adrien Grand wrote:

> Have you considered leveraging Lucene's built-in index sorting? It
> supports concurrent indexing and is quite fast.
>
> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai wrote:
>
>> Hi
>> Our team is seeking a way of construct (or rebuild) a deterministic
>> sorted index concurrently (I know lucene could achieve that in a sequential
>> manner but that might be too slow for us sometimes)
>> Currently we have roughly 2 ideas, all assuming there's a pre-built index
>> and have dumped a doc-segment map so that IndexWriter would be able to be
>> aware of which doc belong to which segment:
>> 1. First build index in the normal way (concurrently), after the index is
>> built, using "addIndexes" functionality to merge documents into the correct
>> segment.
>> 2. By controlling FlushPolicy and other related classes, make sure each
>> segment created (before merge) has only the documents that belong to one of
>> the segments in the pre-built index. And create a dedicated MergePolicy to
>> only merge segments belonging to one pre-built segment.
>>
>> Basically we think first one is easier to implement and second one is
>> faster. Want to seek some ideas & suggestions & feedback here.
>>
>> Thanks
>> Patrick Zhai
>>
>
>
> --
> Adrien
>


Re: Deterministic index construction

2020-12-19 Thread Adrien Grand
Have you considered leveraging Lucene's built-in index sorting? It supports
concurrent indexing and is quite fast.

On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai wrote:

> Hi
> Our team is seeking a way of construct (or rebuild) a deterministic sorted
> index concurrently (I know lucene could achieve that in a sequential manner
> but that might be too slow for us sometimes)
> Currently we have roughly 2 ideas, all assuming there's a pre-built index
> and have dumped a doc-segment map so that IndexWriter would be able to be
> aware of which doc belong to which segment:
> 1. First build index in the normal way (concurrently), after the index is
> built, using "addIndexes" functionality to merge documents into the correct
> segment.
> 2. By controlling FlushPolicy and other related classes, make sure each
> segment created (before merge) has only the documents that belong to one of
> the segments in the pre-built index. And create a dedicated MergePolicy to
> only merge segments belonging to one pre-built segment.
>
> Basically we think first one is easier to implement and second one is
> faster. Want to seek some ideas & suggestions & feedback here.
>
> Thanks
> Patrick Zhai
>


-- 
Adrien