Re: Deterministic index construction
Hi Adrien,

I think Mike's comment is correct: we already have the index sorted, but we want to reconstruct an index with exactly the same number of segments, where each segment contains exactly the same documents.

Mike, addIndexes can take CodecReader as input [1], which I think allows us to pass in a customized FilteredIndexReader. Then it knows which docs to take. Suppose the original index has N segments: we could open N IndexWriters concurrently, rebuild those N segments, and finally merge them back into a whole index somehow. (I am not sure whether we can achieve the last step easily, but it does not sound too hard?)

[1] https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-

On Sat, Dec 19, 2020 at 9:13 AM Michael Sokolov wrote:
> I don't know about addIndexes. Does that let you say which document goes
> where somehow? Wouldn't you have to select a subset of documents from each
> originally indexed segment?
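The "N writers, then combine" plan above can be sketched without Lucene. This is only a model under stated assumptions: a segment is a plain list of doc IDs, one task per target segment stands in for one IndexWriter, natural string order stands in for the index sort, and the class and method names (DeterministicRebuild, rebuild) are hypothetical. It is not Lucene API and says nothing about how the final merge step would work there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DeterministicRebuild {

    /**
     * Rebuilds segments from a doc-to-segment map. Each target segment is built by its
     * own task (a stand-in for one IndexWriter per segment). Results are collected in
     * ascending segment order, so the output does not depend on thread scheduling.
     */
    public static List<List<String>> rebuild(Map<String, Integer> docToSegment, int threads)
            throws Exception {
        // Group doc IDs by target segment; TreeMap iterates in ascending segment order.
        Map<Integer, List<String>> bySegment = new TreeMap<>();
        for (Map.Entry<String, Integer> e : docToSegment.entrySet()) {
            bySegment.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one build task per target segment.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> docs : bySegment.values()) {
                futures.add(pool.submit(() -> {
                    List<String> segment = new ArrayList<>(docs);
                    segment.sort(null); // natural order stands in for the index sort
                    return segment;
                }));
            }
            // Collect in submission order, which is the ascending segment order.
            List<List<String>> index = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                index.add(f.get());
            }
            return index;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> map = Map.of("d1", 0, "d4", 0, "d2", 1, "d3", 1, "d5", 2);
        System.out.println(rebuild(map, 3)); // prints [[d1, d4], [d2, d3], [d5]]
    }
}
```

The point of the sketch is that determinism comes from two places: the doc-segment map fixes which documents each task sees, and the results are assembled by segment number rather than by completion time.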
Re: 8.8 Release
+1

Thanks for volunteering

On Fri, Dec 18, 2020 at 1:41 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
> Sure, Houston. I'll wait another week. Have a good new year and merry
> Christmas!
>
> On Fri, 18 Dec, 2020, 5:58 am Timothy Potter wrote:
>> Great point Houston! +1 on waiting until a week into January
>>
>> On Thu, Dec 17, 2020 at 4:46 PM Houston Putman wrote:
>>> Thanks for volunteering Ishan.
>>>
>>> I think it might be a good idea to wait to cut and release 8.8 at least
>>> a week into January. Many people are going to be away during the holiday
>>> season, and particularly the last week of the year. Pushing into January
>>> just gives more people a chance to look at the release and be involved.
>>>
>>> - Houston
>>>
>>> On Fri, Dec 11, 2020 at 3:26 PM Noble Paul wrote:
>>>> Thanks Ishan for volunteering
>>>>
>>>> On Fri, Dec 11, 2020 at 5:07 AM Christine Poerschke (BLOOMBERG/ LONDON) wrote:
>>>>> With a view towards including it in the release, I'd appreciate code review input on
>>>>> https://github.com/apache/lucene-solr/pull/1992 for
>>>>> https://issues.apache.org/jira/browse/SOLR-14939 (JSON facets: range faceting to support cache=false parameter)
>>>>> if anyone has some time next week perhaps?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Christine
>>>>>
>>>>> From: dev@lucene.apache.org At: 12/10/20 18:01:58
>>>>> To: dev@lucene.apache.org
>>>>> Subject: Re: 8.8 Release
>>>>>
>>>>> +1
>>>>>
>>>>> Joel Bernstein
>>>>> http://joelsolr.blogspot.com/
>>>>>
>>>>> On Thu, Dec 10, 2020 at 11:23 AM David Smiley wrote:
>>>>>> Thanks for volunteering!
>>>>>>
>>>>>> On Thu, Dec 10, 2020 at 11:11 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>>>> Hi Devs,
>>>>>>> There are lots of changes accumulated and some underway. I wish to
>>>>>>> volunteer for an 8.8 release, if there are no objections. I'm planning
>>>>>>> to build the RC in three weeks, i.e. 31 December (and cut the branch
>>>>>>> about 3-4 days before that). Please let me know if someone has any
>>>>>>> concerns.
>>>>>>>
>>>>>>> Thanks and regards,
>>>>>>> Ishan
Re: Deterministic index construction
I don't know about addIndexes. Does that let you say which document goes where somehow? Wouldn't you have to select a subset of documents from each originally indexed segment?

On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov wrote:
> I think the idea is to exert control over the distribution of documents
> among the segments, in a deterministic reproducible way.
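To make the "select a subset of documents from each originally indexed segment" question concrete, here is a minimal sketch of what such a selection could look like. It is a model only: a source segment is a list of doc IDs, the selection is a java.util.BitSet over positions in that segment (loosely analogous to a live-docs mask), and the names SegmentSelection and select are hypothetical. Whether and how a filtered reader passed to addIndexes would expose an equivalent mask is an assumption not confirmed in this thread.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class SegmentSelection {

    /**
     * For one source segment, marks the positions of the documents that belong to the
     * given target segment. Documents missing from the map are never selected.
     */
    public static BitSet select(List<String> sourceSegment,
                                Map<String, Integer> docToSegment,
                                int targetSegment) {
        BitSet keep = new BitSet(sourceSegment.size());
        for (int pos = 0; pos < sourceSegment.size(); pos++) {
            Integer target = docToSegment.get(sourceSegment.get(pos));
            if (target != null && target == targetSegment) {
                keep.set(pos);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        Map<String, Integer> docToSegment = Map.of("d1", 0, "d2", 1, "d3", 1, "d4", 0);
        List<String> source = List.of("d3", "d1", "d4", "d2");
        System.out.println(select(source, docToSegment, 0)); // prints {1, 2}
        System.out.println(select(source, docToSegment, 1)); // prints {0, 3}
    }
}
```

In this model, rebuilding target segment t means applying select(..., t) to every source segment and taking only the marked documents, so each source segment is read once per target segment it contributes to.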
Re: Deterministic index construction
I think the idea is to exert control over the distribution of documents among the segments, in a deterministic, reproducible way.

On Sat, Dec 19, 2020, 11:39 AM Adrien Grand wrote:
> Have you considered leveraging Lucene's built-in index sorting? It
> supports concurrent indexing and is quite fast.
Re: Deterministic index construction
Have you considered leveraging Lucene's built-in index sorting? It supports concurrent indexing and is quite fast.

On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai wrote:
> Hi,
>
> Our team is seeking a way to construct (or rebuild) a deterministic sorted
> index concurrently. (I know Lucene can achieve that sequentially, but that
> might be too slow for us sometimes.)
>
> Currently we have roughly two ideas, both assuming there is a pre-built
> index and that we have dumped a doc-segment map, so that IndexWriter knows
> which doc belongs to which segment:
>
> 1. First build the index in the normal way (concurrently); after the index
>    is built, use the "addIndexes" functionality to merge documents into the
>    correct segment.
> 2. By controlling FlushPolicy and other related classes, make sure each
>    segment created (before merge) has only the documents that belong to one
>    of the segments in the pre-built index. Then create a dedicated
>    MergePolicy that only merges segments belonging to the same pre-built
>    segment.
>
> Basically, we think the first one is easier to implement and the second one
> is faster. We would like ideas, suggestions, and feedback here.
>
> Thanks,
> Patrick Zhai

--
Adrien