Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat merged PR #12744: URL: https://github.com/apache/pinot/pull/12744 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1563397223 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java: ## @@ -181,6 +181,17 @@ private MutableRoaringBitmap getPinotDocIds(IndexSearcher indexSearcher, Mutable return actualDocIDs; } + @Override + public void commit() { +try { + _indexCreator.getIndexWriter().commit(); +} catch (Exception e) { + LOGGER.error("Failed to commit the realtime lucene text index for column {}, exception {}", _column, + e.getMessage()); Review Comment: I meant LOGGER.err("Failed.. .", _column, e); -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
itschrispeck commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558836542 ## pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableIndex.java: ## @@ -60,4 +60,12 @@ public interface MutableIndex extends IndexReader { * @param docId The document id of the given row. A non-negative value. */ void add(@Nonnull Object[] values, @Nullable int[] dictIds, int docId); + + /** + * Commits the mutable index artifacts to disk. This is used in preparation for realtime segment conversion. + * commit() should perform any required actions before using mutable segment artifacts to optimize immutable Review Comment: commit to disk should be the main mechanism for index reuse for any indexes using this path - disk is an intermediary. Other actions refers to actions a MutableIndex implementation might want to perform before writing some artifacts to disk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
itschrispeck commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558835084 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java: ## @@ -90,14 +102,17 @@ public static HashSet getDefaultEnglishStopWordsSet() { * to offline), we close this lucene index writer to release resources but don't commit. * @param config the text index config */ - public LuceneTextIndexCreator(String column, File segmentIndexDir, boolean commit, TextIndexConfig config) { + public LuceneTextIndexCreator(String column, File segmentIndexDir, boolean commit, boolean realtimeConversion, + @Nullable int[] immutableToMutableIdMap, TextIndexConfig config) { _textColumn = column; + +// to reuse the mutable index, it must be (1) not the realtime index, and (2) is realtime segment conversion Review Comment: clarified in the code - `LuceneTextIndexCreator` is used for both in `RealtimeLuceneTextIndex` and separately for the offline lucene text index. We use the `commit` flag (set only for offline index) to avoid trying to convert an index if the instance is created by `RealtimeLuceneTextIndex` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
itschrispeck commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558833921 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java: ## @@ -181,6 +181,17 @@ private MutableRoaringBitmap getPinotDocIds(IndexSearcher indexSearcher, Mutable return actualDocIDs; } + @Override + public void commit() { +try { + _indexCreator.getIndexWriter().commit(); +} catch (Exception e) { + LOGGER.error("Failed to commit the realtime lucene text index for column {}, exception {}", _column, + e.getMessage()); Review Comment: didn't quite follow this, can you elaborate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558152978 ## pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableIndex.java: ## @@ -60,4 +60,12 @@ public interface MutableIndex extends IndexReader { * @param docId The document id of the given row. A non-negative value. */ void add(@Nonnull Object[] values, @Nullable int[] dictIds, int docId); + + /** + * Commits the mutable index artifacts to disk. This is used in preparation for realtime segment conversion. + * commit() should perform any required actions before using mutable segment artifacts to optimize immutable Review Comment: Comments in line 65-66 seem to be self-conflicting: commit to disk is just one possible action? the implementation can actually choose other action for preparation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558146827 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java: ## @@ -90,14 +102,17 @@ public static HashSet getDefaultEnglishStopWordsSet() { * to offline), we close this lucene index writer to release resources but don't commit. * @param config the text index config */ - public LuceneTextIndexCreator(String column, File segmentIndexDir, boolean commit, TextIndexConfig config) { + public LuceneTextIndexCreator(String column, File segmentIndexDir, boolean commit, boolean realtimeConversion, + @Nullable int[] immutableToMutableIdMap, TextIndexConfig config) { _textColumn = column; + +// to reuse the mutable index, it must be (1) not the realtime index, and (2) is realtime segment conversion Review Comment: what does (1) mean exactly? "not the realtime index"? the code says commit? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558134729 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/invertedindex/RealtimeLuceneTextIndex.java: ## @@ -181,6 +181,17 @@ private MutableRoaringBitmap getPinotDocIds(IndexSearcher indexSearcher, Mutable return actualDocIDs; } + @Override + public void commit() { +try { + _indexCreator.getIndexWriter().commit(); +} catch (Exception e) { + LOGGER.error("Failed to commit the realtime lucene text index for column {}, exception {}", _column, + e.getMessage()); Review Comment: why not just log the error e? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558089422 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java: ## @@ -118,6 +118,9 @@ public void build(@Nullable SegmentVersion segmentVersion, ServerMetrics serverM genConfig.setNullHandlingEnabled(_nullHandlingEnabled); genConfig.setSegmentZKPropsConfig(_segmentZKPropsConfig); +// flush any artifacts to disk to improve immutable segment build Review Comment: Flushing to disk is just one kind of commit action, right? See the comments added for the commit() call. I think in general all kind of preparation is possible? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1558086670 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java: ## @@ -932,6 +932,19 @@ public Object getValue(int docId, String column) { } } + /** + * Calls commit() on all mutable indexes. This is used in preparation for realtime segment conversion. + * .commit() can be implemented per index to perform any required actions before using mutable segment + * artifacts to optimize imutable segment build. Review Comment: imutable --> immutable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
wirybeaver commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1552866049 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java: ## @@ -218,6 +224,19 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo LOGGER.debug("tempIndexDir:{}", _tempIndexDir); } + // input int[] can be used to map sortedDocIds[immutableId] = mutableId (based on RecordReader iteration order) + // output int[] should be used to map output[mutableId] = immutableId + private int[] convertSortedDocIds(@Nullable int[] sortedDocIds) { Review Comment: IIUC, if the table have sort index, then the docIds need to be sorted accordingly when generate immutable segment? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1552581249 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java: ## @@ -74,7 +85,8 @@ public static HashSet getDefaultEnglishStopWordsSet() { * @param column column name * @param segmentIndexDir segment index directory * @param commit true if the index should be committed (at the end after all documents have - * been added), false if index should not be committed + * been added), false if index should not be + * @param sortedDocIds sortedDocIds from segment conversion Review Comment: same as my earlier comment on the naming. Also since this is a public method, we need more explanation and some example. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
chenboat commented on code in PR #12744: URL: https://github.com/apache/pinot/pull/12744#discussion_r1552580631 ## pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentIndexCreationDriverImpl.java: ## @@ -218,6 +224,19 @@ public void init(SegmentGeneratorConfig config, SegmentCreationDataSource dataSo LOGGER.debug("tempIndexDir:{}", _tempIndexDir); } + // input int[] can be used to map sortedDocIds[immutableId] = mutableId (based on RecordReader iteration order) + // output int[] should be used to map output[mutableId] = immutableId + private int[] convertSortedDocIds(@Nullable int[] sortedDocIds) { Review Comment: sortedDocIds makes people think the docIds in the array is sorted. In this case it is not -- you probably mean immutableId is sorted as array index. Can we find a better name here with some explanation and example? e.g., immutableIdToMutableId map. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org
Re: [PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
codecov-commenter commented on PR #12744: URL: https://github.com/apache/pinot/pull/12744#issuecomment-2026286837 ## [Codecov](https://app.codecov.io/gh/apache/pinot/pull/12744?dropdown=coverage=pr=h1_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) Report Attention: Patch coverage is `0%` with `100 lines` in your changes are missing coverage. Please review. > Project coverage is 0.00%. Comparing base [(`59551e4`)](https://app.codecov.io/gh/apache/pinot/commit/59551e45224f1535c4863fd577622b37366ccc97?dropdown=coverage=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) to head [(`6ba15ec`)](https://app.codecov.io/gh/apache/pinot/pull/12744?dropdown=coverage=pr=desc_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache). > Report is 184 commits behind head on master. | [Files](https://app.codecov.io/gh/apache/pinot/pull/12744?dropdown=coverage=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | Patch % | Lines | |---|---|---| | [...ment/creator/impl/text/LuceneTextIndexCreator.java](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree=pinot-segment-local%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpinot%2Fsegment%2Flocal%2Fsegment%2Fcreator%2Fimpl%2Ftext%2FLuceneTextIndexCreator.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC90ZXh0L0x1Y2VuZVRleHRJbmRleENyZWF0b3IuamF2YQ==) | 0.00% | [61 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | | [...t/creator/impl/SegmentIndexCreationDriverImpl.java](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree=pinot-segment-local%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpinot%2Fsegment%2Flocal%2Fsegment%2Fcreator%2Fimpl%2FSegmentIndexCreationDriverImpl.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50SW5kZXhDcmVhdGlvbkRyaXZlckltcGwuamF2YQ==) | 0.00% | [12 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | | [...inot/segment/spi/creator/IndexCreationContext.java](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree=pinot-segment-spi%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpinot%2Fsegment%2Fspi%2Fcreator%2FIndexCreationContext.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cGlub3Qtc2VnbWVudC1zcGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL3Bpbm90L3NlZ21lbnQvc3BpL2NyZWF0b3IvSW5kZXhDcmVhdGlvbkNvbnRleHQuamF2YQ==) | 0.00% | [10 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | | [...me/impl/invertedindex/RealtimeLuceneTextIndex.java](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree=pinot-segment-local%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpinot%2Fsegment%2Flocal%2Frealtime%2Fimpl%2Finvertedindex%2FRealtimeLuceneTextIndex.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9yZWFsdGltZS9pbXBsL2ludmVydGVkaW5kZXgvUmVhbHRpbWVMdWNlbmVUZXh0SW5kZXguamF2YQ==) | 0.00% | [7 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | | [...local/indexsegment/mutable/MutableSegmentImpl.java](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree=pinot-segment-local%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fpinot%2Fsegment%2Flocal%2Findexsegment%2Fmutable%2FMutableSegmentImpl.java_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9pbmRleHNlZ21lbnQvbXV0YWJsZS9NdXRhYmxlU2VnbWVudEltcGwuamF2YQ==) | 0.00% | [6 Missing :warning: ](https://app.codecov.io/gh/apache/pinot/pull/12744?src=pr=tree_medium=referral_source=github_content=comment_campaign=pr+comments_term=apache) | |
[PR] Improved segment build time for Lucene text index realtime to offline conversion [pinot]
itschrispeck opened a new pull request, #12744: URL: https://github.com/apache/pinot/pull/12744 **Problem:** We typically see long (7-10min) segment build times when using Lucene index with 1-1.5GB segment sizes. 70-80% of this time is spent building the Lucene text index. **Background:** In the existing implementation the Lucene index stores Pinot docIds: for the mutable segment these are the 'mutable' docIds, for the immutable segment we store each row with its new docId. Lucene queries return the matching Lucene DocIds, and we compute these on the fly for the mutable index, or from a mapping file for the immutable index. **Change Summary:** This change copies the mutable Lucene index during realtime segment conversion to reuse, instead of building a new Lucene index. To handle the potential docId change `sortedDocIds` is added to `IndexCreationContext` to compute a temporary mapping between the mutable docId and the immutable segment's docId. This temporary mapping is used during segment conversion to build the mapping file between the Lucene docId and the new immutable segment's docId. This mapping file is built during segment conversion, instead of during segment load in the traditional path. Internally we've seen roughly 40-60% improvement in overall segment build time. The lower peaks are from a table/tenant with this change, the higher ingestion delay peaks are from an identical table in a tenant without this change: https://github.com/apache/pinot/assets/27231838/0ab23a4c-f7d3-4332-9c5b-e662925c6f9c;> Testing: deployed internally, local testing, validated basic pause/restart/reload operations on a table to ensure no regression in TextIndexHandler index build. tags: ingestion `performance` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org