Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-04-02 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2032358113

   @mikemccand oh dang, I haven't been doing that. Thanks for picking up my 
slack!





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-04-02 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2032311146

   It looks like this awesome change was backported for 9.11.0?  I'll add the 
milestone.  So hard to remember to set the milestones on our issues/PRs...





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-24 Thread via GitHub


dnhatn commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2016724945

   Elasticsearch CI has identified an issue related to this change. 
PerFieldDocValuesFormat and PerFieldPostingsFormat mutate and reset the 
fieldInfos of the mergeState while executing a merge. Consequently, other 
ongoing merges may fail to access some fieldInfos.
   
   
https://github.com/apache/lucene/blob/e2788336d49c56050b3cea5944008bcffe35dcb1/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldDocValuesFormat.java#L150-L158
   
   
https://github.com/apache/lucene/blob/04f335ad79cae019d1a552e41e0029d061d8e3d9/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldPostingsFormat.java#L196-L214





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


benwtrent merged PR #13190:
URL: https://github.com/apache/lucene/pull/13190





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011813580

   > Nightly benchy hardwires the maxMergeCount=16, maxThreadCount=12
   
   Do you remember how you came up with these numbers? Is there some reasoning 
behind them, or do they come from experimentation?
   
   I'm also curious if you know where the current `maxThreadCount = max(1, 
min(4, coreCount / 2))` is coming from. Is the `4` number trying to approximate 
the number of tiers of `TieredMergePolicy` (max seg size = 5GB, 500MB, 50MB, 
5MB ~= 2MB = min seg size) in order to allow one merge on each tier to run 
concurrently?
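   
   (For reference, a minimal sketch of how that current default reads, assuming 
`coreCount` is simply the JVM's reported processor count; this is existing 
behavior being questioned, not code from this PR:)
   
   ```java
   // Sketch of the default formula quoted above: at most 4 merge threads, never
   // more than half the available cores, and always at least 1.
   int coreCount = Runtime.getRuntime().availableProcessors();
   int maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
   ```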





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2011765808

   > Thanks for pushing on this change, I like it. The fact that the extra 
merge concurrency may not starve merging from threads is good. I'm curious how 
the nightly benchmarks will react to it, given the high number of CPU cores of 
beast3.
   
   Nightly benchy is still forcing its own thread pool into the HNSW indexing 
Codec component (`Lucene99Hnsw*VectorsFormat`) -- once this change is merged, 
I'll remove that so we switch to this awesome default approach.  But maybe we 
won't see too much change in the nightly indexing throughput since it's already 
doing intra-merge concurrency "itself".

   > One question on my mind is whether this change should make us update the 
default number of merging threads that `ConcurrentMergeScheduler` configures 
(in a follow-up issue/PR).
   
   +1 to explore this.  Nightly benchy hardwires the maxMergeCount=16, 
maxThreadCount=12.  Maybe they should be higher :)
   
   But also note that the nightly benchy does not wait for merges on 
`IndexWriter.close`, so it's not really a fair test of merge performance.






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-21 Thread via GitHub


jpountz commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1533418454


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -445,8 +456,15 @@ private static String rateToString(double mbPerSec) {
   }
 
   @Override
-  public void close() {
-sync();
+  public void close() throws IOException {
+super.close();

Review Comment:
   nit: should we try to close as cleanly as possible in the event of an 
exception, e.g. doing something like below
   
   ```java
   IOUtils.close(
   this::sync,
   super::close,
   intraMergeExecutor == null ? null : intraMergeExecutor::shutdown);
   ```






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010582889

   @jpountz added comments to the creation of the ratelimiter in CMS and the 
limiter itself. Also updated the close interaction.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010548779

   @benwtrent FYI I left a few minor comments: one on 
`ConcurrentMergeScheduler#close`, and one about adding a comment for the 
limitation you identified, where a merge may not get throttled if the sum of 
bytes written exceeds the rate but none of the individual merge threads for 
that merge exceeds it.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010260555

   If y'all are good with it, I will merge this tomorrow.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


mikemccand commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1532364018


##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(
+last -> {
+  // Time we should sleep until; this is purely instantaneous
+  // rate (just adds seconds onto the last time we had paused to);
+  // maybe we should also offer decayed recent history one?
+  long targetNS = last + (long) (1000000000 * secondsToPause);
+  long curPauseNS = targetNS - curNS;
+  // We don't bother with thread pausing if the pause is smaller than 
2 msec.
+  if (curPauseNS <= MIN_PAUSE_NS) {
+// Set to curNS, not targetNS, to enforce the instant rate, not
+// the "averaged over all history" rate:
+curPauseNSSetter.set(0);
+return curNS;
+  }

Review Comment:
   > > is giving us here, ensuring we accurately account for all bytes written 
by N threads within a single merge
   > 
   > That is not what it is doing. It is ensuring that the throttling 
timestamps stay in sync :/. Bytes being throttled are still per 
RateLimitedIndexOutput, which means they are per thread.
   
   Ahh gotchya.  Each `RateLimitedIndexOutput` tracks its own 
`bytesSinceLastPause` and then invokes `pause` with its own private byte count, 
OK.  So, yes, with the current approach we rate limit MB/sec per thread, not 
per merge.  I think that's fine.  Best effort!






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009561079

   > Maybe we don't need the IO write rate limiter anymore?
   
   Nevermind, I see that you added more details on #13193.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009512368

   > Maybe we don't need the IO write rate limiter anymore?
   
   I'm curious what you have in mind. Are you considering throttling merges 
purely based on merge concurrency then? E.g. slow indexing -> single merge at a 
time, heavy indexing -> many parallel merges? This wouldn't sound crazy to me 
as I don't recall seeing indexing/merging saturate I/O.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009500202

   I re-ran `IndexGeoNames` with the latest commits on this PR. I didn't 
override CMS settings at all & I allowed merging to occur while indexing. This 
was done on my macbook, so the numbers might be a little fuzzy. 
   
   I indexed with 4 threads, and did the batching.
   
   RamBufferSize: 12MB
   Baseline:
   ```
   8221789: 28.378 sec
   forceMerge took 18856 ms
   ```
   
   Candidate:
   ```
   8221789: 27.428 sec
   forceMerge took 6581 ms
   ```
   
   RamBufferSize: 1MB
   
   Baseline:
   ```
   8221789: 49.209 sec
   forceMerge took 17235 ms
   ```
   
   Candidate:
   ```
   8221789: 46.078 sec
   forceMerge took 4638 ms
   ```





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009418841

   > Selfishly, I wish for the second option as it's the simplest.
   
   +1 to keep with the second approach.  It is best effort, and the "instant 
vs burst-bucket" bug above is yet more weirdness about it :)
   
   Another thought on the "instant vs burst-bucket" bug: I suppose in some 
cases, e.g. bulk merging of stored fields (no deletions, so we are just copying 
byte blocks), there is very little CPU and tons of IO and in cases like that, 
the instant vs burst-bucket approaches would be essentially the same (both wind 
up throttling based on the instant rate since the burst bucket is quickly 
exhausted).  I'll open a spinoff ...





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009457766

   > How do we control the risk that a massive merge with KNN vectors soaks up 
all available concurrency from the shared Executor for intra-merge concurrency 
(all threads doing HNSW merging) and then starves smaller merges that would 
finish quickly?
   
   Intra-merge threads do not count against the `maxThreadCount` and are not 
tracked in `mergeThreads`. Additionally, the intra-merge executor only ever 
allows up to `maxThreadCount - mergeThreads.size() - 1` threads to run on its 
pool.
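   
   (A minimal sketch of that cap as a hypothetical helper, not the PR's exact 
code; the clamp at zero is an added assumption:)
   
   ```java
   // Intra-merge work may only use merge-thread slots that top-level merges are
   // not already occupying, always leaving one slot free.
   static int intraMergeSlots(int maxThreadCount, int activeMergeThreads) {
     return Math.max(0, maxThreadCount - activeMergeThreads - 1);
   }
   ```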
   
   With this change, it is possible that a delegated chunk of work starves an 
intra-merge thread and counts against that parallelism, but it wouldn't block 
other `mergeThreads` from running. Those other merge threads might just have to 
run their work on their own thread instead of on the intra-merge thread pool.
   
   I do think there is some future work here to unify thread tracking across 
all the inter-merge threads, but all the custom logic around how `mergeThreads` 
are created and tracked just seemed like too much to figure out in addition to 
adding intra-merge parallelism.
   
   > Maybe we don't need the IO write rate limiter anymore?
   
   I honestly don't know. There is still logic in the CMS that determines if we 
are on a spinning disk vs. SSD, which is sort of crazy to me :D.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009439976

   I opened https://github.com/apache/lucene/issues/13193 specifically about 
the "instant vs burst bucket" IO rate limiting approaches.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


mikemccand commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009413290

   Phew, it took me two hours of strongly caffeinated time to catch up on this! 
 Thank you @benwtrent.
   
   What an awesome change, modernizing Lucene's merge concurrency so that 
intra-merge (just HNSW, but maybe other index parts soon) concurrency is 
enabled by default.
   
   (Separately, the nightly benchy had a silly off-by-one bug preventing it 
from resuming after you reverted the previous PR ... I think I've fixed that 
bug now, and kicked off a one-off nightly benchy run that will hopefully finish 
late today).
   
   Some high level questions (sorry if these were already asked/answered):
 * How do we control the risk that a massive merge with KNN vectors soaks 
up all available concurrency from the shared Executor for intra-merge 
concurrency (all threads doing HNSW merging) and then starves smaller merges 
that would finish quickly?
 * Maybe we don't need the IO write rate limiter anymore?  It's a tricky 
setting because the `mbPerSec` you set is then multiplied by the number of 
concurrent merges that are running.  It is a per-merge setting, not a global 
setting, so it's kinda trappy today.
   
   Finally, I worry that the rate-limiter's simplistic "instantaneous" measure 
is making it effectively super buggy (there is a TODO about this), because 
merging that does a lot of CPU work and then writes a lot of bytes will be 
effectively throttled to far below the target `mbPerSec` in aggregate.  A 
better solution might be something like the "burst IOPs bucket" that AWS (and 
likely other cloud providers) offer ("Burst IOPS is a feature of Amazon Web 
Services (AWS) EBS volume types that allows applications to store unused IOPS 
in a burst bucket and then drain them when needed" -- thank you Gemini for the 
summary).  This is clearly not a blocker for this issue, but we really should 
(separately) fix it.  I'll open a follow-on issue about this ... but to me it's 
another reason to maybe remove this feature entirely.
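   
   (To make the "burst bucket" idea concrete, here is a purely illustrative 
token-bucket sketch, not Lucene code and not a concrete proposal: unused write 
budget accrues up to a burst limit and can be drained later, instead of 
enforcing only the instantaneous rate.)
   
   ```java
   // Illustrative token-bucket sketch: callers may write freely while the bucket
   // has budget; once it is exhausted they pause until the deficit is refilled
   // at the configured steady rate.
   class BurstBucketRateLimiter {
     private final double bytesPerSec;
     private final double maxBurstBytes;
     private double availableBytes;
     private long lastRefillNS;

     BurstBucketRateLimiter(double bytesPerSec, double maxBurstBytes) {
       this.bytesPerSec = bytesPerSec;
       this.maxBurstBytes = maxBurstBytes;
       this.availableBytes = maxBurstBytes; // start with a full bucket
       this.lastRefillNS = System.nanoTime();
     }

     /** Returns how many nanoseconds the caller should pause before writing bytes. */
     synchronized long pauseNanosFor(long bytes) {
       long now = System.nanoTime();
       double refilled = (now - lastRefillNS) / 1e9 * bytesPerSec;
       availableBytes = Math.min(maxBurstBytes, availableBytes + refilled);
       lastRefillNS = now;
       availableBytes -= bytes;
       if (availableBytes >= 0) {
         return 0; // still within the burst budget: no pause
       }
       // Pause long enough for the steady rate to pay back the deficit.
       return (long) (-availableBytes / bytesPerSec * 1e9);
     }
   }
   ```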





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


mikemccand commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1531945845


##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(
+last -> {
+  // Time we should sleep until; this is purely instantaneous
+  // rate (just adds seconds onto the last time we had paused to);
+  // maybe we should also offer decayed recent history one?
+  long targetNS = last + (long) (1000000000 * secondsToPause);
+  long curPauseNS = targetNS - curNS;
+  // We don't bother with thread pausing if the pause is smaller than 
2 msec.
+  if (curPauseNS <= MIN_PAUSE_NS) {
+// Set to curNS, not targetNS, to enforce the instant rate, not
+// the "averaged over all history" rate:
+curPauseNSSetter.set(0);
+return curNS;
+  }

Review Comment:
   Sorry I'm trying to catch up here and will probably ask a bunch of stupid 
questions :)  Thank you @benwtrent for persisting in this hairy logic!
   
   > (I'm considering making `pause` synchronized rather than `maybePause` so 
that `System.nanoTime()` is computed within the lock and the pausing logic 
accounts for the fact that some time may have been spent waiting on the lock 
already.)
   
   Couldn't we move the `curNS = System.nanoTime()` inside the 
locked/updateAndGet'd part of `maybePause`?  I like the thread safety that the 
`updateAndGet` is giving us here, ensuring we accurately account for all bytes 
written by N threads within a single merge.  Also, I don't expect the added 
sync required here will hurt performance much: Lucene is doing much work to 
produce these bytes being written already, so conflict should be rare-ish.
   
   Also, if we make `pause` sync'd, it will cause scary looking `jstack` thread 
dumps?  Making it look like one thread is sleeping while holding a lock and 
blocking other threads (which indeed is what it'd be doing).  Versus the thread 
stacks we'd see w/ the current approach that make it quite clear that all N 
threads are intentionally stalling in `mergeProgress.pauseNanos`?  It would 
reduce the horror reaction we'd see peeking at thread dumps maybe ...






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1531964164


##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(
+last -> {
+  // Time we should sleep until; this is purely instantaneous
+  // rate (just adds seconds onto the last time we had paused to);
+  // maybe we should also offer decayed recent history one?
+  long targetNS = last + (long) (1000000000 * secondsToPause);
+  long curPauseNS = targetNS - curNS;
+  // We don't bother with thread pausing if the pause is smaller than 
2 msec.
+  if (curPauseNS <= MIN_PAUSE_NS) {
+// Set to curNS, not targetNS, to enforce the instant rate, not
+// the "averaged over all history" rate:
+curPauseNSSetter.set(0);
+return curNS;
+  }

Review Comment:
   > Couldn't we move the `curNS = System.nanoTime()` inside the 
locked/updateAndGet'd part of maybePause? 
   
   Yes, we could, otherwise if there are conflicts between threads (thus having 
to do the `updateAndGet` more than once), the `curNS` gets further in the past.
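   
   A rough sketch of that tweak (a hypothetical standalone variant of 
`maybePause`, parameterized just for illustration; it only needs 
`java.util.concurrent.atomic.AtomicLong`):
   
   ```java
   // Hypothetical variant: "now" is read inside the atomic update, so if
   // updateAndGet retries under contention each attempt uses a fresh timestamp.
   static long maybePauseNS(AtomicLong lastNS, long minPauseNS, double secondsToPause) {
     AtomicLong curPauseNSSetter = new AtomicLong();
     lastNS.updateAndGet(
         last -> {
           long curNS = System.nanoTime(); // computed inside the update, not before it
           long targetNS = last + (long) (1000000000 * secondsToPause);
           long curPauseNS = targetNS - curNS;
           if (curPauseNS <= minPauseNS) {
             curPauseNSSetter.set(0);
             return curNS; // enforce the instant rate
           }
           curPauseNSSetter.set(curPauseNS);
           return targetNS;
         });
     return curPauseNSSetter.get();
   }
   ```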
   
   > is giving us here, ensuring we accurately account for all bytes written by 
N threads within a single merge
   
   That is not what it is doing. It is ensuring that the throttling timestamps 
stay in sync :/. Bytes being throttled are still per RateLimitedIndexOutput, 
which means they are per thread.






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009404137

   @mikemccand @jpountz 
   
   So, thinking about this more as I fell asleep.
   
   This is how throttling will work as it is in this PR:
   
- Throttling is per thread. Meaning, an intra-merge thread only gets 
throttled once the bytes it has written reach the configured rate limit
- Consequently, we may write up to `rateLimitBytes*numIntraMergeIO` bytes 
before a single thread gets throttled.
   
   This makes us throttle less often given how many bytes are actually written. 
Maybe this is OK as the throttling logic could "catch up" if things continue to 
get backed up? Throttling has always been a "best effort" thing anyways.
   
   Even if we somehow made the RateLimiter throttle every thread (via some 
global semaphore or something...), we would still only throttle once one of the 
multiple threads hits the byte throttling limit.
   
   IMO, either we: 
   
- account for bytes globally (per directory) & throttle globally (global 
lock that pauses all threads)
- accept that throttling is per thread and that bytes written are measured 
per individual thread.
   
   Selfishly, I wish for the second option as it's the simplest. 





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub


jpountz commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1531735553


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -446,7 +454,13 @@ private static String rateToString(double mbPerSec) {
 
   @Override
   public void close() {
-sync();
+try {

Review Comment:
   should we call `super.close()` as well, to close `MergeScheduler`'s 
`SameThreadExecutorService` which `ConcurrentMergeScheduler` uses for small 
merges?
   
   It's not a big deal not to close it, but it would help us catch it if we 
ever send tasks to this executor after closing.



##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(

Review Comment:
   >  the output isn't really throttled until a single thread exceeds the limit.
   
   This limitation feels ok to me, let's just add a comment about it? 
Intuitively, the write rate at merge time is rather bursty, so if the sum of 
the bytes written by all threads running this merge exceeds the limit, then 
there would often be one thread that exceeds the limit on its own as well.
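   
   (A possible wording for that comment, just a suggestion:)
   
   ```java
   // NOTE: with intra-merge parallelism, bytes are accounted per
   // RateLimitedIndexOutput, i.e. effectively per thread. The merge as a whole
   // may briefly exceed the configured rate before any single thread crosses the
   // limit and pauses; rate limiting remains best-effort.
   ```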



##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(
+last -> {
+  // Time we should sleep until; this is purely instantaneous
+  // rate (just adds seconds onto the last time we had paused to);
+  // maybe we should also offer decayed recent history one?
+  long targetNS = last + (long) (1000000000 * secondsToPause);
+  long curPauseNS = targetNS - curNS;
+  // We don't bother with thread pausing if the pause is smaller than 
2 msec.
+  if (curPauseNS <= MIN_PAUSE_NS) {
+// Set to curNS, not targetNS, to enforce the instant rate, not
+// the "averaged over all history" rate:
+curPauseNSSetter.set(0);
+return curNS;
+  }

Review Comment:
   It's hard for me to reason about this as well.
   
   I'm wondering about keeping `maybePause` as-is and making 
`MergeRateLimiter#pause` synchronized, essentially trying to make the pausing 
logic behave as if threads were writing bytes sequentially rather than in 
parallel. (I'm considering making `pause` synchronized rather than `maybePause` 
so that `System.nanoTime()` is computed within the lock and the pausing logic 
accounts for the fact that some time may have been spent waiting on the lock 
already.)






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


benwtrent commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1531018293


##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(

Review Comment:
   So, the limits are on a per-`RateLimitedIndexOutput` basis, not necessarily 
per thread. Since all `RateLimitedIndexOutput` constructed share the same 
rate-limiter, it might pause all threads?
   
   But I don't think so: each thread could have its own 
`RateLimitedIndexOutput` that tracks its own bytes locally, and while the 
combined total across all threads might exceed the throttling limit, the output 
isn't really throttled until a single thread exceeds the limit. 






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


benwtrent commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1531011510


##
lucene/core/src/java/org/apache/lucene/index/MergePolicy.java:
##
@@ -136,13 +136,13 @@ public boolean isAborted() {
  */
 public void pauseNanos(long pauseNanos, PauseReason reason, 
BooleanSupplier condition)
 throws InterruptedException {
-  if (Thread.currentThread() != owner) {
+  /*  if (Thread.currentThread() != owner) {
 throw new RuntimeException(
 "Only the merge owner thread can call pauseNanos(). This thread: "
 + Thread.currentThread().getName()
 + ", owner thread: "
 + owner);
-  }
+  }*/

Review Comment:
   > I guess not, but now we need to make sure that the rate is applied across 
all threads rather than per-thread?
   
   Looking at the code, it seems like it was previously assumed that only one 
`createOutput` was called at a time for a merge. Now, it could be that more 
than one is called. 
   
   I guess this means we need a `RateLimitingDirectory` that passes a shared 
rate limiter in to each `RateLimitedIndexOutput`, to ensure global rate 
limiting is controlled across all outputs.  Is this what you are talking about?
   
   I am not sure that `pauseNanos` should directly know about other threads and 
pause them. It seems better to handle this at the directory level.
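   
   A rough sketch of that hypothetical `RateLimitingDirectory` (not an existing 
Lucene class; it assumes the public `RateLimitedIndexOutput(RateLimiter, 
IndexOutput)` wrapper):
   
   ```java
   import java.io.IOException;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FilterDirectory;
   import org.apache.lucene.store.IOContext;
   import org.apache.lucene.store.IndexOutput;
   import org.apache.lucene.store.RateLimitedIndexOutput;
   import org.apache.lucene.store.RateLimiter;

   // Hypothetical wrapper matching the idea above: every output created through
   // this directory reports to the same shared RateLimiter.
   class RateLimitingDirectory extends FilterDirectory {
     private final RateLimiter rateLimiter;

     RateLimitingDirectory(Directory in, RateLimiter rateLimiter) {
       super(in);
       this.rateLimiter = rateLimiter;
     }

     @Override
     public IndexOutput createOutput(String name, IOContext context) throws IOException {
       return new RateLimitedIndexOutput(rateLimiter, super.createOutput(name, context));
     }
   }
   ```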






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


benwtrent commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1530907881


##
lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java:
##
@@ -28,30 +30,30 @@ public final class RateLimitedIndexOutput extends 
FilterIndexOutput {
   private final RateLimiter rateLimiter;
 
   /** How many bytes we've written since we last called rateLimiter.pause. */
-  private long bytesSinceLastPause;
+  private final AtomicLong bytesSinceLastPause = new AtomicLong(0);

Review Comment:
   Ah, yeah, I could remove all the concurrency checks here. I was just 
concerned because the above assertions tripped due to the extra threads. But 
you are correct, they are still ultimately only written to via one thread.






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


jpountz commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1530384809


##
lucene/core/src/java/org/apache/lucene/index/MergePolicy.java:
##
@@ -136,13 +136,13 @@ public boolean isAborted() {
  */
 public void pauseNanos(long pauseNanos, PauseReason reason, 
BooleanSupplier condition)
 throws InterruptedException {
-  if (Thread.currentThread() != owner) {
+  /*  if (Thread.currentThread() != owner) {
 throw new RuntimeException(
 "Only the merge owner thread can call pauseNanos(). This thread: "
 + Thread.currentThread().getName()
 + ", owner thread: "
 + owner);
-  }
+  }*/

Review Comment:
   I guess not, but now we need to make sure that the rate is applied across 
all threads rather than per-thread?



##
lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java:
##
@@ -28,30 +30,30 @@ public final class RateLimitedIndexOutput extends 
FilterIndexOutput {
   private final RateLimiter rateLimiter;
 
   /** How many bytes we've written since we last called rateLimiter.pause. */
-  private long bytesSinceLastPause;
+  private final AtomicLong bytesSinceLastPause = new AtomicLong(0);

Review Comment:
   I'm still unclear why this needs to become an `AtomicLong`: if 
`bytesSinceLastPause` could be updated/read concurrently, then this would imply 
that multiple `writeXXX` methods would also be called concurrently, which 
doesn't make much sense as bytes would be written to disk in a 
non-deterministic order?



##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(

Review Comment:
   If we used a lock for this whole method, then this may help pause all 
threads that share the same rate limiter, and effectively apply the rate limit 
across all threads that run the same merge, rather than applying the rate limit 
on a per-thread basis?






Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2007100589

   @jpountz 
   >  the caller needs to manage synchronization themselves anyway if they want 
bytes to be written in the correct order?
   
   The MT safety isn't really around the bytes. I agree with your assessment 
there. But it's more around the throttling. The way I read the class is that 
different threads could be writing to different files but both could experience 
throttling and we should account for that.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-19 Thread via GitHub


jpountz commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2006570648

   I understand that we need to make changes to account for the fact that 
multiple threads may be contributing to the same merge concurrently, but I 
would not expect `RateLimitedIndexOutput` to need to be thread-safe: the caller 
needs to manage synchronization themselves anyway if they want bytes to be 
written in the correct order?





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-18 Thread via GitHub


benwtrent commented on code in PR #13190:
URL: https://github.com/apache/lucene/pull/13190#discussion_r1528642636


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -281,11 +297,11 @@ public IndexOutput createOutput(String name, IOContext 
context) throws IOExcepti
 
 // Because rateLimiter is bound to a particular merge thread, this 
method should
 // always be called from that context. Verify this.
-assert mergeThread == Thread.currentThread()
-: "Not the same merge thread, current="
-+ Thread.currentThread()
-+ ", expected="
-+ mergeThread;
+/*assert mergeThread == Thread.currentThread()
+: "Not the same merge thread, current="
++ Thread.currentThread()
++ ", expected="
++ mergeThread;*/

Review Comment:
   I don't know about this; it seems like a paranoid check that may not be 
necessary if RLIO were thread-safe?



##
lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java:
##
@@ -28,30 +30,30 @@ public final class RateLimitedIndexOutput extends 
FilterIndexOutput {
   private final RateLimiter rateLimiter;
 
   /** How many bytes we've written since we last called rateLimiter.pause. */
-  private long bytesSinceLastPause;
+  private final AtomicLong bytesSinceLastPause = new AtomicLong(0);

Review Comment:
   This obviously creates some overhead now as RLIO could be accessed via 
multiple threads.  I will likely have to benchmark to see the impact.



##
lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java:
##
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws 
MergePolicy.MergeAbortedE
   throw new MergePolicy.MergeAbortedException("Merge aborted.");
 }
 
-double rate = mbPerSec; // read from volatile rate once.
-double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-// Time we should sleep until; this is purely instantaneous
-// rate (just adds seconds onto the last time we had paused to);
-// maybe we should also offer decayed recent history one?
-long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-long curPauseNS = targetNS - curNS;
-
-// We don't bother with thread pausing if the pause is smaller than 2 msec.
-if (curPauseNS <= MIN_PAUSE_NS) {
-  // Set to curNS, not targetNS, to enforce the instant rate, not
-  // the "averaged over all history" rate:
-  lastNS = curNS;
+final double rate = mbPerSec; // read from volatile rate once.
+final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+AtomicLong curPauseNSSetter = new AtomicLong();
+lastNS.updateAndGet(

Review Comment:
   We would need to lock here or do an `updateAndGet`. `lastNS` is only updated 
in this particular method, so using atomics seemed better to me.



##
lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java:
##
@@ -62,30 +64,42 @@ public void writeBytes(byte[] b, int offset, int length) 
throws IOException {
 
   @Override
   public void writeInt(int i) throws IOException {
-bytesSinceLastPause += Integer.BYTES;
+bytesSinceLastPause.addAndGet(Integer.BYTES);
 checkRate();
 out.writeInt(i);
   }
 
   @Override
   public void writeShort(short i) throws IOException {
-bytesSinceLastPause += Short.BYTES;
+bytesSinceLastPause.addAndGet(Short.BYTES);
 checkRate();
 out.writeShort(i);
   }
 
   @Override
   public void writeLong(long i) throws IOException {
-bytesSinceLastPause += Long.BYTES;
+bytesSinceLastPause.addAndGet(Long.BYTES);
 checkRate();
 out.writeLong(i);
   }
 
   private void checkRate() throws IOException {
-if (bytesSinceLastPause > currentMinPauseCheckBytes) {
-  rateLimiter.pause(bytesSinceLastPause);
-  bytesSinceLastPause = 0;
-  currentMinPauseCheckBytes = rateLimiter.getMinPauseCheckBytes();
+AtomicLong localBytesSinceLastPause = new AtomicLong(0);
+AtomicBoolean shouldPause = new AtomicBoolean(false);
+bytesSinceLastPause.updateAndGet(
+bytes -> {
+  if (bytes > currentMinPauseCheckBytes.get()) {
+shouldPause.set(true);
+currentMinPauseCheckBytes.set(rateLimiter.getMinPauseCheckBytes());
+localBytesSinceLastPause.set(bytes);
+return 0;
+  } else {
+shouldPause.set(false);
+  }

Review Comment:
   I figured if another thread reached this throttling and is paused & reset 
the counter, then this current thread shouldn't pause. Throttle tracking is 
effectively global and each thread uses the same localBytesSinceLastPause, so 
the work done by another thread could cause a current thread to pause. Or 
another thread pausing could allow this thread to continue work.




Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-18 Thread via GitHub


benwtrent commented on PR #13190:
URL: https://github.com/apache/lucene/pull/13190#issuecomment-2004024374

   @dweiss @mikemccand I am currently iterating on how to best make 
`RateLimitedIndexOutput`, `MergePolicy`, and `MergeRateLimiter` thread safe.
   
   Right now, it is all assumed that the interactions with all of these happen 
on the same thread; this obviously breaks when we add intra-merge parallelism.
   
   I was hoping y'all had some thoughts on how these should all work together 
with intra-merge parallelism? 
   
   Is it enough to make these classes threadsafe and remove assertions? 
   
   Do we want to somehow figure out if the ultimate calling thread was a 
MergeThread? (This is possible, but will require some wrangling on the 
tp-executor to keep track of which thread belongs where...)
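   
   (One purely hypothetical way to do that wrangling, sketched below: tag pooled 
threads with the merge they are currently serving via a `ThreadLocal` set 
around each delegated task.)
   
   ```java
   import java.util.concurrent.Executor;
   import org.apache.lucene.index.MergePolicy;

   // Hypothetical sketch: record which top-level merge a pooled thread is working
   // for, so downstream code can ask "which merge does this thread belong to?".
   final class MergeTaskTagger {
     static final ThreadLocal<MergePolicy.OneMerge> CURRENT_MERGE = new ThreadLocal<>();

     static void execute(Executor executor, MergePolicy.OneMerge merge, Runnable task) {
       executor.execute(() -> {
         CURRENT_MERGE.set(merge);
         try {
           task.run();
         } finally {
           CURRENT_MERGE.remove();
         }
       });
     }
   }
   ```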





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-18 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-2003789175

   I am going to revert the change and open a new PR to iterate on a fix. 
`RateLimitedIndexOutput` isn't threadsafe and our rate limiting assumes a 
single thread. 
   
   With this commit, more than one thread could be writing and that needs to be 
accounted for.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-16 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-2002000206

   Ah, that is likely due to this change. I bet throttling is occurring but in 
one of the other threadpool threads. 
   
   I didn't override the thread construction, so maybe that is enough? 
   
   I can look more closely next week.





Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-16 Thread via GitHub


mikemccand commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-2001975632

   This looks like a great change! ... Lucene finally catching up to modern 
concurrency :)  I have not looked closely yet.
   
   But nightly benchy is upset, prolly related to this, when running 
`runFacetsBenchmarks.py`.  I haven't looked closely yet, and will be mostly offline 
the next couple of days (sorry for the hit!), but here's the exception that 
popped out:
   
   ```
   Exception in thread "main" org.apache.lucene.store.AlreadyClosedException: 
this IndexWriter is closed
   at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:913)
   at 
org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:926)
   at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:4070)
   at perf.facets.IndexFacets.main(IndexFacets.java:127)
   Caused by: java.lang.RuntimeException: Only the merge owner thread can call 
pauseNanos(). This thread: pool-1-thread-1, owner thread: Thread[#290,Lucene 
Merge Thread #12,5,main]
   at 
org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:142)
   at 
org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:147)
   at 
org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:92)
   at 
org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:86)
   at 
org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:55)
   at 
org.apache.lucene.store.ByteBuffersDataOutput.copyTo(ByteBuffersDataOutput.java:339)
   at 
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlock(Lucene90BlockTreeTermsWriter.java:1022)
   at 
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks(Lucene90BlockTreeTermsWriter.java:760)
   at 
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.pushTerm(Lucene90BlockTreeTermsWriter.java:1133)
   at 
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1089)
   at 
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:399)
   at 
org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
   at 
org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
   at 
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:236)
   at 
org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:325)
   at 
org.apache.lucene.index.SegmentMerger.lambda$merge$0(SegmentMerger.java:147)
   at 
org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler$CachedExecutor.lambda$execute$0(ConcurrentMergeScheduler.java:978)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   at java.base/java.lang.Thread.run(Thread.java:1583)
   Suppressed: java.lang.RuntimeException: Only the merge owner thread 
can call pauseNanos(). This thread: pool-1-thread-2, owner thread: 
Thread[#290,Lucene Merge Thread #12,5,main]
   at 
org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:142)
   at 
org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:147)
   at 
org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:92)
   at 
org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:86)
   at 
org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:55)
   at 
org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:54)
   at 
org.apache.lucene.util.packed.DirectWriter.flush(DirectWriter.java:97)
   at 
org.apache.lucene.util.packed.DirectWriter.add(DirectWriter.java:83)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.writeValuesSingleBlock(Lucene90DocValuesConsumer.java:349)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.writeValues(Lucene90DocValuesConsumer.java:328)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.doAddSortedNumericField(Lucene90DocValuesConsumer.java:707)
   at 
   ```

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


benwtrent merged PR #13124:
URL: https://github.com/apache/lucene/pull/13124


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525307991


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -902,12 +932,52 @@ private static String getSegmentName(MergePolicy.OneMerge 
merge) {
   }
 
   static {
-TestSecrets.setConcurrentMergeSchedulerAccess(
-new ConcurrentMergeSchedulerAccess() {
-  @Override
-  public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-cms.setSuppressExceptions();
-  }
-});
+
TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class CachedExecutor implements Executor {

Review Comment:
   Can we have some simple javadoc explaining what this executor is doing? E.g. 
it will use a spare thread if we are within the configured max thread count, and 
will run directly on the caller thread if no extra thread is available?
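   
   For illustration, that javadoc could read roughly as follows (a hedged sketch 
of wording based on the behavior discussed in this thread, not the text that 
ended up in the PR):
   
   ```java
   /**
    * An {@code Executor} used for intra-merge parallelism. If the merge is large
    * enough and the number of currently running merge threads leaves headroom
    * below {@code maxThreadCount}, the task is handed to a cached pool thread;
    * otherwise the task runs directly on the calling merge thread.
    */
   ```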



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525270702


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List<Callable<Void>> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   Presumably, the worst-case scenario is not much worse than if you have many 
concurrent merges, so I think it's fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525266457


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List<Callable<Void>> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   OK, I added a commit that does this. The candidate merge time is now:
   
   `forceMerge took 4965 ms`
   vs. baseline (i re-ran to make sure)
   `forceMerge took 15783 ms`
   
   So, 3x faster? Seems pretty good. The major downside is that now memory 
usage on merge might increase as we are potentially doing all this merge 
activity at the same time. If we are cool with that, this seems like a nice 
speed improvement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-14 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1525217090


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -130,19 +135,31 @@ MergeState merge() throws IOException {
 IOContext.READ,
 segmentWriteState.segmentSuffix);
 
+TaskExecutor taskExecutor = new 
TaskExecutor(mergeState.intraMergeTaskExecutor);
+List<Callable<Void>> mergingTasks = new ArrayList<>();
+
 if (mergeState.mergeFieldInfos.hasNorms()) {
   mergeWithLogging(this::mergeNorms, segmentWriteState, segmentReadState, 
"norms", numMerged);
 }
 
 mergeWithLogging(this::mergeTerms, segmentWriteState, segmentReadState, 
"postings", numMerged);

Review Comment:
   Can we have a parallel task that handles norms + terms so that the order is 
respected?



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -902,12 +932,57 @@ private static String getSegmentName(MergePolicy.OneMerge 
merge) {
   }
 
   static {
-TestSecrets.setConcurrentMergeSchedulerAccess(
-new ConcurrentMergeSchedulerAccess() {
-  @Override
-  public void setSuppressExceptions(ConcurrentMergeScheduler cms) {
-cms.setSuppressExceptions();
+
TestSecrets.setConcurrentMergeSchedulerAccess(ConcurrentMergeScheduler::setSuppressExceptions);
+  }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(0, 1024, 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+@Override
+public void execute(Runnable command) {
+  assert mergeThreads.contains(Thread.currentThread()) : "caller is not a 
merge thread";

Review Comment:
   I'd expect this assertion to no longer be valid, since SegmentMerger may 
fork into this executor for vectors, and then the task for vectors may want to 
further fork into a separate thread? So the caller may be either a MergeThread, 
or a thread from the wrapper thread pool?
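   
   A hedged sketch of how the assertion could be relaxed to allow both cases; 
`intraMergeThreads` is a hypothetical set maintained by the executor wrapper, 
not existing code:
   
   ```java
   // Illustrative only: intraMergeThreads would be a Set<Thread> maintained by
   // the executor wrapper as its pool threads pick up and finish intra-merge tasks.
   private void assertMayRunMergeTask(Set<Thread> intraMergeThreads) {
     Thread caller = Thread.currentThread();
     assert caller instanceof MergeThread || intraMergeThreads.contains(caller)
         : "caller is neither a merge thread nor an intra-merge worker: " + caller;
   }
   ```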



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,55 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {

Review Comment:
   nit: should it be called `CachedExecutor` now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1995183835

   @jpountz ok, that naive attempt failed as norms & terms apparently need to 
be merged in order (one or the other would fail due to missing files...). 
   
   I am not sure if this is true with other things. But when I adjusted for 
this (latest commit) I did see a speed improvement.
   
   My test was over IndexGeoNames, I did no segment merging and flushed every 
12MB to ensure I got many segments. I also set CMS merge threads to `8` 
threads. 
   
   Baseline forcemerge(1): 17793 ms
   Candidates forcemerge(1): 13739 ms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1523620861


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+super("Lucene99HnswVectorsFormat");
+if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+  throw new IllegalArgumentException(
+  "maxConn must be positive and less than or equal to "
+  + MAXIMUM_MAX_CONN
+  + "; maxConn="
+  + maxConn);
+}
+if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+  throw new IllegalArgumentException(
+  "beamWidth must be positive and less than or equal to "
+  + MAXIMUM_BEAM_WIDTH
+  + "; beamWidth="
+  + beamWidth);
+}
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
+this.mergeExec = null;
+this.numMergeWorkers = 1;

Review Comment:
   Yeah I agree, so for now we'll just fix it at 8?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994921965

   > Yes exactly, something very simple, mostly to exercise intra-merge 
concurrency with more than just vectors.
   
   The latest commit adds `TaskExecutor` tasks to the merge so that the different 
field types can be merged in parallel. 
   
   I needed to use `TaskExecutor` as it abstracts out the exception handling 
(all these methods throw `IOException`).
   
   I still need to benchmark it with Lucene-util
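   
   Roughly, the shape is the following; a hedged sketch of the approach described 
above (the method names follow SegmentMerger, but this is not the exact code 
from the commit):
   
   ```java
   // Sketch of the body of SegmentMerger#merge with per-field-type tasks.
   TaskExecutor taskExecutor = new TaskExecutor(mergeState.intraMergeTaskExecutor);
   List<Callable<Void>> mergingTasks = new ArrayList<>();
   
   // norms and terms stay in one task because they must be merged in order
   mergingTasks.add(() -> {
     mergeNorms(segmentWriteState, segmentReadState);
     mergeTerms(segmentWriteState, segmentReadState);
     return null;
   });
   mergingTasks.add(() -> {
     mergeDocValues(segmentWriteState, segmentReadState);
     return null;
   });
   mergingTasks.add(() -> {
     mergeVectorValues(segmentWriteState, segmentReadState);
     return null;
   });
   
   // runs the tasks on the intra-merge executor and rethrows any IOException
   taskExecutor.invokeAll(mergingTasks);
   ```
   
   The `TaskExecutor.invokeAll` call is what surfaces the per-task `IOException`s 
back on the merge thread.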


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994458100

   Yes exactly, something very simple, mostly to exercise intra-merge 
concurrency with more than just vectors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994399018

   @jpountz would you prefer something like the original patch from @dweiss? I 
can submit the merging actions independently to the intra-merge executor.
   
   Anything more (like figuring out how to make the postings merge truly parallel) 
will be quite a lift for me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994373820

   Thanks for sharing performance numbers @benwtrent, very interesting. Also, 
double-checking that you saw my comment above: 
https://github.com/apache/lucene/pull/13124#pullrequestreview-1930418114. I'd 
have more confidence about the way merge concurrency is exposed if we took 
advantage of it in more than one place. If there's some difficulty with it, I'm 
happy to look into it in a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994372130

   @zhaih @jpountz I am going to create a separate issue around making HNSW 
worker slicing automatic. It will require a bunch of its own benchmarking and 
work, and honestly seems orthogonal to this particular change, as HNSW merging 
still works fine when statically setting the number of workers & using the CMS 
thread pool.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1523142360


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+super("Lucene99HnswVectorsFormat");
+if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+  throw new IllegalArgumentException(
+  "maxConn must be positive and less than or equal to "
+  + MAXIMUM_MAX_CONN
+  + "; maxConn="
+  + maxConn);
+}
+if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+  throw new IllegalArgumentException(
+  "beamWidth must be positive and less than or equal to "
+  + MAXIMUM_BEAM_WIDTH
+  + "; beamWidth="
+  + beamWidth);
+}
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
+this.mergeExec = null;
+this.numMergeWorkers = 1;

Review Comment:
   > Oh I see, you haven't yet put in the change to automatically figure out the 
number of workers
   
   I want to do that separately. It will require its own benchmarking, etc., and 
this change is good even without choosing an automatic split for the HNSW graph 
merge.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-13 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1522601711


##
lucene/core/src/java/org/apache/lucene/index/MergeScheduler.java:
##
@@ -52,6 +56,14 @@ public Directory wrapForMerge(OneMerge merge, Directory in) {
 return in;
   }
 
+  /**
+   * Provides an executor for parallelism during a single merge operation. By 
default, the method
+   * returns `null` indicating that there is no parallelism during a single 
merge operation.

Review Comment:
   I think it's ok to force this method to always return non-null, since we 
have already provided a good default?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+super("Lucene99HnswVectorsFormat");
+if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+  throw new IllegalArgumentException(
+  "maxConn must be positive and less than or equal to "
+  + MAXIMUM_MAX_CONN
+  + "; maxConn="
+  + maxConn);
+}
+if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+  throw new IllegalArgumentException(
+  "beamWidth must be positive and less than or equal to "
+  + MAXIMUM_BEAM_WIDTH
+  + "; beamWidth="
+  + beamWidth);
+}
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
+this.mergeExec = null;
+this.numMergeWorkers = 1;

Review Comment:
   Can remove those 2?



##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+super("Lucene99HnswVectorsFormat");
+if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+  throw new IllegalArgumentException(
+  "maxConn must be positive and less than or equal to "
+  + MAXIMUM_MAX_CONN
+  + "; maxConn="
+  + maxConn);
+}
+if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+  throw new IllegalArgumentException(
+  "beamWidth must be positive and less than or equal to "
+  + MAXIMUM_BEAM_WIDTH
+  + "; beamWidth="
+  + beamWidth);
+}
+this.maxConn = maxConn;
+this.beamWidth = beamWidth;
+this.mergeExec = null;
+this.numMergeWorkers = 1;

Review Comment:
   Oh I see, you haven't yet put in the change to automatically figure out the 
number of workers



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-12 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1992172143

   @jpountz 
   
   OK, so I did some benchmarking to look at the impact of this change. 500k 
docs, flushing segments every 1MB and then force merging. 
   
   For all of these, the number of workers is `8` and the threads available are 
`8`.
   
   Baseline single threaded (no concurrency in HNSW)
   ```
   Indexed: 467617ms
   Force merge: 593037 ms
   ```
   
   Baseline multi-threaded (executor separate from CMS)
   ```
   Indexed: 173143ms
   Force merge: 120341 ms
   ```
   
   Candidate: CMS with 8 merge threads, shared with intra-merge work (what this PR 
is doing). This shows that intra-merge tasks share threads with the CMS background merges. 
   
   There is a ton of merging work being done, so usually just 1 thread is being 
passed to the intra-merge executor. This is an expected result and working as 
intended.
   However, we see a speed up in force-merge as no other merge activity is 
occurring and it can use all the provided CMS threads.
   ```
   Indexed: 424924ms
   Force merge: 121705 ms
   ```
   
   To confirm that this is indeed the case and that indexing wasn’t slowed down 
due to some other weird overhead, I gave 2x as many threads to intra-merging. 
This effectively removes the limit that CMS is providing to intra-merges. 
   This shows that indexing is back in line with the multi-threaded baseline, basically 
using 2x as many threads as configured for CMS (8 merge threads and 8 
intra-merge threads).
   ```
   Indexed: 171825ms
   Force merge: 121886 ms
   ```
   
   To make sure we don’t slow down too much compared to the single-threaded 
baseline, I gave 8 workers to the HNSW merge, but only provided a 
SameThreadExecutorService.
   
   I did this to justify defaulting to a SameThreadExecutorService, even when 
HNSW has > 1 worker.
   
   The result is well within the runtime of the single-threaded baseline.
   ```
   Indexed: 418529ms
   Force merge: 524468 ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-12 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1521095208


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +941,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(
+  0, Math.max(1, maxThreadCount), 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+private void updatePoolSize() {
+  executor.setMaximumPoolSize(Math.max(1, maxThreadCount));
+}

Review Comment:
   It feels like we don't actually need this since we already have control on 
the thread pool size via `maxThreadCount`. What about setting a high max thread 
count in the ctor and never updating it (ie. making it a "cached" executor 
rather than a "scaled" executor).



##
lucene/core/src/java/org/apache/lucene/index/MergeScheduler.java:
##
@@ -52,6 +53,14 @@ public Directory wrapForMerge(OneMerge merge, Directory in) {
 return in;
   }
 
+  /**
+   * Provides an executor for parallelism during a single merge operation. By 
default, the method
+   * returns `null` indicating that there is no parallelism during a single 
merge operation.
+   */
+  public Executor getIntraMergeExecutor(OneMerge merge) {
+return null;

Review Comment:
   What about returning `SameThreadExecutor` by default instead?
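   
   For illustration, that default could be as small as this; a hedged sketch of 
one way `MergeScheduler#getIntraMergeExecutor` could provide a same-thread 
default, not necessarily what the PR settled on:
   
   ```java
   /**
    * Provides an executor for parallelism during a single merge operation. This
    * default runs every task on the calling thread, i.e. no intra-merge parallelism.
    */
   public Executor getIntraMergeExecutor(MergePolicy.OneMerge merge) {
     return Runnable::run;
   }
   ```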



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,68 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(
+  0, Math.max(1, maxThreadCount), 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+private void updatePoolSize() {
+  executor.setMaximumPoolSize(Math.max(1, maxThreadCount));
+}
+
+@Override
+public void execute(Runnable command) {
+  assert mergeThreads.contains(Thread.currentThread()) : "caller is not a 
merge thread";
+  Thread currentThread = Thread.currentThread();
+  if (currentThread instanceof MergeThread mergeThread) {
+execute(mergeThread, command);
+  } else {
+command.run();
+  }
+}
+
+private void execute(MergeThread mergeThread, Runnable command) {
+  // don't do multithreaded merges for small merges
+  if (mergeThread.merge.estimatedMergeBytes < MIN_BIG_MERGE_MB * 1024 * 
1024) {
+command.run();
+  } else {
+final boolean isThreadAvailable;
+// we need to check if a thread is available before submitting the 
task to the executor
+// synchronize on CMS to get an accurate count of current threads
+synchronized (ConcurrentMergeScheduler.this) {
+  int max = maxThreadCount - mergeThreads.size();
+  int value = activeCount.get();
+  if (value < max) {
+activeCount.incrementAndGet();
+isThreadAvailable = true;
+  } else {
+isThreadAvailable = false;
+  }
+}

Review Comment:
   @benwtrent ++ There is a risk of letting bigger merges starve smaller merges 
from threads otherwise, and I'm not sure how we could fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-05 Thread via GitHub


msokolov commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1513268905


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,68 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(
+  0, Math.max(1, maxThreadCount), 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+private void updatePoolSize() {
+  executor.setMaximumPoolSize(Math.max(1, maxThreadCount));
+}
+
+@Override
+public void execute(Runnable command) {
+  assert mergeThreads.contains(Thread.currentThread()) : "caller is not a 
merge thread";
+  Thread currentThread = Thread.currentThread();
+  if (currentThread instanceof MergeThread mergeThread) {
+execute(mergeThread, command);
+  } else {
+command.run();
+  }
+}
+
+private void execute(MergeThread mergeThread, Runnable command) {
+  // don't do multithreaded merges for small merges
+  if (mergeThread.merge.estimatedMergeBytes < MIN_BIG_MERGE_MB * 1024 * 
1024) {
+command.run();
+  } else {
+final boolean isThreadAvailable;
+// we need to check if a thread is available before submitting the 
task to the executor
+// synchronize on CMS to get an accurate count of current threads
+synchronized (ConcurrentMergeScheduler.this) {
+  int max = maxThreadCount - mergeThreads.size();
+  int value = activeCount.get();
+  if (value < max) {
+activeCount.incrementAndGet();
+isThreadAvailable = true;
+  } else {
+isThreadAvailable = false;
+  }
+}

Review Comment:
   I am going to just make stuff up here since I don't know this code well at 
all -- but I think I was assuming there was a ThreadPoolExecutor somewhere that 
was doling out threads for both of these operations. Now I'm realizing perhaps 
that's not the case and CMS is doing all that tracking on its own. I wonder if 
we should open a separate issue to migrate CMS to use JDK's thread pool 
abstractions? It could make some of this easier to handle.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-05 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1513208670


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,68 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor implements Executor {
+
+private final AtomicInteger activeCount = new AtomicInteger(0);
+private final ThreadPoolExecutor executor;
+
+public ScaledExecutor() {
+  this.executor =
+  new ThreadPoolExecutor(
+  0, Math.max(1, maxThreadCount), 1L, TimeUnit.MINUTES, new 
SynchronousQueue<>());
+}
+
+void shutdown() {
+  executor.shutdown();
+}
+
+private void updatePoolSize() {
+  executor.setMaximumPoolSize(Math.max(1, maxThreadCount));
+}
+
+@Override
+public void execute(Runnable command) {
+  assert mergeThreads.contains(Thread.currentThread()) : "caller is not a 
merge thread";
+  Thread currentThread = Thread.currentThread();
+  if (currentThread instanceof MergeThread mergeThread) {
+execute(mergeThread, command);
+  } else {
+command.run();
+  }
+}
+
+private void execute(MergeThread mergeThread, Runnable command) {
+  // don't do multithreaded merges for small merges
+  if (mergeThread.merge.estimatedMergeBytes < MIN_BIG_MERGE_MB * 1024 * 
1024) {
+command.run();
+  } else {
+final boolean isThreadAvailable;
+// we need to check if a thread is available before submitting the 
task to the executor
+// synchronize on CMS to get an accurate count of current threads
+synchronized (ConcurrentMergeScheduler.this) {
+  int max = maxThreadCount - mergeThreads.size();
+  int value = activeCount.get();
+  if (value < max) {
+activeCount.incrementAndGet();
+isThreadAvailable = true;
+  } else {
+isThreadAvailable = false;
+  }
+}

Review Comment:
   One major decision I made here is that CMS thread throttling knows nothing 
about intra-merge threads.
   
   This means, if there are intra-merge threads running, it is possible that 
concurrency creeps above `maxThreadCount` until the intra-merge thread(s) 
finish(es).
   
   Making throttling & overall CMS behavior more aware of intra-merge thread 
usage will be a much trickier change. I wasn't sure it was worth it. I am open 
to opinions on this @msokolov @jpountz .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-28 Thread via GitHub


shubhamvishu commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1506927786


##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -56,13 +58,19 @@ final class SegmentMerger {
   InfoStream infoStream,
   Directory dir,
   FieldInfos.FieldNumbers fieldNumbers,
-  IOContext context)
+  IOContext context,
+  Executor parallelMergeTaskExecutor)
   throws IOException {
 if (context.context != IOContext.Context.MERGE) {
   throw new IllegalArgumentException(
   "IOContext.context should be MERGE; got: " + context.context);
 }
-mergeState = new MergeState(readers, segmentInfo, infoStream);
+mergeState =
+new MergeState(
+readers,
+segmentInfo,
+infoStream,
+parallelMergeTaskExecutor == null ? null : new 
TaskExecutor(parallelMergeTaskExecutor));

Review Comment:
   > it's a bit weird to use a class from the search package for merging 
(TaskExecutor)
   
   +1 .. @jpountz since we are now using `TaskExecutor` for both search and 
indexing use cases, does it make sense to move it into the `org.apache.lucene.util` 
package?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-27 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1503892704


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}
+
+private void updatePoolSize() {
+  int newMax = Math.max(0, maxThreadCount - 1);
+  if (newMax > getCorePoolSize()) {
+setMaximumPoolSize(Math.max(newMax, 1));
+setCorePoolSize(newMax);
+  } else {
+setCorePoolSize(newMax);
+setMaximumPoolSize(Math.max(newMax, 1));

Review Comment:
   The order in which the max is set is important. 
   
   But I am going to change this. I think the core thread count should be `0`, and 
we shouldn't update it at all; we can let the threadpool scale on its own.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-26 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1503101967


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}
+
+private void updatePoolSize() {
+  int newMax = Math.max(0, maxThreadCount - 1);
+  if (newMax > getCorePoolSize()) {
+setMaximumPoolSize(Math.max(newMax, 1));
+setCorePoolSize(newMax);
+  } else {
+setCorePoolSize(newMax);
+setMaximumPoolSize(Math.max(newMax, 1));

Review Comment:
   Aren't those two branches the same?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-26 Thread via GitHub


msokolov commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1964229191

   ooh exciting! I left some comments in a related issue that were maybe a 
little clueless given all the discussion here that I missed until now. Still, 
I'm happy about the direction this seems to be going re: moving the 
configuration out of the codec and into the merge policy, and coordinating thread 
usage across inter-/intra-segment merges.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-25 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1502165551


##
lucene/core/src/java/org/apache/lucene/index/MergeScheduler.java:
##
@@ -52,6 +56,14 @@ public Directory wrapForMerge(OneMerge merge, Directory in) {
 return in;
   }
 
+  /**
+   * Provides an executor for parallelism during a single merge operation. By 
default, this method
+   * returns an executor that runs tasks in the calling thread.
+   */
+  public Executor getInterMergeExecutor(OneMerge merge) {

Review Comment:
   `inter` vs `intra` is just as bad as `affect` and `effect`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-25 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1502165014


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}
+
+private void updatePoolSize() {
+  int newMax = Math.max(0, maxThreadCount - 1);
+  if (newMax > getCorePoolSize()) {
+setMaximumPoolSize(Math.max(newMax, 1));
+setCorePoolSize(newMax);
+  } else {
+setCorePoolSize(newMax);
+setMaximumPoolSize(Math.max(newMax, 1));
+  }
+}
+
+boolean incrementUpTo(int max) {
+  while (true) {
+int value = activeCount.get();
+if (value >= max) {
+  return false;
+}
+if (activeCount.compareAndSet(value, value + 1)) {
+  return true;
+}

Review Comment:
   @jpountz you are correct, my logic here isn't what it needs to be. I will 
fix it up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-25 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1502163736


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}

Review Comment:
   > Thinking out loud: for the case when tens of index writers are open in the 
same JVM, we may want to configure a timeout on threads in order to avoid 
spending too much heap on idle threads?
   
   The timeout would only apply to non-core threads. But your point is taken: 
it is possible for merges to be idle for a while, and we don't want threads just 
sitting around taking up unnecessary space. 
   
   So, I can make `core` always `0`, and rely on the pool dynamically adding 
threads up to max.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-25 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1502160415


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {

Review Comment:
   Yeah no doubt. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-25 Thread via GitHub


jpountz commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1501838905


##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {

Review Comment:
   nit: I usually have a preference for composition over inheritance, i.e. could 
we wrap the thread-pool executor instead of extending it?



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}
+
+private void updatePoolSize() {
+  int newMax = Math.max(0, maxThreadCount - 1);
+  if (newMax > getCorePoolSize()) {
+setMaximumPoolSize(Math.max(newMax, 1));
+setCorePoolSize(newMax);
+  } else {
+setCorePoolSize(newMax);
+setMaximumPoolSize(Math.max(newMax, 1));
+  }
+}
+
+boolean incrementUpTo(int max) {
+  while (true) {
+int value = activeCount.get();
+if (value >= max) {
+  return false;
+}
+if (activeCount.compareAndSet(value, value + 1)) {
+  return true;
+}

Review Comment:
   If I read correctly, this tries to keep `activeCount <= maxThreadCount`. I 
was thinking we should try to keep `activeCount <= maxThreadCount - 
mergeThreadCount()`. Otherwise we're effectively using more than 
`maxThreadCount` for merging in total, which I find a bit surprising?



##
lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java:
##
@@ -910,4 +936,58 @@ public void setSuppressExceptions(ConcurrentMergeScheduler 
cms) {
   }
 });
   }
+
+  private class ScaledExecutor extends ThreadPoolExecutor {
+
+AtomicInteger activeCount = new AtomicInteger(0);
+
+public ScaledExecutor() {
+  super(
+  Math.max(0, maxThreadCount - 1),
+  Math.max(1, maxThreadCount - 1),
+  Long.MAX_VALUE,
+  TimeUnit.NANOSECONDS,
+  new SynchronousQueue<>());
+}

Review Comment:
   Thinking out loud: for the case when tens of index writers are open in the 
same JVM, we may want to configure a timeout on threads in order to avoid 
spending too much heap on idle threads?



##
lucene/core/src/java/org/apache/lucene/index/MergeScheduler.java:
##
@@ -52,6 +56,14 @@ public Directory wrapForMerge(OneMerge merge, Directory in) {
 return in;
   }
 
+  /**
+   * Provides an executor for parallelism during a single merge operation. By 
default, this method
+   * returns an executor that runs tasks in the calling thread.
+   */
+  public Executor getInterMergeExecutor(OneMerge merge) {

Review Comment:
   Should it be get**Intra**MergeExecutor?



##
lucene/core/src/java/org/apache/lucene/index/NoMergeScheduler.java:
##
@@ -52,4 +53,9 @@ public Directory wrapForMerge(OneMerge merge, Directory in) {
   public MergeScheduler clone() {
 return this;
   }
+
+  @Override
+  public Executor getInterMergeExecutor(OneMerge merge) {
+return null;

Review Comment:
   nit: throw instead?



##
lucene/core/src/java/org/apache/lucene/index/SegmentMerger.java:
##
@@ -56,13 +58,19 @@ final class SegmentMerger {
   InfoStream infoStream,
   Directory dir,
   FieldInfos.FieldNumbers fieldNumbers,
-  IOContext context)
+  IOContext context,
+  Executor parallelMergeTaskExecutor)
   throws IOException {
 if (context.context != IOContext.Context.MERGE) {
   throw new IllegalArgumentException(
   "IOContext.context should be MERGE; got: " + context.context);
 }
-mergeState = new MergeState(readers, segmentInfo, infoStream);
+mergeState =
+new MergeState(
+readers,
+segmentInfo,
+infoStream,
+parallelMergeTaskExecutor == null ? null : new 
TaskExecutor(parallelMergeTaskExecutor));

Review Comment:
   Nit: it's a bit weird to use a class from the search package for merging 
(TaskExecutor). Should merging get access to the raw Executor, which is a bit 
more flexible (I don't know if all formats will be able to split work into a 
list of tasks up-front)? Vectors could still wrap inside a `TaskExecutor` for 
convenience?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961990406

   OK, I took a stab at this @jpountz; the latest commit has a POC. It still needs 
testing, but early performance testing shows indexing speed is better and forcemerge 
speed is just about the same. I am still using the numWorkers setting from HNSW 
to determine the number of workers. This is mainly to avoid testing too many 
performance changes at a time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


zhaih commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961832502

   > I am not thinking about binding them. I think that MergeScheduler itself 
should be extended to return a TaskExecutor (probably defaulting to null to 
indicate none, or maybe SameThreadExecutorService).
   
   Sounds good
   
   > I really don't think users should configure numWorkers or workPerThread at 
all. I would much prefer us supply good defaults and remove configuration.
   
   No, I'm not suggesting that either. I'm open to letting the user specify or to 
providing a default; I just want to make sure performance is not affected (by no 
longer pre-dividing the doc space among the threads).
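   
   For illustration, the "expected workload per thread" idea could look something 
like this; the numbers and names are assumptions for the sketch, not anything in 
the PR:
   
   ```java
   // Hypothetical sizing: aim for roughly DOCS_PER_WORKER documents per worker,
   // capped by however many threads the merge scheduler is willing to hand out.
   static int numMergeWorkers(int maxDocs, int maxAvailableThreads) {
     final int DOCS_PER_WORKER = 10_000; // illustrative default workload per thread
     int byWorkload = Math.max(1, maxDocs / DOCS_PER_WORKER);
     return Math.min(byWorkload, Math.max(1, maxAvailableThreads));
   }
   ```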


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961808528

   @zhaih I don't see any reason why we also cannot extend the SerialMS and allow multiple threads per merge run. 
   
   We will have to update the base `MergeScheduler` class anyway, as the merges being run don't know anything about who kicked off the merge (and shouldn't).
   
   The reason for all the CMS discussion is that it is the hardest to implement correctly (to me anyway...). For SerialMS, users could provide a number and it's a static executor with that number of threads.
   
   > But still keep the current way of merge to keep the performance?
   
   I really don't think users should configure numWorkers or workPerThread at 
all. I would much prefer us supply good defaults and remove configuration.
   
   If we had done the parallelism outside the codec to begin with, I don't think we would have added any configurable values to the HNSW codec.
   
   > So if we bind those two together whether we potentially prevent a part of 
users using the intra-segment merges?
   
   I am not thinking about binding them. I think that `MergeScheduler` itself 
should be extended to return a TaskExecutor (probably defaulting to `null` to 
indicate none, or maybe `SameThreadExecutorService`).
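
   Just to make the idea concrete, a minimal sketch of such a hook (the method name and the same-thread default are hypothetical, not the actual Lucene API):

```java
import java.util.concurrent.Executor;

// Hypothetical sketch: a MergeScheduler-style hook exposing an executor for
// parallel work within a single merge. Defaulting to same-thread execution
// means "no intra-merge parallelism", so existing schedulers keep their behavior.
abstract class SketchMergeScheduler {
  protected Executor getIntraMergeExecutor() {
    return Runnable::run; // run tasks in the calling merge thread by default
  }
}
```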
   
   CMS is just the most difficult one to figure out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


zhaih commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961788368

   The way HNSW concurrent merge is currently implemented is: each worker uses an AtomicInteger to coordinate and only does a small batch of work (1024 documents) at a time. The advantage is that we are able to load-balance between workers, and I remember this did bring some (5-10%) performance gain when I was testing it.
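
   Roughly, that coordination looks like the following simplified sketch (illustrative only, not the actual HNSW merge code; the per-document call is a placeholder):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch: each worker repeatedly claims a 1024-document batch off a
// shared counter, so load naturally balances across workers.
final class BatchedGraphMergeSketch {
  private static final int BATCH_SIZE = 1024;
  private final AtomicInteger nextDoc = new AtomicInteger();

  void workerLoop(int maxDoc) {
    int start;
    while ((start = nextDoc.getAndAdd(BATCH_SIZE)) < maxDoc) {
      int end = Math.min(start + BATCH_SIZE, maxDoc);
      for (int doc = start; doc < end; doc++) {
        addVectorToGraph(doc); // placeholder for the real per-document graph insertion
      }
    }
  }

  private void addVectorToGraph(int doc) {
    // the real implementation adds this document's vector to the shared HNSW graph
  }
}
```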
   
   Maybe, instead of specifying a numWorkers per merge, we can default to an expected workload per thread (like 10K) and then allocate numWorkers dynamically? But still keep the current way of merging to keep the performance?
   
   One thing I'm worried about with putting everything into CMS is that we'd be binding intra-segment merging to CMS. To my understanding, using CMS means we're using background threads to merge and merging becomes non-deterministic, which is why some users still use SMS (SerialMS) or something similar to keep merging deterministic. On the other hand, the HNSW concurrent merge does not affect that aspect at all: no matter how many threads you use, it won't affect the determinism of the merge result. So if we bind those two together, would we potentially prevent a subset of users from using intra-segment merges?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961730452

   > Intra-merge concurrency would take advantage of the fact that there will 
sometimes be fewer active merges than threads to enable intra-merge 
concurrency. 
   
   Sorry for being so dense; `Executor#execute` finally clicked for me. `Executor#execute` can check `currentRunningThreads + mergeThreads.size()`. If that is `>= maxThreadCount`, execute in the current thread; otherwise spawn or take a thread from a pool. This would then increment `currentRunningThreads`.
   
   I will spin a little bit on this to see what I can come up with.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961602616

   Maybe some of these things are too ambitious, but ideally I'd like it to 
work this way.
   
   `ConcurrentMergeScheduler` already tracks a `maxMergeCount`, which controls the max number of running merges, and a `maxThreadCount`, which controls the max number of threads that merges may use. Ideally I'd like `maxThreadCount` to include threads used for both inter-merge concurrency and intra-merge concurrency. So this is similar to your first suggestion except 
that I'm bounding the total number of threads to `maxThreadCount` rather than 
`maxThreadCount + maxMergeCount`.
   
   Intra-merge concurrency would take advantage of the fact that there will 
sometimes be fewer active merges than threads to enable intra-merge 
concurrency. E.g. we could have a pool of threads for intra-merge concurrency 
that would try to ensure that its number of active threads is always less than 
or equals to `max(0, maxThreadCount - mergeThreads.size())`. For instance 
`Executor#execute` could be implemented such that it runs the runnable in the 
current thread if the number of active merges plus the number of active threads 
in the intra-merge thread pool is greater than or equal to `maxThreadCount`. 
Otherwise it would fork to the intra-merge thread pool.
   
   Concurrent merging for vectors wants to know the number of available workers 
today, but maybe we can change the logic (like you suggested) to split the doc 
ID space into some number of slices, e.g. max(128, maxDoc / 2^16), and 
sequentially send these slices to `Executor#execute` (sometimes running in the 
same thread, sometimes forked to the intra-merge thread pool), except the last 
one that would be forced to run in the current thread (like we used to do in 
`IndexSearcher` until recently).
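
   A rough sketch of the kind of `Executor` described above, assuming CMS-like bookkeeping (every name below is made up for illustration; this is not the actual ConcurrentMergeScheduler change):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Illustrative only: run the task in the calling merge thread once the combined
// budget (active merge threads + active intra-merge workers) reaches maxThreadCount,
// otherwise fork it to a shared intra-merge pool.
final class IntraMergeExecutorSketch implements Executor {
  private final int maxThreadCount;
  private final IntSupplier activeMergeThreads; // e.g. () -> mergeThreads.size()
  private final AtomicInteger activeIntraMergeWorkers = new AtomicInteger();
  private final ExecutorService pool = Executors.newCachedThreadPool();

  IntraMergeExecutorSketch(int maxThreadCount, IntSupplier activeMergeThreads) {
    this.maxThreadCount = maxThreadCount;
    this.activeMergeThreads = activeMergeThreads;
  }

  @Override
  public void execute(Runnable task) {
    int busy = activeMergeThreads.getAsInt() + activeIntraMergeWorkers.get();
    if (busy >= maxThreadCount) {
      task.run(); // no spare budget: stay on the calling merge thread
    } else {
      activeIntraMergeWorkers.incrementAndGet();
      pool.execute(() -> {
        try {
          task.run();
        } finally {
          activeIntraMergeWorkers.decrementAndGet();
        }
      });
    }
  }
}
```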


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-23 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1961417895

   OK, my head is spinning a bit trying to grok the side-effects of the CMS. I 
think I understand most of it now. 
   
   Currently we have two adjustable parameters. 
   
   `maxThreadCount` dictates merge throughput, or how many merges we have at a time.
   
   `mergeWorkerCount` dictates back pressure, or when we stop allowing merges 
to be queued at all & block upstream indexing from continuing. 
   
   Throughout the life of a merge, it can become (un)paused numerous times; this is controlled via the `RateLimitedIndexOutput`, which only rate-limits WRITING results, not reading or anything else related to threads. 
   
   Now we want to add a new configuration, `maxMergeThreads`, which will control merge latency by adding parallelism to each individual merge. 
   
   Since merge thread pausing only has to do with index output, I am not sure 
we need to add any individual thread throttling other than what’s already 
there. The Directory wrapper will pause/throttle all writing occurring for the 
merge. This is acceptable even if the merge is using multiple threads. 
   
   I also think that small merges (<50MB) should never be allowed to run over multiple threads. Similar to how we never throttle those because they are so small, the benefit in latency reduction would be minuscule, and we should reserve the extra thread usage for larger merges. 
   
   What I am stuck on is this:

- Should we use a common pool for all merges? Thus restricting the total 
threads used by merging to be `maxThreadCount` + `maxMergeThreads`? This will 
simplify the logic in the CMS significantly as a single task executor can be 
used for all merges.
- Or should we use an individual executor/pool per merge? Thus restricting total threads to `maxThreadCount` * `maxMergeThreads` (or some fraction of `maxMergeThreads`)? This could get interesting… How do we determine how many 
threads each merge can get? Are we ok with creating a new task executor on 
every larger merge and then closing it?
   
   What do you think @jpountz?
   
   One other thing we need to fix is this idea of “numWorkers” in HNSW. It seems like it should just pick optimal slices given the number of vectors it has (similar to the multi-segment search stuff), chunk itself into those slices, and then be at the behest of the task executor. What say you @zhaih? It seems weird to use `numWorkers` as a way to say “only use this many threads” when we have no knowledge of how many threads the task executor actually has.
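
   For the slicing, something along these lines could work (again just a sketch; `mergeSlice` is a placeholder, not real code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Sketch: split the merged doc ID space into fixed-size slices and hand each
// slice to the task executor as an independent callable.
final class DocIdSlicerSketch {
  static List<Callable<Void>> slices(int maxDoc, int sliceSize) {
    List<Callable<Void>> tasks = new ArrayList<>();
    for (int start = 0; start < maxDoc; start += sliceSize) {
      final int from = start;
      final int to = Math.min(start + sliceSize, maxDoc);
      tasks.add(() -> {
        mergeSlice(from, to); // placeholder for the per-slice merge work
        return null;
      });
    }
    return tasks;
  }

  private static void mergeSlice(int from, int to) {
    // the real implementation would add vectors in [from, to) to the merged graph
  }
}
```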


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-22 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1499916055


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +153,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);

Review Comment:
   @zhaih I see what you are saying about the weighting of different codecs, where certain codecs could merge in parallel using a different number of threads. My concern is how this would ever be effectively communicated to the merge scheduler. 
   
   While it isn't a 1-1 relationship, adding more threads here in HNSW will 
increase I/O. I assume that for other codecs that could use parallelism, 
additional I/O will also be a concern.
   
   The scheduler needs to know how to throttle I/O if this gets too large so that merges and indexing can be smoothed out, correct?
   
   This indicates to me the scheduler should be in charge of handing out 
threads to do work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-22 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1959856815

   > For instance the merge scheduler could return a custom Executor that 
dynamically decides to run a new task in the current thread or to fork to a 
separate thread depending on how many threads are currently busy across all 
merges vs. the current value of ConcurrentMergeScheduler.maxThreadCount?
   
   This makes me think that instead of it being a separate executor, it should return a dynamic value for `numParallelMergeWorkers`, or, if `numParallelMergeWorkers == 1`, return `null` so that there is no parallelism.
   
   This would require some sort of "planning" on the merge scheduler (knowing 
how many are queued and how many `numParallelMergeWorkers` it provided to each 
executing merge action). I guess we could rely on an executor's 
`getActiveCount()` but that seems trappy. 
   
   This all implies that the configuration will actually be `maxParallelMergeWorkers`, and the MergeScheduler is free to provide any number of workers up to that limit. 
   
   
   The tricky part to me is determining 'how busy' the current merges are. The 
MergeScheduler could pass `numParallelMergeWorkers` to the `SegmentMerger`, but 
it goes completely unused. Maybe this is OK and we just assume it will be used.
   
   Another open question is determining the total number of threads allowed for merging (inter/intra). We could default this to `maxParallelMergeWorkers * maxThreadCount`. In that case `maxParallelMergeWorkers == 1` would behave the same as the current code, while `maxParallelMergeWorkers == 2` would mean that we potentially use twice as many resources and the user should adjust `maxThreadCount` accordingly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-22 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1959736838

   > So, we would have to update the MergeScheduler to have some methods to 
return the executor for us to use and pass to MergeState (which is only created 
via the SegmentMerger object). This means that the scheduler for individual 
merges and the parallelism available to those individual merges are independent.
   
   I agree with passing some executor from the scheduler to the `MergeState`, 
but I'm not sure I agree that this implies that inter-merge and intra-merge 
parallelism would be independent. For instance the merge scheduler could return 
a custom `Executor` that dynamically decides to run a new task in the current 
thread or to fork to a separate thread depending on how many threads are 
currently busy across all merges vs. the current value of 
`ConcurrentMergeScheduler.maxThreadCount`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-22 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1959551147

   > So I think it is still better to have two separate thread pools for 
inter-segment merge and inner segment merge, but I wonder whether we can have a 
ThreadPoolManager which configures both inter and inner pool. And CMS and HNSW 
merge (and any future merges) will source from this manager?
   
   Maybe? But this doesn't seem to simplify anything for users. They will still 
need two things to configure (number of merges that can happen at a time, 
number of threads available to all those merges to take action). 
   
   There is always the issue of making the best use of resources and the potential of over-allocating CPU cores. I don't think any finagling we do will ultimately change that.
   
   Selfishly, I really don't want to mess with the CMS code at all. Even if we 
switch it to use an executor of some sort, the complexities of I/O throttling & 
executable queuing would still all exist. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-22 Thread via GitHub


benwtrent commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1499311351


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +153,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);

Review Comment:
   I am not sure we want such configuration in a default codec at all. I would rather the concurrency be controlled entirely upstream. 
   
   My thought is that if all of this had been part of the MergeState before, we wouldn't have updated the codec definition at all.
   
   In the future, as more things use parallelism, maybe we should consider adding this setting back (or a completely new setting).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-21 Thread via GitHub


zhaih commented on code in PR #13124:
URL: https://github.com/apache/lucene/pull/13124#discussion_r1498656246


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:
##
@@ -152,7 +153,25 @@ public Lucene99HnswVectorsFormat() {
* @param beamWidth the size of the queue maintained during graph 
construction.
*/
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);

Review Comment:
   If there will be more inner-segment concurrent merges in the future, I think it might still be better to leave the `numMergeWorkers` here rather than put it in IWC? Otherwise we'll either have too many similar configurations in IWC, or it will be hard to allocate weight across different types of merges?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-21 Thread via GitHub


zhaih commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1958716375

   +1 to moving the executor away from the Codec API (although it's me who placed it there LOL)
   
   > it would be nice to fully encapsulate the merging concurrency there 
instead of having two sources of merging concurrency that are not aware of one 
another.
   
   I like the idea, but I'm a little bit worried about the situation where the HNSW merging threads take most of the resources such that the original CMS threads cannot be executed. Since the `numWorkers` we configure is a per-merge limit, if there are several big merges happening at the same time, say `n` merges, they will use `n * (numWorkers + 1)` threads; the thread pool can run out of threads very quickly, resulting in smaller merges being blocked by large merges.
   
   So I think it is still better to have two separate thread pools for inter-segment merge and inner-segment merge, but I wonder whether we can have a `ThreadPoolManager` which configures both the inter and inner pools. CMS and HNSW merge (and any future merges) would then source threads from this manager?
   
   Also, I heard there's a use case where one CMS is currently shared across multiple IWs in the same process for better resource management; this new `ThreadPoolManager` could maybe be a solution for them as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-21 Thread via GitHub


benwtrent commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1957696114

   >  it would be nice to fully encapsulate the merging concurrency there 
instead of having two sources of merging concurrency that are not aware of one 
another.
   
   To take advantage of the executor, etc. it needs to be accessible from the 
scheduler. So, we would have to update the `MergeScheduler` to have some 
methods to return the executor for us to use and pass to `MergeState` (which is 
only created via the `SegmentMerger` object). This means that the scheduler for 
individual merges and the parallelism available to those individual merges are 
independent.
   
   I don't know how they would ever be "aware" of one another. 
   
   > I'm also keen on keeping the codec API as simple as possible and adding 
the executor as a member of the MergeState rather than a new parameter of all 
codec write APIs.
   
   I updated the PR to do just this. It's a much cleaner API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-02-21 Thread via GitHub


jpountz commented on PR #13124:
URL: https://github.com/apache/lucene/pull/13124#issuecomment-1957581351

   Thinking out loud: since merge schedulers already have the ability to merge 
concurrently (across multiple merges rather than within a merge though), it 
would be nice to fully encapsulate the merging concurrency there instead of 
having two sources of merging concurrency that are not aware of one another.
   
   I'm also keen on keeping the codec API as simple as possible and adding the 
executor as a member of the `MergeState` rather than a new parameter of all 
codec write APIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org