Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-12-12 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2539436869

   @vigyasharma @jpountz @mikemccand  Any thoughts on the above approach of 
using multiple IndexWriters for different groups (tenants) with a read-only 
combined view?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-11-21 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2492296015

   > Does the OpenSearch client directly work with 'n' different log-group 
specific IndexWriters?
   
   While writing logs, OpenSearch will interact with 'n' different log-group 
specific IndexWriters. For example, if logs are grouped by status code, a 5xx 
log entry will be written using a 5xx-specific IndexWriter.
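
As a rough illustration of the write-side routing described above, here is a minimal sketch. `GroupWriter` and `StatusCodeRouter` are hypothetical names, and `GroupWriter` is only a stand-in for a group-level IndexWriter; nothing here is a Lucene or OpenSearch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for one group-level writer; a real setup would hold a Lucene IndexWriter here.
class GroupWriter {
    final String group;
    final List<String> docs = new ArrayList<>();

    GroupWriter(String group) { this.group = group; }

    void addDocument(String doc) { docs.add(doc); }
}

class StatusCodeRouter {
    private final Map<String, GroupWriter> writers = new LinkedHashMap<>();

    // The set of groups is fixed up front, mirroring a setting chosen at index creation.
    StatusCodeRouter(List<String> groups) {
        for (String g : groups) {
            writers.put(g, new GroupWriter(g));
        }
    }

    // Maps 200..299 to "2xx", 500..599 to "5xx", and so on.
    static String groupFor(int statusCode) {
        return (statusCode / 100) + "xx";
    }

    // Sends the document to the writer that owns its group and returns that writer.
    GroupWriter route(int statusCode, String doc) {
        GroupWriter w = writers.get(groupFor(statusCode));
        if (w == null) {
            throw new IllegalArgumentException("no writer configured for status " + statusCode);
        }
        w.addDocument(doc);
        return w;
    }
}
```

With groups `2xx`, `4xx` and `5xx` configured, `route(503, ...)` lands in the `5xx` writer, and a status code outside the configured groups is rejected rather than creating a new writer.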
   
   Conversely, for read flows such as creating a reader or retrieving the latest 
commit (or SegmentInfos state) associated with a directory (or IndexWriter), for 
example when uploading to a snapshot or syncing replica state from the primary 
during a checkpoint in SegRep, OpenSearch will interact with Lucene via the 
combined view (parent IndexWriter). This parent IndexWriter internally 
references segments of group-level IndexWriters (200_0, 300_0 etc).
   
   Having separate IndexWriters for different groups ensures that logs from 
different groups are maintained in different segments. Meanwhile, the parent 
IndexWriter provides a combined view of the group-level segments of a Lucene 
index for operations like opening readers, syncing replicas, and uploading the 
SegmentInfos of an index to a remote snapshot.
   
   > When a new log group is discovered, does the client create a new 
IndexWriter and add it to parent?
   
   The number of groups (IndexWriters) will be fixed and will be determined via 
a setting during index creation.
   
   > Do we really need a parent "IndexWriter" with this approach? Would a 
Multi-Reader on all the child log-group directories work?
   
   A Multi-Reader on all the child log-group directories still won't provide a 
unified view of all group-level segments associated with a Lucene index. Even 
today, OpenSearch interacts with a Lucene index not only to index documents or 
open a reader on them, but also to retrieve the SegmentInfos associated with the 
latest commit of an IndexWriter directory (e.g. for storing snapshots of an 
index on a remote store) or to obtain the file list associated with a past 
commit (for deleting unreferenced files inside the commit deletion policy). 
Exposing the group-level segments through a single IndexWriter associated with 
one Lucene index ensures that the index still behaves as a single entity (the 
parent IndexWriter can be used to get a common commit for the group-level 
IndexWriters).
   
   Another approach is to use a SegmentInfos instance instead of an IndexWriter 
to maintain a common view for group-level IndexWriters. Since in the above 
approach the parent IndexWriter only periodically syncs and combines the 
SegmentInfos of group-level IndexWriters, we can replace the parent IndexWriter 
with a SegmentInfos as the combined view. This parent SegmentInfos will 
reference the segments of group-level IndexWriters, similar to what a parent 
IndexWriter does.
   
   Let me know if this makes sense.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-11-12 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2470112408

   After some more analysis, we figured out an approach that addresses all the 
above comments and obtains the same improvement with a different IndexWriter 
per group as we got with different DWPTs per group.
   
   ## Using separate IndexWriters for different tenants with a combined view
   
   ### Current Issue
   
   Maintaining a separate IndexWriter for each group (tenant) presents a 
significant problem: the writers do not function as a single unified entity. 
Although distinct IndexWriters and directories for each group ensure that data 
belonging to different groups is kept in separate segments and that only 
segments within the same group are merged, the client (OpenSearch) still needs a 
unified read-only view to interact with these group-level IndexWriters.
   
   Lucene’s addIndexes API offers a way to combine group-level IndexWriters 
into a single parent-level IndexWriter, but this approach has multiple drawbacks:
   
   1. Since writes may continue on group-level IndexWriters, periodic 
synchronisation with the parent-level IndexWriter is necessary.
   2. During synchronisation, an external lock needs to be placed on the group 
level IndexWriter directory, causing downtime.
   3. Synchronisation will also involve copying files from the group level 
IndexWriter directory to the parent IndexWriter directory, which is 
resource-intensive, consuming disk IO and CPU cycles.
   
   ### Proposal
   
   To address this issue, we propose introducing a mechanism that combines 
group-level IndexWriters as a soft reference to a parent IndexWriter. This will 
be achieved by creating a new variant of the addIndexes API within IndexWriter, 
which will only combine the SegmentInfos of group-level IndexWriter without 
requiring an external lock or copying files across directories. Group-level 
segments will be maintained in separate directories associated with their 
respective group-level IndexWriters.
   
   The client will periodically call this addIndexes API on the parent 
IndexWriter (on the OpenSearch side this corresponds to the index refresh 
interval of 1 sec), passing the SegmentInfos of the child-level IndexWriters as 
parameters to sync the latest SegmentInfos with the parent IndexWriter. While 
combining the SegmentInfos of child-level IndexWriters, the addIndexes API will 
attach a prefix to the segment names to identify the group each segment belongs 
to, avoiding name conflicts between segments of different group-level 
IndexWriters.
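
To make the naming step concrete, the following sketch shows one way the prefixed names could be built. It assumes group ids such as `200` and per-group segment names such as `_0`, and it models segments as plain strings; `CombinedSegmentView` is a hypothetical name and not part of the proposed addIndexes variant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinedSegmentView {
    // Builds the parent's segment-name list from each group's own segment names.
    // Input: group id -> segment names as that group's writer knows them (e.g. "_0", "_1").
    static List<String> combine(Map<String, List<String>> segmentsByGroup) {
        List<String> combined = new ArrayList<>();
        // A TreeMap copy gives a deterministic group order for the sketch.
        for (Map.Entry<String, List<String>> e : new TreeMap<>(segmentsByGroup).entrySet()) {
            for (String segment : e.getValue()) {
                // "200" + "_0" becomes "200_0", so equal names from different groups no longer clash.
                combined.add(e.getKey() + segment);
            }
        }
        return combined;
    }
}
```

Both the `200` and `400` writers can own a segment called `_0`; after combining, the parent view sees them as `200_0` and `400_0`.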
   
   ![compositeIndexWriter drawio 
(1)](https://github.com/user-attachments/assets/8ddd8568-a352-41ac-bc42-ce3cb4647f8f)
   
   The parent IndexWriter will be associated with a filter directory that 
distinguishes the tenant using the file name prefix, redirecting any read/write 
operation on a file to the correct group-level directory based on the segment 
file's prefix.
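
A minimal sketch of that redirection, assuming file names of the form `<group>_<segment file>` such as `200_0.cfs`. `PrefixRouter` is a hypothetical helper that only resolves paths; a real implementation would sit behind a Lucene `Directory` wrapper, which this sketch does not attempt.

```java
import java.nio.file.Path;
import java.util.Map;

class PrefixRouter {
    private final Map<String, Path> groupDirs;

    PrefixRouter(Map<String, Path> groupDirs) {
        this.groupDirs = groupDirs;
    }

    // "200_0.cfs" -> group "200", local name "_0.cfs", resolved inside that group's directory.
    Path resolve(String prefixedName) {
        int sep = prefixedName.indexOf('_');
        if (sep <= 0) {
            throw new IllegalArgumentException("no group prefix in " + prefixedName);
        }
        Path dir = groupDirs.get(prefixedName.substring(0, sep));
        if (dir == null) {
            throw new IllegalArgumentException("unknown group in " + prefixedName);
        }
        // Keep the leading '_' so the local name matches what the group-level writer wrote.
        return dir.resolve(prefixedName.substring(sep));
    }
}
```

Names without a prefix (for example `_0.cfs`) are rejected here, which is where the second consideration below about parsing segment names would come into play.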
   
   ### Reason for choosing an IndexWriter as the common view
   
   Most interactions between Lucene and the client (OpenSearch), such as opening 
a reader, getting the latest commit info, or reopening a Lucene index, occur 
via the IndexWriter itself. Selecting an IndexWriter as the common view 
therefore made more sense.
   
   ### Improvements with multiple IndexWriters with a combined view
   
   We observed around 50% - 60% improvement with the multiple-IndexWriter 
combined-view approach, similar to what we observed with different DWPTs for 
different tenants (the initial proposal).
   
   ### Considerations
   
   1. The referencing IndexWriter will be a combined read-only view for the 
group-level IndexWriters. Since this IndexWriter does not itself have any 
segments and only references the SegmentInfos of other IndexWriters, write 
operations like segment merge, flush etc. should not be performed on this parent 
IndexWriter instance.
   2. We need to consider the prefix attached before segment names when 
[parsing segment 
names](https://github.com/RS146BIJAY/lucene/blob/84811e974f38181b0c1f1e1b5655f674a1584385/lucene/core/src/java/org/apache/lucene/index/IndexFileNames.java#L119).
   3. It will be difficult to support update queries with the multi-IndexWriter 
approach. For example, if we are grouping logs by status code and a user updates 
the status code field of a log, then for Lucene the insert and update operations 
need to be performed on separate delete queues.
   
   



Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-20 Thread via GitHub


vigyasharma commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2364348826

   > 3\. Does require a new merge policy to merge the segments belonging to the 
same group.
   
   How do background index merges work with the original, separate DWPT based 
approach? Don't you need to ensure that you only merge segments that belong to 
a single group?





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-19 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2360645812

   ## Approach 2: Using a physical directory for each group
   
   
![approach2](https://github.com/user-attachments/assets/223686c4-5c0c-49c1-b54c-1aee22a2d1bf)
   
   To segregate segments belonging to different groups and avoid attaching a 
prefix to segment names, we associated group-level IndexWriters with a physical 
directory instead of a filter directory. The CompositeIndexWriter is linked to 
the top-level multi-tenant directory, while group-level IndexWriters are 
connected to individual group-specific directories within the parent directory. 
Since segments belonging to each group are now in a separate directory, there is 
no need to prefix segment names, which solves the prefix issue of the above 
approach. Separate IndexWriters ensure that only segments belonging to the same 
group are merged together.
   
   ### Pros
   
   1. Having a different directory for each group’s IndexWriter reduces the 
chance of breaking Lucene’s internal calls.
   
   ### Cons
   
   1. Multiple IndexWriters still do not function as a single entity when 
interacting with the client (OpenSearch). Each IndexWriter has its own 
associated SegmentInfos, index commit, SegmentInfos generation and version. 
This breaks multiple features like segment replication and its derivative, 
remote store. For example, in a remote-store-enabled cluster, we maintain a 
replica of the shard (a single Lucene index) on separate remote storage (such as 
S3). To achieve this, during each checkpoint we take a snapshot of the current 
generation of SegmentInfos associated with the Lucene index and upload the 
associated files, along with a metadata file (associated with a generation of 
SegmentInfos), to the remote store. With multiple IndexWriters for the same 
shard, a list of SegmentInfos (one for each group) will be associated with it. 
We can handle this by creating a list of snapshots and their separate metadata 
files, but this essentially translates to maintaining separate Lucene indexes 
for each shard, making each segment group a shard on the client (OpenSearch) 
end.
   2. To address the above issue, we can try creating a common wrapper for the 
list of SegmentInfos, similar to what we did for IndexWriters with 
CompositeIndexWriter. However, this approach also has issues, as the common 
wrapper would need a common generation and version. Additionally, it should be 
possible to associate the common wrapper with a specific index commit to allow 
opening a CompositeIndexWriter at a specific index commit point. Furthermore, 
when a CompositeIndexWriter is opened using a commit point, it should be 
possible to open all the group-level sub-IndexWriters at that index commit 
point. While this is doable, it is extremely complex to implement.
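
The bookkeeping such a common wrapper would need can be sketched as follows. This is only a toy model of the idea, with generations as plain numbers: `CompositeCommitLog` is a hypothetical name, and the sketch ignores durability, versions and file lists, which is where most of the real complexity would be.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class CompositeCommitLog {
    // composite generation -> (group id -> that group's own commit generation)
    private final Map<Long, Map<String, Long>> commits = new TreeMap<>();
    private long generation = 0;

    // Records one composite commit that pins every group to its current generation.
    long commit(Map<String, Long> groupGenerations) {
        generation++;
        commits.put(generation, new LinkedHashMap<>(groupGenerations));
        return generation;
    }

    // Tells the caller which per-group generation to open for a given composite commit point.
    Map<String, Long> groupGenerationsAt(long compositeGeneration) {
        Map<String, Long> pinned = commits.get(compositeGeneration);
        if (pinned == null) {
            throw new IllegalArgumentException("unknown composite generation " + compositeGeneration);
        }
        return pinned;
    }
}
```

Opening the composite writer at composite generation 1 would then mean opening each group-level writer at the generation recorded for it in that entry.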
   
   





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-19 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2360651201

   ## Summary
   
   In summary, the problem can be broken down into three sub-problems:
   
   1. Having an abstraction to write data into different groups (multiple 
writers).
   2. Having a single interface/entity across groups for client (OpenSearch) 
interaction with Lucene (for sequence id generation, segment replication, etc).
   3. Merging only the segments belonging to the same group.
   
   
   None of the approaches we investigated solves the above three sub-problems 
cleanly with reasonable complexity. That leaves us with the originally suggested 
approach of using different DWPTs to represent different groups. The original 
approach: 
   
   1. Uses a single IndexWriter and different DWPTs, which provides a clear 
abstraction for different groups.
   2. Performs updates through a single IndexWriter, so at any given time only 
a single SegmentInfos, generation and version are associated with a Lucene 
index.
   3. Does require a new merge policy to merge the segments belonging to the 
same group.
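
The core constraint of such a merge policy can be sketched independently of Lucene: bucket segments by group first, and only then pick merge candidates within each bucket. `SegmentDesc` and `GroupAwareMergeSelector` are hypothetical names, the `minSegments` threshold is an assumption made for the example, and this is not Lucene's `MergePolicy` API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical segment descriptor: a name plus the group label it was flushed under.
class SegmentDesc {
    final String name;
    final String group;

    SegmentDesc(String name, String group) {
        this.name = name;
        this.group = group;
    }
}

class GroupAwareMergeSelector {
    // Returns one candidate merge per group that has at least minSegments segments.
    // Segments of different groups never appear in the same candidate.
    static List<List<String>> selectMerges(List<SegmentDesc> segments, int minSegments) {
        Map<String, List<String>> byGroup = new LinkedHashMap<>();
        for (SegmentDesc s : segments) {
            byGroup.computeIfAbsent(s.group, g -> new ArrayList<>()).add(s.name);
        }
        List<List<String>> merges = new ArrayList<>();
        for (List<String> names : byGroup.values()) {
            if (names.size() >= minSegments) {
                merges.add(names);
            }
        }
        return merges;
    }
}
```

A real policy would also weigh segment sizes and deletes; the sketch only shows that the group label has to survive on each segment for the selection to work.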
   
   
   Open for thoughts and suggestions.
   





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-19 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2360649893

   ## Approach 3: Combining group level IndexWriter with addIndexes
   
   
![approach3](https://github.com/user-attachments/assets/32ea3baa-0ae6-4a60-84e9-352a0e1e6a5e)
   
   In this approach, to make multiple group-level IndexWriters function as a 
unified entity, we use Lucene’s addIndexes API to combine them. This ensures 
that the top-level IndexWriter shares a common segments_N, SegmentCommitInfos, 
generation and version. During an indexing or update request, the client (such 
as OpenSearch) will continue to route requests to the appropriate IndexWriter 
based on the document’s criteria evaluation. During flush, in addition to 
flushing the segments of the group-level IndexWriters, we will merge/move them 
into a single parent IndexWriter using the addIndexes API call. For read (or 
replication) operations, the client (like OpenSearch) will now open a Reader on 
the parent IndexWriter.
   
   ### Pros
   
   1. Having a common IndexWriter with a single SegmentCommitInfos, generation 
etc. ensures that the client (OpenSearch) is still interacting with Lucene 
through a single entity.
   
   ### Cons
   
   1. When segments of different groups are combined into a single index, we 
must ensure that only segments within a group are merged together. This will 
require a new merge policy for the top-level IndexWriter.
   2. The Lucene addIndexes API seems to acquire a write lock on each 
directory associated with the group-level IndexWriters, preventing active writes 
during the index merging process. This can cause downtime on the client 
(OpenSearch) side during this period. However, this issue could be mitigated if 
Lucene provided an API to combine these group-level indexes as a soft 
reference, without copying the segment files or locking the group-level 
IndexWriters.
   3. Additionally, index merging involves copying files from the group-level 
IndexWriters’ directories to the parent IndexWriter directory. This is a 
resource-intensive operation, consuming disk IO and CPU cycles. Moreover, since 
we open a Reader on the parent IndexWriter (the IndexWriter combined from the 
group-level IndexWriters), slow index merging may increase reader refresh times, 
delaying visibility of changes for search.
   
   





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-19 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2360641099

   Thanks [mikemccand](https://github.com/mikemccand) and 
[vigyasharma](https://github.com/vigyasharma) for the suggestions. We evaluated 
different approaches for using a separate IndexWriter per group:
   
   ## Approach 1: Using filter directory for each group
   
   
![approach1](https://github.com/user-attachments/assets/857b6bad-8e31-480a-8b3e-c9af06479b9e)
   
   In this approach, each group (in the above example the grouping criterion is 
status code) has its own IndexWriter, associated with a distinct logical filter 
directory that attaches a filename prefix to the segments according to their 
group (200_, 400_ etc.). These directories are backed by a single physical 
directory. Since a different IndexWriter manages the segments of each group, 
only segments belonging to the same group are ever merged together. A 
CompositeIndexWriter wraps the group-level IndexWriters for client (OpenSearch) 
interaction. When adding or updating a document, this CompositeIndexWriter 
delegates the operation to the corresponding criteria-specific IndexWriter. The 
CompositeIndexWriter is associated with the top-level physical directory.
   
   To address the sequence number conflict between different IndexWriters, a 
common sequence number generator was used for all IndexWriters within a shard. 
This ensures that sequence numbers are continuously increasing across the 
IndexWriters in the same shard.
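
A minimal sketch of the shared-generator idea, assuming one generator instance per shard that is handed to every group-level writer. `ShardSequenceGenerator` and `SequencedGroupWriter` are hypothetical names; the POC's actual wiring is not shown in this thread.

```java
import java.util.concurrent.atomic.AtomicLong;

// One generator per shard, shared by every group-level writer in that shard.
class ShardSequenceGenerator {
    private final AtomicLong next = new AtomicLong(0);

    // Thread-safe: concurrent callers never receive the same number.
    long nextSeqNo() {
        return next.getAndIncrement();
    }
}

// Stand-in for a group-level writer that stamps each operation with a shard-wide number.
class SequencedGroupWriter {
    final String group;
    private final ShardSequenceGenerator seq;

    SequencedGroupWriter(String group, ShardSequenceGenerator seq) {
        this.group = group;
        this.seq = seq;
    }

    // Returns the shard-wide sequence number assigned to this operation.
    long addDocument(String doc) {
        return seq.nextSeqNo();
    }
}
```

Because both writers draw from the same counter, interleaved writes to the `2xx` and `5xx` groups receive 0, 1, 2, ... with no duplicates across groups.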
   
   ### Pros
   
   1. Using separate IndexWriters for different groups ensures that documents 
from different groups are categorised into distinct segments. This approach also 
eliminates the need to modify the merge policy.
   2. Using a common sequence number generator prevents sequence number 
conflicts among IndexWriters belonging to the same shard. However, since 
sequence number generation is delegated to the client (OpenSearch), the client 
must ensure that the sequence numbers are monotonically increasing.
   
   ### Cons
   
   1. Lucene internally searches for files starting with segments_ or 
pending_segments_ for operations like getting the last commit generation of an 
index, or for write.lock to check whether a lock is held on a directory, etc. 
Attaching a prefix to these file names may break Lucene’s internal operations.
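
The failure mode can be illustrated with a simplified model of a commit-file scan. This is a sketch only, loosely modelled on a starts-with check over a directory listing; `CommitFileScan` is a hypothetical class, and parsing the generation suffix in base 36 is an assumption made for the example.

```java
class CommitFileScan {
    // Simplified model of scanning a directory listing for the newest commit file.
    // Returns -1 when no file name starts with "segments_".
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String f : files) {
            if (f.startsWith("segments_")) {
                // Assumption for this sketch: the suffix is the generation in base 36.
                long gen = Long.parseLong(f.substring("segments_".length()), Character.MAX_RADIX);
                max = Math.max(max, gen);
            }
        }
        return max;
    }
}
```

With an unprefixed listing the scan finds the commit; once the same file is named `200_segments_3`, the starts-with check no longer matches and the directory looks as if it has no commit at all.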
   
   
   





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-09-16 Thread via GitHub


vigyasharma commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2354135495

   I wonder if we can leverage IndexWriter's `addIndexes(Directory... dirs)` 
[API](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L2984)
 for this. We could create separate indexes for every category (log groups 2xx, 
4xx, 5xx in the example here), and combine them into one using this API. 
Internally, this version of the API simply copies over all segment files in the 
directory, so it should be pretty fast.
   
   This could mean that each shard for an OpenSearch/Elasticsearch index would 
maintain internal indexes for each desired category, and use the API to combine 
them into a common "shard" index at every flush? We'd still need a way to 
maintain category labels for a segment during merging, but that's a common 
problem for any approach we take.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-07-02 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2194731622

   Thanks a lot for the suggestions @jpountz and @mikemccand. 
   
   As suggested above, we worked on a POC to explore using a separate 
IndexWriter for each group. Each IndexWriter is associated with a distinct 
logical filter directory, which attaches a filename prefix according to the 
group. These directories are backed by a single multi-tenant directory. However, 
this approach presents several challenges on the client (OpenSearch) side. Each 
IndexWriter now generates its own sequence numbers. In a service like 
OpenSearch, the Translog operates on sequence numbers at the Lucene index 
level, so when the same sequence number is generated by different IndexWriters 
for the same Lucene index, conflicts can occur during operations like Translog 
replay. Additionally, the local and global checkpoints maintained during 
recovery operations in a service like OpenSearch require sequence numbers to be 
continuously increasing, which won't hold with multiple IndexWriters.
   
   We did not face these issues when different groups were represented by 
different DWPT pools, because only a single IndexWriter was writing to a Lucene 
index, generating continuously increasing sequence numbers. The complexity of 
handling different segments for different groups is managed internally at the 
Lucene level rather than being propagated to the client side. Feel free to share 
any further suggestions you may have on this.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-06-04 Thread via GitHub


jpountz commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2147657867

   > do we have such a class already (that would distinguish the tenants via 
filename prefix or so)? That's a nice idea all by itself (separate from this 
use case) -- maybe open a spinoff to explore that?
   
   I don't think we do. +1 to exploring this separately. I like that we then 
wouldn't need to tune the merge policy because it would naturally only see 
segments that belong to its group.
   
   > You would also need a clean-ish way to manage a single total allowed RAM 
bytes across the N IndexWriters? I think IndexWriter's flushing policy or RAM 
accounting was already generalized to allow for this use case, but I don't 
remember the details.
   
   Right, `IndexWriter#flushNextBuffer()` and `IndexWriter#ramBytesUsed()` 
allow building this sort of thing on top of Lucene. It would be nice if Lucene 
provided more ready-to-use utilities around this.
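
One way such a utility could look is sketched below. `FlushableWriter` is a hypothetical interface that mirrors only the two method names mentioned above so the example runs without Lucene; `SharedRamBudget` and `FakeWriter` are likewise made up for illustration, and picking the largest writer first is an assumed policy.

```java
import java.util.List;

// Stand-in exposing the two IndexWriter methods named above; not a Lucene type.
interface FlushableWriter {
    long ramBytesUsed();
    void flushNextBuffer();
}

class SharedRamBudget {
    // Flushes from whichever writer currently buffers the most until the total fits the budget.
    // Returns the number of flush calls made.
    static int enforce(List<? extends FlushableWriter> writers, long maxTotalBytes) {
        int flushes = 0;
        while (total(writers) > maxTotalBytes) {
            FlushableWriter largest = null;
            for (FlushableWriter w : writers) {
                if (largest == null || w.ramBytesUsed() > largest.ramBytesUsed()) {
                    largest = w;
                }
            }
            if (largest == null || largest.ramBytesUsed() == 0) {
                break; // nothing left to flush
            }
            long before = largest.ramBytesUsed();
            largest.flushNextBuffer();
            flushes++;
            if (largest.ramBytesUsed() >= before) {
                break; // no progress, avoid spinning
            }
        }
        return flushes;
    }

    static long total(List<? extends FlushableWriter> writers) {
        long sum = 0;
        for (FlushableWriter w : writers) {
            sum += w.ramBytesUsed();
        }
        return sum;
    }
}

// Fake writer for the sketch: each flush releases one fixed-size buffer.
class FakeWriter implements FlushableWriter {
    private long used;
    private final long bufferSize;

    FakeWriter(long used, long bufferSize) {
        this.used = used;
        this.bufferSize = bufferSize;
    }

    public long ramBytesUsed() { return used; }

    public void flushNextBuffer() { used = Math.max(0, used - bufferSize); }
}
```

With writers buffering 60, 30 and 10 bytes and a budget of 70, the loop flushes the first writer twice (20 bytes per flush in this fake) and stops once the total drops to 60.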
   
   > Searching across the N separate shards as if they were a single index is 
also possible via MultiReader, though, I'm not sure how well intra-query 
concurrency works -- maybe it works just fine because the search-time 
leaves/slices are all union'd across the N shards?
   
   Indeed, I'd expect it to work just fine.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-06-03 Thread via GitHub


mikemccand commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2145162839

   I like @jpountz's idea of just using separate `IndexWriter`s for this 
use-case, instead of adding custom routing logic to the separate DWPTs inside a 
single `IndexWriter` and then also needing a custom `MergePolicy` that ensures 
that only the like-segments are merged.  A separate `IndexWriter` would cleanly 
achieve both of these?
   
   The idea of using a single underlying multi-tenant `Directory` with multiple 
`FilterDirectory` wrappers (one per `IndexWriter`) is interesting -- do we have 
such a class already (that would distinguish the tenants via filename prefix or 
so)?  That's a nice idea all by itself (separate from this use case) -- maybe 
open a spinoff to explore that?
   
   You would also need a clean-ish way to manage a single total allowed RAM 
bytes across the N `IndexWriter`s?   I think `IndexWriter`'s flushing policy or 
RAM accounting was already generalized to allow for this use case, but I don't 
remember the details.
   
   Searching across the N separate shards as if they were a single index is 
also possible via `MultiReader`, though, I'm not sure how well intra-query 
concurrency works -- maybe it works just fine because the search-time 
leaves/slices are all union'd across the N shards?





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-06-02 Thread via GitHub


jpountz commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2144402163

   > However, implementing this approach would lead to significant overhead on 
the client side (such as OpenSearch) both in the terms of code changes and 
operational overhead like metadata management.
   
   Can you give more details? The main difference that comes to mind is that 
using multiple `IndexWriter`s requires multiple `Directory`s as well and 
OpenSearch may have a strong assumption that there is a 1:1 mapping between 
shards and folders on disk. But this could be worked around with a filter 
`Directory` that flags each index file with a prefix that identifies the group 
that each index file belongs to?





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-06-02 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2144396439

   Attaching a [preliminary PR](https://github.com/apache/lucene/pull/13409) 
for the POC related to above issue to share my understanding. Please note that 
this is not the final PR.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-06-02 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2144392824

   Attaching a [preliminary PR](https://github.com/apache/lucene/pull/13409) 
for the POC related to above issue to share my understanding. Please note that 
this is not the final PR.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-27 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2133354180

   > I agree that better organizing data across segments yields significant 
benefits, I'm only advocating for doing this by maintaining a separate 
IndexWriter for each group instead of doing it inside of DocumentsWriter.
   
   Sorry, I missed answering this part in my earlier response. We did explore 
the approach of creating an IndexWriter/Lucene index (or OpenSearch shard) for 
each group. However, implementing it would require significant changes on the 
client (OpenSearch) side. Besides, having multiple additional shards would 
also add considerable overhead from tasks such as metadata management. On the 
other hand, maintaining separate DWPT pools for different groups would require 
minimal changes inside Lucene. The overhead would also be lower, as the Lucene 
shard would still be maintained as a single physical unit.





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-24 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2129984751

   Thanks for the suggestion. Clustering within the segment does improve 
skipping of documents (especially when combined with the [BKD 
optimisation](https://github.com/apache/lucene-solr/pull/1351) to skip 
non-competitive documents). But it still rules out several optimisations that 
become possible with separate DWPT pools for different groups:
   
   - Having a separate pool of DWPTs (and thus separate segments) for each 
group also reduces the cardinality of a field's values within a segment. 
Optimisations like [precomputing aggregations with a Star Tree 
index](https://github.com/opensearch-project/OpenSearch/issues/12498) tend to 
perform better when the cardinality of the field is not too high. 
   - With clustering inside a segment, segments can still be large. If we 
store more relevant logs (like 5xx and 4xx) in different segments from less 
relevant ones (like 2xx), the segments containing error and fault logs will 
be smaller (since error logs are generally fewer). This lets us apply storage 
optimisations such as keeping more relevant logs (like 5xx logs) on hot 
storage (e.g. the node's disk), while less relevant logs go directly to 
cheaper remote storage (e.g. AWS S3, Google Cloud Storage, MinIO).
   
   If logs from different groups share the same segments, we cannot build 
these optimisations on top of the segment topology. Let me know if this makes 
sense.
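   
   To make the grouping and tiering idea concrete, here is a minimal sketch. 
All names (`StatusGroupSketch`, `groupOf`, `tierOf`, `countPerGroup`) are 
hypothetical and exist in neither Lucene nor OpenSearch, and the "4xx and 5xx 
stay hot, everything else goes remote" rule is only an assumed policy for 
illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names exist in Lucene or OpenSearch.
final class StatusGroupSketch {

  /** Maps an HTTP status code to its group key, e.g. 503 -> "5xx". */
  static String groupOf(int statusCode) {
    return (statusCode / 100) + "xx";
  }

  /** Assumed policy: error and fault logs stay on hot storage, everything else goes remote. */
  static String tierOf(String group) {
    return (group.equals("4xx") || group.equals("5xx")) ? "hot" : "remote";
  }

  /** Counts how many log entries would land in each group's segments. */
  static Map<String, Integer> countPerGroup(List<Integer> statusCodes) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (int code : statusCodes) {
      counts.merge(groupOf(code), 1, Integer::sum);
    }
    return counts;
  }
}
```

   With a typical log mix, the 2xx count dominates, so the segments for the 
4xx and 5xx groups stay small enough to keep on the hot tier.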





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-23 Thread via GitHub


RS146BIJAY commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2126884759

   Thanks Mike and Adrien for the feedback.
   
   > You do not mention it explicitly in the issue description, but presumably 
this only makes sense if an index sort is configured, otherwise merges may 
break the clustering that you are trying to create in the first place?
   
   Not exactly. As mentioned, to ensure that the grouping criteria invariant 
is maintained even during segment merges, we are introducing a new merge 
policy that acts as a decorator over the existing TieredMergePolicy. During a 
segment merge, this policy categorizes segments according to their grouping 
function outcomes and only merges segments within the same category, thus 
maintaining the grouping criteria’s integrity throughout the merge process.
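   
   A simplified model of such a decorator is sketched below. It deliberately 
does not use Lucene's `MergePolicy` signatures: `Segment`, `MergeSelector`, 
`findMerges`, and `mergeAll` are hypothetical stand-ins, and `mergeAll` only 
plays the role of the wrapped policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model, not the Lucene MergePolicy API: a segment is just a name
// plus the group its documents belong to.
final class GroupAwareMergeSketch {

  static final class Segment {
    final String name;
    final String group;

    Segment(String name, String group) {
      this.name = name;
      this.group = group;
    }
  }

  /** Stand-in for the wrapped merge policy: proposes merges among the given candidates. */
  interface MergeSelector {
    List<List<Segment>> select(List<Segment> candidates);
  }

  /** Trivial delegate for the demo: merge all candidates together if there are at least two. */
  static List<List<Segment>> mergeAll(List<Segment> candidates) {
    List<List<Segment>> out = new ArrayList<>();
    if (candidates.size() >= 2) {
      out.add(candidates);
    }
    return out;
  }

  /** Decorator: splits segments by group, then lets the delegate choose merges per group. */
  static List<List<Segment>> findMerges(List<Segment> segments, MergeSelector delegate) {
    Map<String, List<Segment>> byGroup = new LinkedHashMap<>();
    for (Segment s : segments) {
      byGroup.computeIfAbsent(s.group, g -> new ArrayList<>()).add(s);
    }
    List<List<Segment>> merges = new ArrayList<>();
    for (List<Segment> sameGroup : byGroup.values()) {
      // The delegate never sees segments from two different groups at once.
      merges.addAll(delegate.select(sameGroup));
    }
    return merges;
  }
}
```

   Because the delegate is only ever handed segments of one group, any merge 
it proposes preserves the grouping invariant by construction.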
   
   > I wonder if we could do something within a single DWPT pool, e.g. could we 
use rendez-vous hashing to optimistically try to reuse the same DWPT for the 
same group as often as possible, but only on a best-effort basis, not trading 
concurrency or creating more DWPTs than indexing concurrency requires?
   
   I believe that even if we use a single DWPT pool with rendezvous hashing to 
distribute DWPTs, we would end up creating the same number of DWPTs as with 
separate DWPT pools for different groups. Consider an example where we are 
grouping logs by status code for an index and 8 concurrent indexing threads 
are indexing 2xx status code logs. This will create 8 DWPTs. Now 4 threads 
start indexing 4xx status code logs concurrently. This will require 4 extra 
DWPTs if we want to maintain status-code-based grouping. Instead of creating 
new DWPTs, we could try reusing 4 of the existing DWPTs created for 2xx status 
code logs on a best-effort basis. But this would again mix 4xx status code 
logs with 2xx status code logs, defeating the purpose of status-code-based 
grouping. Let me know if my understanding is correct.
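   
   The arithmetic in this example can be reproduced with a toy model of 
per-group pools. `GroupedPoolSketch` and its `Writer` class are hypothetical: 
`Writer` merely stands in for a DWPT, and the model ignores flushing and 
locking entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of per-group writer pools; "Writer" stands in for a DWPT and is not Lucene code.
final class GroupedPoolSketch {

  static final class Writer {
    final String group;

    Writer(String group) {
      this.group = group;
    }
  }

  private final Map<String, Deque<Writer>> freeByGroup = new HashMap<>();
  private int created = 0;

  /** Reuses a free writer of the same group, otherwise creates a new one. */
  Writer checkout(String group) {
    Deque<Writer> free = freeByGroup.computeIfAbsent(group, g -> new ArrayDeque<>());
    Writer w = free.poll();
    if (w == null) {
      w = new Writer(group);
      created++;
    }
    return w;
  }

  /** Returns a writer to the free list of its own group only, so groups never mix. */
  void release(Writer w) {
    freeByGroup.computeIfAbsent(w.group, g -> new ArrayDeque<>()).push(w);
  }

  /** Total number of writers ever created across all groups. */
  int createdCount() {
    return created;
  }
}
```

   Checking out 8 writers for "2xx" and then 4 for "4xx" without releasing any 
yields 12 writers in total, which matches the count in the example above.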





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-22 Thread via GitHub


jpountz commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2125281483

   This is an interesting idea!
   
   You do not mention it explicitly in the issue description, but presumably 
this only makes sense if an index sort is configured, otherwise merges may 
break the clustering that you are trying to create in the first place?
   
   > The DocumentWriterThreadPool will now maintain a [distinct pool of 
DWPTs](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#L47)
 for each possible outcome.
   
   I'm a bit uncomfortable with this approach. It is so heavy that it wouldn't 
perform much better than maintaining a separate `IndexWriter` per group? I 
wonder if we could do something within a single DWPT pool, e.g. could we use 
rendez-vous hashing to optimistically try to reuse the same DWPT for the same 
group as often as possible, but only on a best-effort basis, not trading 
concurrency or creating more DWPTs than indexing concurrency requires?
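   
   As a rough illustration of the best-effort idea, here is a sketch of 
rendezvous (highest-random-weight) hashing over writer slots. It assumes DWPTs 
can be identified by small non-negative slot ids, which is an assumption of 
this sketch and not how Lucene's pool works today; `RendezvousSketch`, 
`weight`, and `pick` are hypothetical names.

```java
import java.util.List;

// Sketch of rendezvous (highest-random-weight) hashing over writer slots.
// Slot ids stand in for DWPTs; this is not how Lucene currently picks a DWPT.
final class RendezvousSketch {

  /** Mixes group and slot into a pseudo-random weight using the MurmurHash3 64-bit finalizer. */
  static long weight(String group, int slot) {
    long h = group.hashCode() * 0x9E3779B97F4A7C15L + slot;
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
  }

  /** Returns the free slot with the highest weight for this group, or -1 if none is free. */
  static int pick(String group, List<Integer> freeSlots) {
    int best = -1;
    long bestWeight = Long.MIN_VALUE;
    for (int slot : freeSlots) {
      long w = weight(group, slot);
      if (best == -1 || w > bestWeight) {
        best = slot;
        bestWeight = w;
      }
    }
    return best;
  }
}
```

   A group keeps landing on the same slot for as long as that slot is free, 
and falls back to its next-preferred free slot when it is busy. That fallback 
is what makes the scheme best-effort rather than a strict partition.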





Re: [I] Support for criteria based DWPT selection inside DocumentWriter [lucene]

2024-05-21 Thread via GitHub


mikemccand commented on issue #13387:
URL: https://github.com/apache/lucene/issues/13387#issuecomment-2122642094

   I like this idea!  I hope we can find a simple enough API exposed through 
IWC to enable the optional grouping.
   
   This also has nice mechanical sympathy / symmetry with the distributed 
search engine analog.  A distributed search engine like OpenSearch indexes and 
searches into N shards across multiple servers, and this is nearly precisely 
the same logical problem that Lucene tackles on a single multi-core server when 
indexing and searching into N segments, especially as Lucene's intra-query 
concurrency becomes the norm/default and improves (e.g. allowing intra-segment 
per query concurrency as well).  We should cross-fertilize more based on this 
analogy: the two problems are nearly the same.  A shard, a segment, same thing 
heh (nearly).
   
   So this proposal brings the custom document routing feature from 
OpenSearch down into Lucene's segments.

