Re: Record level index with not unique keys

2023-08-16 Thread Vinoth Chandar
Hi, yes, the indexing DAG can support this today, and even if not, it can be easily fixed. The main issue would be how we encode the mapping well. For example, if we want to map from user_id to all events that belong to the user, we need a different, scalable way of storing this mapping. I can organize this

Re: Record level index with not unique keys

2023-07-13 Thread nicolas paris
Hello Prashant, thanks for your time. > With non unique keys how would tagging of records (for updates / deletes) work? Currently both GLOBAL_SIMPLE/BLOOM work out of the box in the mentioned context. See below pyspark script and results. As for the implementation, the tagLocationBacktoRecords

Re: Record level index with not unique keys

2023-07-13 Thread Prashant Wason
Hi Nicolas, The RI feature is designed for max performance as it is at a record-count scale. Hence, the schema is simplified and minimized. With non-unique keys, how would tagging of records (for updates / deletes) work? How would the record index know which mapping of the array to return for a given

Record level index with not unique keys

2023-07-12 Thread nicolas paris
Hi there, I just tested the preview of RLI (rfc-08), an amazing feature. Soon the fast COW (rfc-68) will be based on RLI to get the parquet offsets and allow targeting parquet row groups. RLI is a global index, therefore it assumes the hudi key is present in at most one parquet file. As a result in the
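
To illustrate the uniqueness assumption discussed in this thread, here is a minimal sketch in plain Python. It is a toy model only, not Hudi's record index schema or API: the function names, the sample keys, and the file names are all hypothetical. It contrasts an index that keeps one file location per key with one that keeps every location for a key.

```python
from collections import defaultdict

# Toy model only: these helpers and the sample data are hypothetical,
# they do not reflect how Hudi stores or reads its record level index.

def build_unique_index(records):
    """Keep one file per key; a later record with the same key overwrites the earlier entry."""
    index = {}
    for key, file_id in records:
        index[key] = file_id
    return index

def build_multi_index(records):
    """Keep every distinct file that contains the key, in first-seen order."""
    index = defaultdict(list)
    for key, file_id in records:
        if file_id not in index[key]:
            index[key].append(file_id)
    return dict(index)

# (record key, file containing it); "user_1" appears in two files.
records = [
    ("user_1", "file_a.parquet"),
    ("user_2", "file_a.parquet"),
    ("user_1", "file_b.parquet"),
]

unique_index = build_unique_index(records)
multi_index = build_multi_index(records)

print(unique_index["user_1"])
print(multi_index["user_1"])
```

In this toy model the one-location-per-key index loses track of the first file for the duplicated key, while the one-to-many index keeps both. The thread's open questions are how to encode such a one-to-many mapping at scale and how tagging would choose among the locations.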