That makes sense! I can see how masked reads are useful in that kind of
approach too. Thanks for the explanation.
On Fri, Jun 2, 2023, 8:45 AM Will Jones wrote:
> The main downside with using the mask (or any solution based on a filter
> node / filtering) is that it requires that the delete indices go into the
> plan itself. So you need to first read the delete files and then create
> the plan. And, if there are many deleted rows, this can be costly.
Also, for clarity, I do agree with Gang that these are both valuable
features in their own right. A mask makes a lot of sense for page indices.
On Fri, Jun 2, 2023 at 7:36 AM Weston Pace wrote:
> then I think the incremental cost of adding the
> positional deletes to the mask is probably lower than the anti-join.
Do you mean developer cost? Then yes, I agree. Although there may be some
subtlety in the pushdown to connect a dataset filter to a parquet reader
filter.
That's a good point, Gang. To perform deletes, we definitely need the row
index, so we'll want that regardless of whether it's used in scans.
> I'm not sure a mask would be the ideal solution for Iceberg (though it is
> a reasonable feature in its own right) because I think position-based
> deletes,
IMO, adding a row_index column from the reader is orthogonal to
the mask implementation. Table formats (e.g. Apache Iceberg and
Delta) require knowledge of the row index to finalize row deletions. It
would be trivial to natively support a row index from the file reader.
Best,
Gang
I agree that having a row_index is a good approach. I'm not sure a mask
would be the ideal solution for Iceberg (though it is a reasonable feature
in its own right) because I think position-based deletes, in Iceberg, are
still done using an anti-join and not a filter.
That being said, we
Hi Rusty,
At first glance, I think adding a row_index column would make sense. To be
clear, this would be an index within a file / fragment, not across multiple
files, which don't necessarily have a known ordering in Acero (IIUC).
However, another approach would be to take a mask argument in the
Hi Arrow Team,
I wanted to suggest an improvement regarding Acero's Scan node.
Currently, it provides useful information such as __fragment_index,
__batch_index, __filename, and __last_in_fragment. However, it would
be beneficial to have an additional column that returns an overall
"row index".