Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
That makes sense! I can see how masked reads are useful in that kind of approach too. Thanks for the explanation. On Fri, Jun 2, 2023, 8:45 AM Will Jones wrote: > > The main downside with using the mask (or any solution based on a filter > > node / filtering) is that it requires that the

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Will Jones
> The main downside with using the mask (or any solution based on a filter > node / filtering) is that it requires that the delete indices go into the > plan itself. So you need to first read the delete files and then create > the plan. And, if there are many deleted rows, this can be costly.

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
Also, for clarity, I do agree with Gang that these are both valuable features in their own right. A mask makes a lot of sense for page indices. On Fri, Jun 2, 2023 at 7:36 AM Weston Pace wrote: > > then I think the incremental cost of adding the > > positional deletes to the mask is probably

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
> then I think the incremental cost of adding the > positional deletes to the mask is probably lower than the anti-join. Do you mean developer cost? Then yes, I agree. Although there may be some subtlety in the pushdown to connect a dataset filter to a parquet reader filter. The main downside

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Will Jones
That's a good point, Gang. To perform deletes, we definitely need the row index, so we'll want that regardless of whether it's used in scans. > I'm not sure a mask would be the ideal solution for Iceberg (though it is a reasonable feature in its own right) because I think position-based deletes,

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Gang Wu
IMO, the adding a row_index column from the reader is orthogonal to the mask implementation. Table formats (e.g. Apache Iceberg and Delta) require the knowledge of row index to finalize row deletion. It would be trivial to natively support row index from the file reader. Best, Gang On Fri, Jun

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Weston Pace
I agree that having a row_index is a good approach. I'm not sure a mask would be the ideal solution for Iceberg (though it is a reasonable feature in its own right) because I think position-based deletes, in Iceberg, are still done using an anti-join and not a filter. That being said, we

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-05-29 Thread Will Jones
Hi Rusty, At first glance, I think adding a row_index column would make sense. To be clear, this would be an index within a file / fragment, not across multiple files, which don't necessarily have a known ordering in Acero (IIUC). However, another approach would be to take a mask argument in the

[DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-05-29 Thread Rusty Conover
Hi Arrow Team, I wanted to suggest an improvement regarding Acero's Scan node. Currently, it provides useful information such as __fragment_index, __batch_index, __filename, and __last_in_fragment. However, it would be beneficial to have an additional column that returns an overall "row index"