I think C++14 is fine for optional dependencies and shouldn't block any development work right now. Note that we should be able to raise the minimum required standard to C++14 as soon as April or May of this year, since we will no longer have to support one of the last gcc < 5 toolchains (for R 3.5, IIUC).

On Wed, Mar 3, 2021 at 5:41 PM Yeshwanth Sriram <[email protected]> wrote:
>
> Hi Micah,
>
> Thank you for the detailed response. Apologies for not responding earlier.
>
> a.) I looked at the latencies with and without filtering based on just
> foreach, and the latency is dominated by the parquet/write operation. So I'm
> going to go with what I have, which already provides a substantial
> improvement for my use case.
>
> b.) I would like to contribute an implementation of ANY over booleans as an
> Arrow/compute kernel. I'm waiting for permission to come through.
>
> I'm also interested in contributing to an Azure/ADLS filesystem, but the
> library I was looking at is C++14: https://github.com/Azure/azure-sdk-for-cpp .
> Is C++14 a no-go for a dependency in Arrow (even a conditional one)?
>
> Thank you
> Yesh
>
> On Feb 28, 2021, at 2:09 PM, Micah Kornfield <[email protected]> wrote:
>
> Hi Yeshwanth,
> I think you can do the first part of the filtering using the Equals kernel
> and the IsIn kernel on the child arrays of the Map. I took a quick look, but
> I don't think there is anything implemented that would allow you to map the
> resulting bitmaps back to the parent lists. It seems that we would want to
> add an "Any" function for List<Bool> that returns a Bool array indicating
> whether any of the elements in each list are true. There is already one for
> flat Boolean arrays [1], but I don't think that is useful here.
>
> So I think the logic that you would ultimately want, in pseudo-code, is:
>
> children_bitmap = Equals(map.key, "some string") &&
>                   IsIn(map.struct.id, ["aaa", "bee", "see"])
> list = MakeList(map.offsets, children_bitmap)
> final_selection = Any(list)
>
> Is the new kernel something you would be interested in contributing?
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/8294
>
> On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <[email protected]>
> wrote:
>>
>> I'm using C++/Arrow to filter large parquet files, and I'm able to do this
>> successfully. The current PoC implementation is based on nested for loops,
>> which I would like to avoid. Instead I'd like to use the built-in
>> filter/take functions, or get some recommendations on how to extract arrays
>> of indices or booleans to filter out rows.
>>
>> The input (data) array/column type is MapArray[key:String,
>> value:StructArray[id:String, …]]
>>
>> The input filter is {filter_key: "some string", filter_ids: ["aaa", "bee",
>> "see", ..] }
>> - where filter_key and filter_ids are matched against the contents of the
>> input MapArray
>>
>> The output I'm looking for is either an array of booleans or the indices of
>> the input array that match the input filter.
>>
>> Thank you
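
For reference, here is a rough sketch of what the pseudo-code in Micah's message computes. It is written against plain std::vector rather than the Arrow API, so every name in it (FlatMapColumn, ChildrenBitmap, AnyPerRow) is hypothetical and only models the layout: map entries stored back to back, with offsets marking where each row's entries begin and end. Nulls are ignored here, which a real kernel could not do.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical plain-C++ model of one map column (not an Arrow type):
// entries of all rows are stored back to back, and
// offsets[i]..offsets[i + 1] delimits the entries that belong to row i.
struct FlatMapColumn {
  std::vector<int32_t> offsets;   // size = number of rows + 1
  std::vector<std::string> keys;  // models map.key, one per entry
  std::vector<std::string> ids;   // models map.struct.id, one per entry
};

// Step 1: children_bitmap = Equals(key, filter_key) && IsIn(id, filter_ids)
std::vector<bool> ChildrenBitmap(
    const FlatMapColumn& col, const std::string& filter_key,
    const std::unordered_set<std::string>& filter_ids) {
  std::vector<bool> out(col.keys.size());
  for (std::size_t i = 0; i < col.keys.size(); ++i) {
    out[i] = col.keys[i] == filter_key && filter_ids.count(col.ids[i]) > 0;
  }
  return out;
}

// Steps 2 and 3: view the bitmap through the map offsets (the MakeList
// step) and reduce each row with "any". An empty row yields false.
std::vector<bool> AnyPerRow(const std::vector<int32_t>& offsets,
                            const std::vector<bool>& children) {
  std::vector<bool> out;
  if (offsets.empty()) return out;
  out.reserve(offsets.size() - 1);
  for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
    bool any = false;
    // Stop scanning the row as soon as one matching entry is found.
    for (int32_t j = offsets[row]; j < offsets[row + 1] && !any; ++j) {
      any = children[static_cast<std::size_t>(j)];
    }
    out.push_back(any);
  }
  return out;
}
```

Calling AnyPerRow(col.offsets, ChildrenBitmap(col, key, ids)) gives one boolean per input row, which is the shape the original question asked for. In Arrow itself the per-entry step would presumably go through the existing comparison and set-lookup kernels, and only the per-row "any" reduction is the missing piece discussed above.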
