Try the arrow::compute::Take function

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/take.h#L121
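
To make the semantics concrete, here is a minimal pure-Python sketch of what a take operation does with a selection vector: the same list of row indices is applied to every column, so the indices are computed once and reused. This is only a model of the behavior, not the Arrow API; the helper `take_rows`, the dict-of-lists "batch", and the sample values are all hypothetical.

```python
from typing import Dict, List, Sequence


def take_rows(columns: Dict[str, Sequence], indices: Sequence[int]) -> Dict[str, List]:
    """Gather the rows named by `indices` from every column (hypothetical helper)."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


# Hypothetical two-column batch standing in for a RecordBatch.
batch = {"a": [1.0, 31.0, 46.0, 3.0], "b": [5, 6, 7, 8]}

# Stand-in for the selection vector a filter would produce for `a < 10`:
# the positions of the rows that pass the condition.
selection = [i for i, value in enumerate(batch["a"]) if value < 10.0]

# One set of indices, reused for all columns.
filtered = take_rows(batch, selection)
```

With real Arrow data, the Take kernel plays the role of `take_rows` for each column, with the selection vector's index array as the indices.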

On Sat, Apr 4, 2020 at 8:19 PM Yue Ni <[email protected]> wrote:
>
> Hi Wes,
>
> Thanks for the reply, but I don't think this is what I am looking for.
>
> It seems to me this `result.to_array()` will only return the array for the 
> selection vector 
> (https://github.com/apache/arrow/blob/b07c2626cb3cdd3498b41da9feedf7c8319baa27/python/pyarrow/gandiva.pyx#L130),
>  but it is not clear to me how I can use the selection vector to filter the 
> original record batch.
>
> If I understand correctly, in this test case, 
> `result.to_array().equals(pa.array(range(1000), type=pa.uint32()))` is 
> asserting that the selection vector has integer index values from [0, 1000), 
> but I am looking to obtain an array from the filtered record batch, which 
> should be an array of floats here. I know I can iterate indices in the 
> selection vector and use it to retrieve each row in original record batch 
> columns, but I am not certain if this is the right way to do it. For example, 
> if I have multiple columns in the original record batch, do I need to iterate 
> the selection vector multiple times to filter each of the columns? Since this 
> is a common task, I expect there is an easy/efficient API to do this.
>
> Basically, I am looking for something like:
> selection_vector = filter.evaluate(record_batch, pa.default_memory_pool())
> filtered_column_arrays_in_record_batch = 
> record_batch.filter(selection_vector) # what is the API for doing this?
>
> In the C++ test cases, the closest thing I found is to construct a gandiva 
> projector that uses the selection vector, but every test case there requires 
> the client to construct a gandiva expression to build the projector (for example, 
> in this test case, a {sum_expr} is used for constructing the projector, 
> https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc#L121
>  ).
>
> I wonder if the filtering can be done without creating a projection 
> expression. At the same time, if a projector is expected to be used for doing 
> this, what projector expressions should be used if I want to keep all the 
> columns as they are but just with some rows filtered based on the criteria 
> given?
>
> On Sun, Apr 5, 2020 at 7:27 AM Wes McKinney <[email protected]> wrote:
>>
>> You can see an example of filtering via the Python bindings
>>
>> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L89
>>
>> This creates a gandiva::Filter using gandiva::Filter::Make, which can
>> be used to filter a RecordBatch
>>
>> Is this what you need?
>>
>> On Fri, Apr 3, 2020 at 7:12 PM Yue Ni <[email protected]> wrote:
>> >
>> > Hi there,
>> >
>> > I am using the gandiva C++ library for processing RecordBatches. I would 
>> > like to know how I can apply gandiva::Filter to a RecordBatch so that I 
>> > can do some filtering without using the projector.
>> >
>> > Since I could not find any documentation for it, I read some of the source 
>> > code, and here are the test cases I found that use it:
>> > 1) 
>> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_test.cc
>> > 2) 
>> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc
>> >
>> > From my reading, it is possible to get a SelectionVector by using 
>> > the gandiva::Filter, and you can then use the SelectionVector 
>> > with the gandiva::Projector to filter a RecordBatch when doing projection. 
>> > My questions are:
>> > 1) if I don't want to do any projection but simply filtering, what is the 
>> > recommended way to do it?
>> > 2) I am trying to handle a case like "SELECT * FROM table WHERE blah". 
>> > Is it recommended to apply filtering without projection in this case, or is 
>> > there an alternative approach?
>> >
>> > Thanks.
>> >
>> > Regards,
>> > Yue
>> >
