You might pose your question on [email protected]; more of the
people with closer Gandiva knowledge are present there.

On Sat, Apr 4, 2020 at 10:45 PM Yue Ni <[email protected]> wrote:
>
> Thanks so much. This is exactly what I was looking for, and I will give it a 
> try.
>
> BTW, as a follow-up question: if I want to apply both selection and 
> projection to a record batch, does anyone know whether there is any 
> performance difference between these two approaches:
> 1) use filter to get a selection vector and convert it to an array ==> use 
> arrow::compute::Take to filter the record batch ==> construct a projector to 
> do projection for the filtered record batch
> 2) use filter to get a selection vector ==> construct projector to do 
> projection and apply the selection vector at the same time
>
> The first approach lets me process a query step by step (selection first, 
> then projection). The second approach is more concise, but it does not 
> separate the two steps as cleanly. I prefer the first approach since it makes 
> a filtering-only query easier to handle, but I would like to confirm whether 
> it degrades performance when both selection and projection are needed.
>
> On Sun, Apr 5, 2020 at 11:08 AM Wes McKinney <[email protected]> wrote:
>>
>> Try the arrow::compute::Take function
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/take.h#L121
>>
>> On Sat, Apr 4, 2020 at 8:19 PM Yue Ni <[email protected]> wrote:
>> >
>> > Hi Wes,
>> >
>> > Thanks for the reply, but I don't think this is what I am looking for.
>> >
>> > It seems to me this `result.to_array()` will only return the array for the 
>> > selection vector 
>> > (https://github.com/apache/arrow/blob/b07c2626cb3cdd3498b41da9feedf7c8319baa27/python/pyarrow/gandiva.pyx#L130),
>> >  but it is not clear to me how I can use the selection vector to filter 
>> > the original record batch.
>> >
>> > If I understand correctly, in this test case, 
>> > `result.to_array().equals(pa.array(range(1000), type=pa.uint32()))` 
>> > asserts that the selection vector holds the integer indices in [0, 1000), 
>> > but what I want is an array from the filtered record batch, which should be 
>> > an array of floats here. I know I can iterate over the indices in the 
>> > selection vector and use them to retrieve each row from the original record 
>> > batch columns, but I am not certain this is the right way to do it. For 
>> > example, if the original record batch has multiple columns, do I need to 
>> > iterate over the selection vector once per column to filter each of them? 
>> > Since this is a common task, I expect there is an easy and efficient API for 
>> > it.
>> >
>> > Basically, I am looking for something like:
>> > selection_vector = filter.evaluate(record_batch, pa.default_memory_pool())
>> > filtered_column_arrays_in_record_batch = record_batch.filter(selection_vector)  # what is the API for doing this?
>> >
>> > In the C++ test cases, the closest thing I found is constructing a gandiva 
>> > projector that uses the selection vector, but every test case there requires 
>> > the client to construct a gandiva expression to build the projector (see, for 
>> > example, the {sum_expr} used to construct the projector in this test case):
>> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc#L121
>> >
>> > I wonder whether the filtering can be done without creating a projection 
>> > expression. Alternatively, if a projector is expected to be used for this, 
>> > what projector expressions should I use if I want to keep all the columns 
>> > as they are, with only some rows filtered out based on the given criteria?
>> >
>> > On Sun, Apr 5, 2020 at 7:27 AM Wes McKinney <[email protected]> wrote:
>> >>
>> >> You can see an example of filtering via the Python bindings
>> >>
>> >> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L89
>> >>
>> >> This creates a gandiva::Filter using gandiva::Filter::Make, which can
>> >> be used to filter a RecordBatch
>> >>
>> >> Is this what you need?
>> >>
>> >> On Fri, Apr 3, 2020 at 7:12 PM Yue Ni <[email protected]> wrote:
>> >> >
>> >> > Hi there,
>> >> >
>> >> > I am using the gandiva C++ library to process a RecordBatch. I would 
>> >> > like to know how I can apply gandiva::Filter to a RecordBatch so that 
>> >> > I can do some filtering without using the projector.
>> >> >
>> >> > Since I could not find any documentation for it, I read some of the 
>> >> > source code, and here are the test cases I found that show its usage:
>> >> > 1) 
>> >> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_test.cc
>> >> > 2) 
>> >> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc
>> >> >
>> >> > From my reading, it is possible to get a SelectionVector by using 
>> >> > gandiva::Filter, and the SelectionVector can then be passed to 
>> >> > gandiva::Projector to filter the RecordBatch while doing projection. 
>> >> > My questions are:
>> >> > 1) If I don't want to do any projection but simply filter, what is 
>> >> > the recommended way to do it?
>> >> > 2) I am trying to handle a case like "SELECT * FROM table WHERE 
>> >> > blah". Is it recommended to apply filtering without projection in this 
>> >> > case, or is there an alternative approach?
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Regards,
>> >> > Yue
>> >> >
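
The two pipelines discussed in this thread (filter, then Take, then project, versus filter, then project with the selection vector) can be illustrated with a small conceptual sketch. This is not Gandiva or Arrow code: plain Python lists stand in for Arrow arrays, and the helper names `make_selection_vector`, `take`, and `project` are hypothetical, chosen only to mirror the roles of gandiva::Filter, arrow::compute::Take, and gandiva::Projector. It says nothing about which approach is faster in Gandiva. It only shows that one selection vector can be reused for every column, and that both approaches yield the same rows.

```python
# Conceptual sketch only: Python lists stand in for Arrow arrays, and the
# helper names below are hypothetical, not part of any Arrow/Gandiva API.
from typing import Callable, Dict, List, Optional

Batch = Dict[str, List[float]]
Row = Dict[str, float]


def _row(batch: Batch, i: int) -> Row:
    """Build a single row view from the column-oriented batch."""
    return {name: col[i] for name, col in batch.items()}


def make_selection_vector(batch: Batch, predicate: Callable[[Row], bool]) -> List[int]:
    """Return the indices of rows that satisfy the predicate (the filter's role)."""
    num_rows = len(next(iter(batch.values())))
    return [i for i in range(num_rows) if predicate(_row(batch, i))]


def take(batch: Batch, indices: List[int]) -> Batch:
    """Gather the selected rows from every column (the role Take plays)."""
    return {name: [col[i] for i in indices] for name, col in batch.items()}


def project(batch: Batch, expr: Callable[[Row], float],
            indices: Optional[List[int]] = None) -> List[float]:
    """Evaluate expr per row; if indices is given, only for those rows."""
    num_rows = len(next(iter(batch.values())))
    rows = range(num_rows) if indices is None else indices
    return [expr(_row(batch, i)) for i in rows]


batch = {"a": [1.0, 5.0, 3.0, 8.0], "b": [10.0, 20.0, 30.0, 40.0]}
selection = make_selection_vector(batch, lambda row: row["a"] > 2.0)

# Filter-only query ("SELECT * ... WHERE ..."): the selection vector is
# computed once and reused for all columns, with no projection expression.
filtered = take(batch, selection)

# Approach 1: materialize the filtered batch, then project it.
two_step = project(filtered, lambda row: row["a"] + row["b"])

# Approach 2: project the original batch, applying the selection vector
# in the same pass (no intermediate filtered batch is built).
fused = project(batch, lambda row: row["a"] + row["b"], selection)
```

In this sketch the only structural difference is that approach 1 builds an intermediate filtered batch before projecting, while approach 2 does not. Whether that matters in practice for Gandiva would need to be measured.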
