Okay, I will post my question there later. Thanks so much for the answers.

On Mon, Apr 6, 2020 at 4:36 AM Wes McKinney <[email protected]> wrote:
> You might pose your question on [email protected]; more of the
> people with closer Gandiva knowledge are present there
>
> On Sat, Apr 4, 2020 at 10:45 PM Yue Ni <[email protected]> wrote:
> >
> > Thanks so much. This is exactly what I was looking for, and I will give
> > it a try.
> >
> > BTW, as a follow-up question: if I want to apply both selection and
> > projection to a record batch, does anyone know whether there is any
> > performance difference between these two approaches?
> > 1) use the filter to get a selection vector and convert it to an array ==>
> > use arrow::compute::Take to filter the record batch ==> construct a
> > projector to do projection on the filtered record batch
> > 2) use the filter to get a selection vector ==> construct a projector to do
> > the projection and apply the selection vector at the same time
> >
> > The first approach allows me to process a query step by step (first
> > selection, then projection); the second approach is more concise but
> > less clearly separated. I prefer the first approach since it makes a
> > filtering-only query easier to handle, but I would like to confirm
> > whether it degrades performance when both selection and projection
> > are needed.
> >
> > On Sun, Apr 5, 2020 at 11:08 AM Wes McKinney <[email protected]> wrote:
> >>
> >> Try the arrow::compute::Take function
> >>
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/take.h#L121
> >>
> >> On Sat, Apr 4, 2020 at 8:19 PM Yue Ni <[email protected]> wrote:
> >> >
> >> > Hi Wes,
> >> >
> >> > Thanks for the reply, but I don't think this is what I am looking for.
> >> >
> >> > It seems to me that `result.to_array()` will only return the array
> >> > for the selection vector
> >> > (https://github.com/apache/arrow/blob/b07c2626cb3cdd3498b41da9feedf7c8319baa27/python/pyarrow/gandiva.pyx#L130),
> >> > but it is not clear to me how I can use the selection vector to
> >> > filter the original record batch.
> >> >
> >> > If I understand correctly, in this test case,
> >> > `result.to_array().equals(pa.array(range(1000), type=pa.uint32()))` is
> >> > asserting that the selection vector holds the integer index values in
> >> > [0, 1000), but I am looking to obtain an array from the filtered record
> >> > batch, which should be an array of floats here. I know I can iterate
> >> > over the indices in the selection vector and use them to retrieve each
> >> > row from the original record batch columns, but I am not certain this
> >> > is the right way to do it. For example, if I have multiple columns in
> >> > the original record batch, do I need to iterate over the selection
> >> > vector multiple times to filter each column? Since this is a common
> >> > task, I expect there is an easy/efficient API to do this.
> >> >
> >> > Basically, I am looking for something like:
> >> > selection_vector = filter.evaluate(record_batch, pa.default_memory_pool())
> >> > filtered_column_arrays_in_record_batch = record_batch.filter(selection_vector) # what is the API for doing this?
> >> >
> >> > In the C++ test cases, the closest thing I found is constructing a
> >> > gandiva projector that uses the selection vector, but every test case
> >> > there requires the client to construct a gandiva expression to build
> >> > the projector (for example, in this test case, a {sum_expr} is used
> >> > for constructing the projector:
> >> > https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc#L121
> >> > ).
> >> >
> >> > I wonder whether the filtering can be done without creating a
> >> > projection expression. And if a projector is expected to be used for
> >> > this, what projector expressions should I use if I want to keep all
> >> > the columns as they are, but with some rows filtered out based on the
> >> > criteria given?
> >> >
> >> > On Sun, Apr 5, 2020 at 7:27 AM Wes McKinney <[email protected]> wrote:
> >> >>
> >> >> You can see an example of filtering via the Python bindings
> >> >>
> >> >> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L89
> >> >>
> >> >> This creates a gandiva::Filter using gandiva::Filter::Make, which can
> >> >> be used to filter a RecordBatch
> >> >>
> >> >> Is this what you need?
> >> >>
> >> >> On Fri, Apr 3, 2020 at 7:12 PM Yue Ni <[email protected]> wrote:
> >> >> >
> >> >> > Hi there,
> >> >> >
> >> >> > I am using the gandiva C++ library to process RecordBatches. I
> >> >> > would like to know how I can apply gandiva::Filter to a RecordBatch
> >> >> > so that I can do some filtering without using the projector.
> >> >> >
> >> >> > Since I couldn't find any documentation for it, I read some of the
> >> >> > source code, and here are the test cases I found that use it:
> >> >> > 1) https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_test.cc
> >> >> > 2) https://github.com/apache/arrow/blob/967728fe4654e5d53bc0789e64e5a9ba7f27f263/cpp/src/gandiva/tests/filter_project_test.cc
> >> >> >
> >> >> > From my reading, it is possible to get a SelectionVector by using
> >> >> > the gandiva::Filter, and you can then use the SelectionVector with
> >> >> > the gandiva::Projector to filter a RecordBatch while doing
> >> >> > projection. My questions are:
> >> >> > 1) If I don't want to do any projection but simply filter, what is
> >> >> > the recommended way to do it?
> >> >> > 2) I am trying to handle a case like "SELECT * FROM table WHERE
> >> >> > blah". Is it recommended to apply filtering without projection in
> >> >> > this case, or is there an alternative approach?
> >> >> >
> >> >> > Thanks.
> >> >> >
> >> >> > Regards,
> >> >> > Yue
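
For readers following the thread, here is a minimal, dependency-free sketch of the idea discussed above: a filter yields a selection vector of row indices, and a take-style gather applies that one vector to every column. It does not use the Arrow or Gandiva APIs. The `Batch` alias and the `evaluate_filter`, `take`, and `project_sum` helpers are hypothetical stand-ins written only for illustration, with plain Python lists in place of Arrow arrays. It also shows that the two orderings asked about (take then project, versus project with the selection vector) select the same rows; it says nothing about their relative performance.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical stand-in for a record batch: column name -> list of values.
Batch = Dict[str, List[float]]


def evaluate_filter(batch: Batch, predicate: Callable[[Dict[str, float]], bool]) -> List[int]:
    """Return a selection vector: the indices of rows that satisfy the predicate."""
    num_rows = len(next(iter(batch.values()))) if batch else 0
    selection = []
    for i in range(num_rows):
        row = {name: column[i] for name, column in batch.items()}
        if predicate(row):
            selection.append(i)
    return selection


def take(batch: Batch, selection: Sequence[int]) -> Batch:
    """Gather the selected rows from every column, reusing one selection vector."""
    return {name: [column[i] for i in selection] for name, column in batch.items()}


def project_sum(batch: Batch, selection: Optional[Sequence[int]] = None) -> List[float]:
    """Compute a + b, optionally only for the rows listed in the selection vector."""
    rows = range(len(batch["a"])) if selection is None else selection
    return [batch["a"][i] + batch["b"][i] for i in rows]


batch = {"a": [1.0, 31.0, 46.0, 3.0], "b": [5.0, 6.0, 7.0, 8.0]}

# The filter is evaluated once; the resulting indices are reused for all columns.
selection = evaluate_filter(batch, lambda row: row["a"] < 10.0)
filtered = take(batch, selection)

# Approach 1: take first, then project the already-filtered batch.
two_step = project_sum(filtered)
# Approach 2: project the original batch while applying the selection vector.
fused = project_sum(batch, selection)

print(selection)
print(filtered)
print(two_step == fused)
```

In this sketch the selection vector is computed once and indexed into each column separately, which is the per-column gather the thread attributes to arrow::compute::Take.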
