Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-04 Thread via GitHub


AlenkaF commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2939240565

   I have seen the same Conbench errors on other PRs 
(https://github.com/apache/arrow/pull/46638#issuecomment-2933359946). It is 
getting a bit annoying, will try to take time to look into it =)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2938633491

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have 
been run so far on merge-commit 94e3b3e98fe75a7498f2ee6edaeb006146d5cbe3.
   
   There were 73 benchmark results with an error:
   
   - Commit Run on `arm64-t4g-2xlarge-linux` at [2025-06-03 
14:18:42Z](https://conbench.ursa.dev/compare/runs/6a3243d72f334c188fa7cd7caeee7509...627e9c74d8f54e5ba09e256f2254f64f/)
 - [`tpch` (R) with engine=arrow, format=parquet, language=R, 
memory_map=False, query_id=TPCH-07, 
scale_factor=1](https://conbench.ursa.dev/benchmark-results/0683f177992071e980001257b908864f)
 - [`tpch` (R) with engine=arrow, format=native, language=R, 
memory_map=False, query_id=TPCH-09, 
scale_factor=1](https://conbench.ursa.dev/benchmark-results/0683f182f08b757d800025f20b26dbea)
   - and 71 more (see the report linked below)
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/43438036681) 
has more details. It also includes information about 6 possible false positives 
for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2936018308

   Thank you all!  @zanmato1984 @raulcd @AlenkaF 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


raulcd commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2936102562

   Thank you for the contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


AlenkaF merged PR #46566:
URL: https://github.com/apache/arrow/pull/46566


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


raulcd commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934441651

   CI failures for `example-python-minimal-*` are unrelated and due to:
   - https://github.com/apache/arrow/issues/44803


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


github-actions[bot] commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934268220

   :warning: GitHub issue #46572 **has been automatically assigned in GitHub** 
to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


github-actions[bot] commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934285042

   Revision: 889c98b9dd3e66876b6a4083e88d7b73566cb72d
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-e8f44d6eca](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e8f44d6eca)
   
   |Task|Status|
   ||--|
   |example-python-minimal-build-fedora-conda|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-example-python-minimal-build-fedora-conda)](https://github.com/ursacomputing/crossbow/actions/runs/15413319990/job/43370190114)|
   |example-python-minimal-build-ubuntu-venv|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-example-python-minimal-build-ubuntu-venv)](https://github.com/ursacomputing/crossbow/actions/runs/15413320028/job/43370190601)|
   |test-conda-python-3.10|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.10)](https://github.com/ursacomputing/crossbow/actions/runs/15413320516/job/43370192657)|
   |test-conda-python-3.10-hdfs-2.9.2|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/15413320303/job/43370191500)|
   |test-conda-python-3.10-hdfs-3.2.1|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/15413321032/job/43370195655)|
   |test-conda-python-3.10-pandas-latest-numpy-latest|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.10-pandas-latest-numpy-latest)](https://github.com/ursacomputing/crossbow/actions/runs/15413319050/job/43370186378)|
   |test-conda-python-3.11|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11)](https://github.com/ursacomputing/crossbow/actions/runs/15413319826/job/43370189215)|
   |test-conda-python-3.11-dask-latest|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/15413320254/job/43370191450)|
   |test-conda-python-3.11-dask-upstream_devel|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/15413320047/job/43370190767)|
   |test-conda-python-3.11-hypothesis|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-hypothesis)](https://github.com/ursacomputing/crossbow/actions/runs/15413319622/job/43370188522)|
   |test-conda-python-3.11-pandas-latest-numpy-1.26|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-pandas-latest-numpy-1.26)](https://github.com/ursacomputing/crossbow/actions/runs/15413320381/job/43370192174)|
   |test-conda-python-3.11-pandas-latest-numpy-latest|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-pandas-latest-numpy-latest)](https://github.com/ursacomputing/crossbow/actions/runs/15413321099/job/43370196206)|
   |test-conda-python-3.11-pandas-nightly-numpy-nightly|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-pandas-nightly-numpy-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/15413320616/job/43370193160)|
   |test-conda-python-3.11-pandas-upstream_devel-numpy-nightly|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-pandas-upstream_devel-numpy-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/15413319844/job/43370189264)|
   |test-conda-python-3.11-spark-master|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-e8f44d6eca-github-test-conda-python-3.11-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/15413319869/job/43370190032)|
   |test-cond

Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-03 Thread via GitHub


AlenkaF commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934275431

   @github-actions crossbow submit -g python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-02 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2931544525

   > Do we want to be consistent and call the new argument `filter_expression` 
instead of `expression` as we do on `FilterOptions`? See docstring there:
   
   Just updated the code as suggested name `filter_expression`. Please take a 
look when you have time @raulcd @AlenkaF , thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-02 Thread via GitHub


AlenkaF commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2930373608

   > Do we want to be consistent and call the new argument `filter_expression` 
instead of `expression` as we do on `FilterOptions`?
   
   I think that would make sense, yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-06-02 Thread via GitHub


raulcd commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2120346861


##
python/pyarrow/_acero.pyx:
##
@@ -373,15 +379,17 @@ class HashJoinNodeOptions(_HashJoinNodeOptions):
 output_suffix_for_right : str
 Suffix added to names of output fields coming from right input,
 see `output_suffix_for_left` for details.
+filter: Expression

Review Comment:
   ```suggestion
   filter: pyarrow.compute.Expression
   ```
   
   



##
python/pyarrow/table.pxi:
##
@@ -5665,6 +5665,8 @@ cdef class Table(_Tabular):
 in the join result.
 use_threads : bool, default True
 Whether to use multithreading or not.
+filter: Expression

Review Comment:
   ```suggestion
   filter: pyarrow.compute.Expression
   ```



##
python/pyarrow/acero.py:
##
@@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys,
 in the join result.
 output_type: Table or InMemoryDataset
 The output type for the exec plan result.
+filter: Expression

Review Comment:
   ```suggestion
   filter: pyarrow.compute.Expression
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-30 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2924118414

   > I've approved the PR in terms of its functionality. I think we need 
another +1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality 
of course) since I'm no python expert.
   
   Thanks!  @zanmato1984  Really appreciated it!
   
   I will wait for other approvals. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-30 Thread via GitHub


zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2922945573

   I've approved the PR in terms of its functionality. I think we need another 
+1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality of 
course) since I'm no python expert.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-30 Thread via GitHub


xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2116101132


##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,86 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_residual_filter():
+left = pa.table({'key': [1, 2, 3], 'a': [4, 5, 6]})
+left_source = Declaration("table_source", 
options=TableSourceNodeOptions(left))
+right = pa.table({'key': [2, 3, 4], 'b': [4, 5, 6]})
+right_source = Declaration("table_source", 
options=TableSourceNodeOptions(right))
+
+join_opts = HashJoinNodeOptions(
+"inner", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field('a'), 5))
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2], [5], [2], [4]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+# test filter expression referencing columns from both side
+join_opts = HashJoinNodeOptions(
+"left outer", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field("a"), 5) | pc.equal(pc.field("b"), 10)
+)
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2, 1, 3], [5, 4, 6], [2, None, None], [4, None, None]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+left = pa.table({'l1': [1], 'l_str': ["alpha"]})
+left_source = Declaration("table_source", 
options=TableSourceNodeOptions(left))
+right = pa.table({'r1': [1], 'r_str': ["alpha"]})
+right_source = Declaration("table_source", 
options=TableSourceNodeOptions(right))
+
+# test with always true

Review Comment:
   I used `pc.scalar(True/False)` and let me know if it makes sense to you. 
Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115262361


##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,86 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_residual_filter():
+left = pa.table({'key': [1, 2, 3], 'a': [4, 5, 6]})
+left_source = Declaration("table_source", 
options=TableSourceNodeOptions(left))
+right = pa.table({'key': [2, 3, 4], 'b': [4, 5, 6]})
+right_source = Declaration("table_source", 
options=TableSourceNodeOptions(right))
+
+join_opts = HashJoinNodeOptions(
+"inner", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field('a'), 5))
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2], [5], [2], [4]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+# test filter expression referencing columns from both side
+join_opts = HashJoinNodeOptions(
+"left outer", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field("a"), 5) | pc.equal(pc.field("b"), 10)
+)
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2, 1, 3], [5, 4, 6], [2, None, None], [4, None, None]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+left = pa.table({'l1': [1], 'l_str': ["alpha"]})
+left_source = Declaration("table_source", 
options=TableSourceNodeOptions(left))
+right = pa.table({'r1': [1], 'r_str': ["alpha"]})
+right_source = Declaration("table_source", 
options=TableSourceNodeOptions(right))
+
+# test with always true

Review Comment:
   This is not "always true", it's still depending on the actual data. I meant 
something constant like `_true` or `1 + 1 = 2`.
   
   And the "always false" too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921391424

   > If you want to write a similar test case, let's just workaround the 
constraint and use unique column names.
   
   Thanks for confirming it! The tests I added for test `always true` and 
`always false` should cover this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921384026

   > I see, so if I understand this correctly, ideally, we probably should 
assign distinct key for both columns before using filter expression since 
output_suffix_for_left would only works for output at the end of the workflow, 
right? (sorry if this is a dumb question...) i.e., something like this won't 
work
   > 
   > ```python
   > join_opts = HashJoinNodeOptions(
   > "inner", left_keys="key", right_keys="key",
   > output_suffix_for_left="_left",output_suffix_for_right="_right",
   > filter=pc.equal(pc.field('key_left'), 2)) # < will 
hit key not found in both schemas.
   > joined = Declaration(
   > "hashjoin", options=join_opts, inputs=[left_source, right_source])
   > result = joined.to_table()
   > ```
   
   Sorry I made a mistake. You are right about this. Thanks for clarifying.
   
   If you want to write a similar test case, let's just workaround the 
constraint and use unique column names.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921331804

   > This is an independent problem. Because join is concatenating columns from 
both sides, so it is possible that the result table contains columns with the 
same name. If so, you won't be able to further reference a such column without 
ambiguity. You can specify output_suffix_for_left/right to append unique 
identifiers to their column names, so that you can disambiguate them.
   
   I see, so if I understand this correctly, ideally, we probably should assign 
distinct key for both columns before using filter expression since 
output_suffix_for_left would only works for output at the end of the workflow, 
right?  (sorry if this is a dumb question...) i.e.,  something like this won't 
work
   
   ```python3
   join_opts = HashJoinNodeOptions(
   "inner", left_keys="key", right_keys="key",
   output_suffix_for_left="_left",output_suffix_for_right="_right",
   filter=pc.equal(pc.field('key_left'), 2)) # < will 
hit key not found in both schemas.
   joined = Declaration(
   "hashjoin", options=join_opts, inputs=[left_source, right_source])
   result = joined.to_table()
   ``` 
   
   if we don't use filter at all, we are ok with same column, and we can use 
output_suffix_for_left to help for the output only. @zanmato1984 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115200396


##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_filter():

Review Comment:
   I just added these tests, please take another look. Thanks @zanmato1984 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r211516


##
python/pyarrow/_acero.pyx:
##
@@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions):
 
 def _set_options(
 self, join_type, left_keys, right_keys, left_output=None, 
right_output=None,
-output_suffix_for_left="", output_suffix_for_right="",
+output_suffix_for_left="", output_suffix_for_right="", Expression 
filter_expression=None,

Review Comment:
   sounds good! just updated them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115199593


##
python/pyarrow/acero.py:
##
@@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys,
 in the join result.
 output_type: Table or InMemoryDataset
 The output type for the exec plan result.
+filter_expression: Expression
+Expression that will be used during join operation.

Review Comment:
   Done



##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_filter():

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115110646


##
python/pyarrow/_acero.pyx:
##
@@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions):
 
 def _set_options(
 self, join_type, left_keys, right_keys, left_output=None, 
right_output=None,
-output_suffix_for_left="", output_suffix_for_right="",
+output_suffix_for_left="", output_suffix_for_right="", Expression 
filter_expression=None,

Review Comment:
   I think we can simply call it (and other places passing through this 
parameter) `filter` (which also matches C++'s naming).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115109420


##
python/pyarrow/acero.py:
##
@@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys,
 in the join result.
 output_type: Table or InMemoryDataset
 The output type for the exec plan result.
+filter_expression: Expression
+Expression that will be used during join operation.

Review Comment:
   ```suggestion
   Residual filter which is applied to matching row.
   ```



##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_filter():

Review Comment:
   ```suggestion
   def test_hash_join_with_residual_filter():
   ```



##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
 _ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
 
 
+def test_hash_join_with_filter():

Review Comment:
   We can add cases with special filters like `true`, `false`, expression 
referencing columns from both sides.



##
python/pyarrow/table.pxi:
##
@@ -5665,6 +5665,8 @@ cdef class Table(_Tabular):
 in the join result.
 use_threads : bool, default True
 Whether to use multithreading or not.
+filter_expression: Expression
+Expression that will be used during join operation.

Review Comment:
   ```suggestion
   Residual filter which is applied to matching row.
   ```



##
python/pyarrow/_acero.pyx:
##
@@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions):
 
 def _set_options(
 self, join_type, left_keys, right_keys, left_output=None, 
right_output=None,
-output_suffix_for_left="", output_suffix_for_right="",
+output_suffix_for_left="", output_suffix_for_right="", Expression 
filter_expression=None,

Review Comment:
   I think we can simply call it (and other places passing through this 
parameter) `filter`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920972318

   > * inner join will kee
   
   Thanks @zanmato1984 for your explanation, it makes sense. probably I should 
mention more details in function docstring for this usage then. at same time, 
feel free to review the changes since it just exposes what c++ does for python.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920698129

   If my above comment addresses your concern, I'll in turn review the code. 
Thank you @xingyu-long .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920686680

   Thank you @xingyu-long for contributing this!
   
   I'd first address your concern of:
   > it seems we didn't return empty, instead, we return the `right outer`? it 
seems the join type takes higher priority than filter operation for the final 
result?
   > 
   > btw, it seems fine with inner join type.
   
   Yes, this is expected by SQL semantic. And this is also the difference 
between you put an expression within `ON` condition of `JOIN` and that within 
`WHERE` clause, e.g.,
   ```FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y```
   does not equal to
   ```FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y```
   (They are equivalent ONLY for inner joins.)
   This is quite understandable because otherwise you wouldn't need most of 
join types except inner :)
   
   Conceptually, all subexpressions in `ON` condition are equally contributing 
to determine if two rows from each side are a "match" (the whole expression 
evaluates `true`) or a "non-match" (the whole expression evaluates `null` or 
`false`). It's just that in practice, most query engines do hash join that 
requires at least one equal condition with columns from both sides, and for 
such conditions the columns are used as join "key"s (in your case the join key 
is implicitly specified by columns with common name). The rest of the 
expression is normally treated as so-called "residual filter" (this is what 
your PR added). Now back to the "conceptually", depending on the join type 
(inner/left outer/right outer/etc), rows are then processed differently. Take 
inner and left outer as two examples:
   * inner join will keep all the columns from both sides for a match, and 
discard the entire row for a non-match - this is the same as if you do the 
filter on the table scan first than apply join.
   * left outer join will always keep the left side columns, and keep the right 
side columns as well for a match, or discard the right side columns (by filling 
`null`s) for a non-match (but this row is still emitted in the join result).
   
   > for example, let's assume that we have two tables which have some common 
fields (`id` and `name`), and we'd like to join them by id and then filter name 
with certain pattern. so without exposing this API to python, we probably need 
to maintain a big intermediate state of temp join and then apply the filter on 
top of it.
   
   Yes this is necessary for preserving the SQL-like join semantic - as long as 
you write the filter in the `ON` condition. Again, the filter support you are 
adding is the "residual filter" (the subexpressions other than join keys in 
`ON` condition), not a regular "filter".
   
   > but if we can apply the filter on both tables first before we joining two 
tables, it would be more efficient? that's why I'd like to confirm what's the 
expected behavior for this filter in c++ implementation.
   
   In this case you can just do the filter ahead of join, e.g.,
   ```
   t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200)
   t1.join(t2_filtered, 'id', join_type="right outer")
   ```
   As long as it is what you needed.
   
   > 1. it seems filter cannot apply for both side, i.e same field for both 
table/schema
   
   This is an independent problem. Because join is concatenating columns from 
both sides, so it is possible that the result table contains columns with the 
same name. If so, you won't be able to further reference a such column without 
ambiguity. You can specify `output_suffix_for_left/right` to append unique 
identifiers to their column names, so that you can disambiguate them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-29 Thread via GitHub


raulcd commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2918696771

   I am no expert on this area.
   I agree with you that the the test you shared seems to return an unexpected 
behaviour. I would expect the filter to be correctly applied.
   Having said that I don't think the issue is coming from the code you have 
linked on `acero/hash_join_node.cc::CollectFilterColumns`, from my 
understanding this is the expected behavior. This isn't checking whether there 
are repeated fields on both schemas, is checking whether the filter field is in 
both schemas in order to avoid ambiguous filter expressions. cc @zanmato1984 
@pitrou which have more knowledge around this and can help understand it better 
and can validate whether the test is related to a possible bug on right outer 
join.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-27 Thread via GitHub


xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2915131235

   Thanks for the suggestion @AlenkaF @raulcd  I just updated the code.
   
   btw, I observed two things while I am writing tests for this matter
   
   1. it seems filter cannot apply for both side, i.e same field for both 
table/schema, this was implemented in c++ side
   
https://github.com/apache/arrow/blob/0c1896854c060afb1754f2817f5a8821efd4ebcb/cpp/src/arrow/acero/hash_join_node.cc#L449-L451
   
   is this intended behavior?  
   
   
   for example, let's assume that we have two tables which have same schema 
(`id` and `name`), and we'd like to join them by id and then filter name with 
certain pattern. so without exposing this API to python, we probably need to 
maintain a big intermediate state of temp join and then apply the filter on top 
of it. 
   
   
   but if we can apply the filter on both tables first before we joining two 
tables, it would be more efficient? that's why I'd like to confirm what's the 
expected behavior for this filter in c++ implementation. 
   
   
   
   2. I tried to exercise join with filter, I saw following surprise. (assuming 
we use filter only on one side)
   
   ```python3
   In [54]: import pandas as pd
   ...: import pyarrow as pa
   ...: df1 = pd.DataFrame({'id': [1, 2, 3],
   ...: 'year': [2020, 2022, 2019]})
   ...: df2 = pd.DataFrame({'id': [3, 4],
   ...: 'n_legs': [5, 100],
   ...: 'animal': ["Brittle stars", "Centipede"]})
   ...: t1 = pa.Table.from_pandas(df1)
   ...: t2 = pa.Table.from_pandas(df2)
   
   In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
   Out[55]:
   pyarrow.Table
   year: int64
   id: int64
   n_legs: int64
   animal: string
   
   year: [[2019,null]]
   id: [[3,4]]
   n_legs: [[5,100]]
   animal: [["Brittle stars","Centipede"]]
   
   # and then we apply filter expression with intended mismatch here
   In [56]: t1.join(t2, 'id', join_type="right outer", 
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
   Out[56]:
   pyarrow.Table
   year: int64
   id: int64
   n_legs: int64
   animal: string
   
   year: [[null,null]]
   id: [[3,4]]
   n_legs: [[5,100]]
   animal: [["Brittle stars","Centipede"]] 
   ```
   
   it seems we didn't return empty, instead, we return the `right outer`?  it 
seems the join type takes higher priority than filter   operation for the final 
result? 
   
   
   btw, it seems fine with inner join type.
   
   ```python3
   In [57]: t1.join(t2, 'id', join_type="inner", 
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
   Out[57]:
   pyarrow.Table
   id: int64
   year: int64
   n_legs: int64
   animal: string
   
   id: []
   year: []
   n_legs: []
   animal: []
   ```
   
   this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd  could 
you provide some feedback on these two questions? Thanks!
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

2025-05-26 Thread via GitHub


AlenkaF commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2908783533

   Thanks for opening this issue!
   
   I've marked the PR as a draft and updated the title.
   
   Regarding the call in `Table.join`: I would suggest placing the new keyword 
argument at the end of the list — this helps preserve consistency and avoids 
breaking any assumptions about argument order.
   
   Also, please make sure to connect it with `acero.py` and `_acero.pyx`.
   
   CC: @raulcd for any additional thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]