Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
AlenkaF commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2939240565 I have seen the same Conbench errors on other PRs (https://github.com/apache/arrow/pull/46638#issuecomment-2933359946). It is getting a bit annoying, will try to take time to look into it =) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
conbench-apache-arrow[bot] commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2938633491 After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 94e3b3e98fe75a7498f2ee6edaeb006146d5cbe3. There were 73 benchmark results with an error: - Commit Run on `arm64-t4g-2xlarge-linux` at [2025-06-03 14:18:42Z](https://conbench.ursa.dev/compare/runs/6a3243d72f334c188fa7cd7caeee7509...627e9c74d8f54e5ba09e256f2254f64f/) - [`tpch` (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1](https://conbench.ursa.dev/benchmark-results/0683f177992071e980001257b908864f) - [`tpch` (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1](https://conbench.ursa.dev/benchmark-results/0683f182f08b757d800025f20b26dbea) - and 71 more (see the report linked below) There were no benchmark performance regressions. 🎉 The [full Conbench report](https://github.com/apache/arrow/runs/43438036681) has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2936018308 Thank you all! @zanmato1984 @raulcd @AlenkaF -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
raulcd commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2936102562 Thank you for the contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
AlenkaF merged PR #46566: URL: https://github.com/apache/arrow/pull/46566 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
raulcd commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934441651 CI failures for `example-python-minimal-*` are unrelated and due to: - https://github.com/apache/arrow/issues/44803 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
github-actions[bot] commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934268220 :warning: GitHub issue #46572 **has been automatically assigned in GitHub** to PR creator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
github-actions[bot] commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934285042 Revision: 889c98b9dd3e66876b6a4083e88d7b73566cb72d Submitted crossbow builds: [ursacomputing/crossbow @ actions-e8f44d6eca](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e8f44d6eca) |Task|Status| ||--| |example-python-minimal-build-fedora-conda|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319990/job/43370190114)| |example-python-minimal-build-ubuntu-venv|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320028/job/43370190601)| |test-conda-python-3.10|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320516/job/43370192657)| |test-conda-python-3.10-hdfs-2.9.2|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320303/job/43370191500)| |test-conda-python-3.10-hdfs-3.2.1|[](https://github.com/ursacomputing/crossbow/actions/runs/15413321032/job/43370195655)| |test-conda-python-3.10-pandas-latest-numpy-latest|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319050/job/43370186378)| |test-conda-python-3.11|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319826/job/43370189215)| |test-conda-python-3.11-dask-latest|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320254/job/43370191450)| |test-conda-python-3.11-dask-upstream_devel|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320047/job/43370190767)| |test-conda-python-3.11-hypothesis|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319622/job/43370188522)| |test-conda-python-3.11-pandas-latest-numpy-1.26|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320381/job/43370192174)| |test-conda-python-3.11-pandas-latest-numpy-latest|[](https://github.com/ursacomputing/crossbow/actions/runs/15413321099/job/43370196206)| |test-conda-python-3.11-pandas-nightly-numpy-nightly|[](https://github.com/ursacomputing/crossbow/actions/runs/15413320616/job/43370193160)| |test-conda-python-3.11-pandas-upstream_devel-numpy-nightly|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319844/job/43370189264)| |test-conda-python-3.11-spark-master|[](https://github.com/ursacomputing/crossbow/actions/runs/15413319869/job/43370190032)| |test-cond
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
AlenkaF commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2934275431 @github-actions crossbow submit -g python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2931544525 > Do we want to be consistent and call the new argument `filter_expression` instead of `expression` as we do on `FilterOptions`? See docstring there: Just updated the code as suggested name `filter_expression`. Please take a look when you have time @raulcd @AlenkaF , thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
AlenkaF commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2930373608 > Do we want to be consistent and call the new argument `filter_expression` instead of `expression` as we do on `FilterOptions`? I think that would make sense, yes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
raulcd commented on code in PR #46566: URL: https://github.com/apache/arrow/pull/46566#discussion_r2120346861 ## python/pyarrow/_acero.pyx: ## @@ -373,15 +379,17 @@ class HashJoinNodeOptions(_HashJoinNodeOptions): output_suffix_for_right : str Suffix added to names of output fields coming from right input, see `output_suffix_for_left` for details. +filter: Expression Review Comment: ```suggestion filter: pyarrow.compute.Expression ``` ## python/pyarrow/table.pxi: ## @@ -5665,6 +5665,8 @@ cdef class Table(_Tabular): in the join result. use_threads : bool, default True Whether to use multithreading or not. +filter: Expression Review Comment: ```suggestion filter: pyarrow.compute.Expression ``` ## python/pyarrow/acero.py: ## @@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys, in the join result. output_type: Table or InMemoryDataset The output type for the exec plan result. +filter: Expression Review Comment: ```suggestion filter: pyarrow.compute.Expression ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2924118414 > I've approved the PR in terms of its functionality. I think we need another +1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality of course) since I'm no python expert. Thanks! @zanmato1984 Really appreciated it! I will wait for other approvals. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2922945573 I've approved the PR in terms of its functionality. I think we need another +1 from @AlenkaF @raulcd @pitrou in terms of python (or functionality of course) since I'm no python expert. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2116101132
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,86 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_residual_filter():
+left = pa.table({'key': [1, 2, 3], 'a': [4, 5, 6]})
+left_source = Declaration("table_source",
options=TableSourceNodeOptions(left))
+right = pa.table({'key': [2, 3, 4], 'b': [4, 5, 6]})
+right_source = Declaration("table_source",
options=TableSourceNodeOptions(right))
+
+join_opts = HashJoinNodeOptions(
+"inner", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field('a'), 5))
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2], [5], [2], [4]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+# test filter expression referencing columns from both side
+join_opts = HashJoinNodeOptions(
+"left outer", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field("a"), 5) | pc.equal(pc.field("b"), 10)
+)
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2, 1, 3], [5, 4, 6], [2, None, None], [4, None, None]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+left = pa.table({'l1': [1], 'l_str': ["alpha"]})
+left_source = Declaration("table_source",
options=TableSourceNodeOptions(left))
+right = pa.table({'r1': [1], 'r_str': ["alpha"]})
+right_source = Declaration("table_source",
options=TableSourceNodeOptions(right))
+
+# test with always true
Review Comment:
I used `pc.scalar(True/False)` and let me know if it makes sense to you.
Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115262361
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,86 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_residual_filter():
+left = pa.table({'key': [1, 2, 3], 'a': [4, 5, 6]})
+left_source = Declaration("table_source",
options=TableSourceNodeOptions(left))
+right = pa.table({'key': [2, 3, 4], 'b': [4, 5, 6]})
+right_source = Declaration("table_source",
options=TableSourceNodeOptions(right))
+
+join_opts = HashJoinNodeOptions(
+"inner", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field('a'), 5))
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2], [5], [2], [4]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+# test filter expression referencing columns from both side
+join_opts = HashJoinNodeOptions(
+"left outer", left_keys="key", right_keys="key",
+filter=pc.equal(pc.field("a"), 5) | pc.equal(pc.field("b"), 10)
+)
+joined = Declaration(
+"hashjoin", options=join_opts, inputs=[left_source, right_source])
+result = joined.to_table()
+expected = pa.table(
+[[2, 1, 3], [5, 4, 6], [2, None, None], [4, None, None]],
+names=["key", "a", "key", "b"])
+assert result.equals(expected)
+
+left = pa.table({'l1': [1], 'l_str': ["alpha"]})
+left_source = Declaration("table_source",
options=TableSourceNodeOptions(left))
+right = pa.table({'r1': [1], 'r_str': ["alpha"]})
+right_source = Declaration("table_source",
options=TableSourceNodeOptions(right))
+
+# test with always true
Review Comment:
This is not "always true", it's still depending on the actual data. I meant
something constant like `_true` or `1 + 1 = 2`.
And the "always false" too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921391424 > If you want to write a similar test case, let's just workaround the constraint and use unique column names. Thanks for confirming it! The tests I added for test `always true` and `always false` should cover this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921384026
> I see, so if I understand this correctly, ideally, we probably should
assign distinct key for both columns before using filter expression since
output_suffix_for_left would only works for output at the end of the workflow,
right? (sorry if this is a dumb question...) i.e., something like this won't
work
>
> ```python
> join_opts = HashJoinNodeOptions(
> "inner", left_keys="key", right_keys="key",
> output_suffix_for_left="_left",output_suffix_for_right="_right",
> filter=pc.equal(pc.field('key_left'), 2)) # < will
hit key not found in both schemas.
> joined = Declaration(
> "hashjoin", options=join_opts, inputs=[left_source, right_source])
> result = joined.to_table()
> ```
Sorry I made a mistake. You are right about this. Thanks for clarifying.
If you want to write a similar test case, let's just workaround the
constraint and use unique column names.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2921331804
> This is an independent problem. Because join is concatenating columns from
both sides, so it is possible that the result table contains columns with the
same name. If so, you won't be able to further reference a such column without
ambiguity. You can specify output_suffix_for_left/right to append unique
identifiers to their column names, so that you can disambiguate them.
I see, so if I understand this correctly, ideally, we probably should assign
distinct key for both columns before using filter expression since
output_suffix_for_left would only works for output at the end of the workflow,
right? (sorry if this is a dumb question...) i.e., something like this won't
work
```python3
join_opts = HashJoinNodeOptions(
"inner", left_keys="key", right_keys="key",
output_suffix_for_left="_left",output_suffix_for_right="_right",
filter=pc.equal(pc.field('key_left'), 2)) # < will
hit key not found in both schemas.
joined = Declaration(
"hashjoin", options=join_opts, inputs=[left_source, right_source])
result = joined.to_table()
```
if we don't use filter at all, we are ok with same column, and we can use
output_suffix_for_left to help for the output only. @zanmato1984
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115200396
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_filter():
Review Comment:
I just added these tests, please take another look. Thanks @zanmato1984
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on code in PR #46566: URL: https://github.com/apache/arrow/pull/46566#discussion_r211516 ## python/pyarrow/_acero.pyx: ## @@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions): def _set_options( self, join_type, left_keys, right_keys, left_output=None, right_output=None, -output_suffix_for_left="", output_suffix_for_right="", +output_suffix_for_left="", output_suffix_for_right="", Expression filter_expression=None, Review Comment: sounds good! just updated them -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115199593
##
python/pyarrow/acero.py:
##
@@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys,
in the join result.
output_type: Table or InMemoryDataset
The output type for the exec plan result.
+filter_expression: Expression
+Expression that will be used during join operation.
Review Comment:
Done
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_filter():
Review Comment:
done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on code in PR #46566: URL: https://github.com/apache/arrow/pull/46566#discussion_r2115110646 ## python/pyarrow/_acero.pyx: ## @@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions): def _set_options( self, join_type, left_keys, right_keys, left_output=None, right_output=None, -output_suffix_for_left="", output_suffix_for_right="", +output_suffix_for_left="", output_suffix_for_right="", Expression filter_expression=None, Review Comment: I think we can simply call it (and other places passing through this parameter) `filter` (which also matches C++'s naming). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on code in PR #46566:
URL: https://github.com/apache/arrow/pull/46566#discussion_r2115109420
##
python/pyarrow/acero.py:
##
@@ -114,6 +114,8 @@ def _perform_join(join_type, left_operand, left_keys,
in the join result.
output_type: Table or InMemoryDataset
The output type for the exec plan result.
+filter_expression: Expression
+Expression that will be used during join operation.
Review Comment:
```suggestion
Residual filter which is applied to matching row.
```
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_filter():
Review Comment:
```suggestion
def test_hash_join_with_residual_filter():
```
##
python/pyarrow/tests/test_acero.py:
##
@@ -300,6 +300,37 @@ def test_order_by():
_ = OrderByNodeOptions([("b", "ascending")], null_placement="start")
+def test_hash_join_with_filter():
Review Comment:
We can add cases with special filters like `true`, `false`, expression
referencing columns from both sides.
##
python/pyarrow/table.pxi:
##
@@ -5665,6 +5665,8 @@ cdef class Table(_Tabular):
in the join result.
use_threads : bool, default True
Whether to use multithreading or not.
+filter_expression: Expression
+Expression that will be used during join operation.
Review Comment:
```suggestion
Residual filter which is applied to matching row.
```
##
python/pyarrow/_acero.pyx:
##
@@ -273,14 +273,15 @@ cdef class _HashJoinNodeOptions(ExecNodeOptions):
def _set_options(
self, join_type, left_keys, right_keys, left_output=None,
right_output=None,
-output_suffix_for_left="", output_suffix_for_right="",
+output_suffix_for_left="", output_suffix_for_right="", Expression
filter_expression=None,
Review Comment:
I think we can simply call it (and other places passing through this
parameter) `filter`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920972318 > * inner join will kee Thanks @zanmato1984 for your explanation, it makes sense. probably I should mention more details in function docstring for this usage then. at same time, feel free to review the changes since it just exposes what c++ does for python. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920698129 If my above comment addresses your concern, I'll in turn review the code. Thank you @xingyu-long . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920686680
Thank you @xingyu-long for contributing this!
I'd first address your concern of:
> it seems we didn't return empty, instead, we return the `right outer`? it
seems the join type takes higher priority than filter operation for the final
result?
>
> btw, it seems fine with inner join type.
Yes, this is expected by SQL semantic. And this is also the difference
between you put an expression within `ON` condition of `JOIN` and that within
`WHERE` clause, e.g.,
```FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y```
does not equal to
```FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y```
(They are equivalent ONLY for inner joins.)
This is quite understandable because otherwise you wouldn't need most of
join types except inner :)
Conceptually, all subexpressions in `ON` condition are equally contributing
to determine if two rows from each side are a "match" (the whole expression
evaluates `true`) or a "non-match" (the whole expression evaluates `null` or
`false`). It's just that in practice, most query engines do hash join that
requires at least one equal condition with columns from both sides, and for
such conditions the columns are used as join "key"s (in your case the join key
is implicitly specified by columns with common name). The rest of the
expression is normally treated as so-called "residual filter" (this is what
your PR added). Now back to the "conceptually", depending on the join type
(inner/left outer/right outer/etc), rows are then processed differently. Take
inner and left outer as two examples:
* inner join will keep all the columns from both sides for a match, and
discard the entire row for a non-match - this is the same as if you do the
filter on the table scan first than apply join.
* left outer join will always keep the left side columns, and keep the right
side columns as well for a match, or discard the right side columns (by filling
`null`s) for a non-match (but this row is still emitted in the join result).
> for example, let's assume that we have two tables which have some common
fields (`id` and `name`), and we'd like to join them by id and then filter name
with certain pattern. so without exposing this API to python, we probably need
to maintain a big intermediate state of temp join and then apply the filter on
top of it.
Yes this is necessary for preserving the SQL-like join semantic - as long as
you write the filter in the `ON` condition. Again, the filter support you are
adding is the "residual filter" (the subexpressions other than join keys in
`ON` condition), not a regular "filter".
> but if we can apply the filter on both tables first before we joining two
tables, it would be more efficient? that's why I'd like to confirm what's the
expected behavior for this filter in c++ implementation.
In this case you can just do the filter ahead of join, e.g.,
```
t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200)
t1.join(t2_filtered, 'id', join_type="right outer")
```
As long as it is what you needed.
> 1. it seems filter cannot apply for both side, i.e same field for both
table/schema
This is an independent problem. Because join is concatenating columns from
both sides, so it is possible that the result table contains columns with the
same name. If so, you won't be able to further reference a such column without
ambiguity. You can specify `output_suffix_for_left/right` to append unique
identifiers to their column names, so that you can disambiguate them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
raulcd commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2918696771 I am no expert on this area. I agree with you that the the test you shared seems to return an unexpected behaviour. I would expect the filter to be correctly applied. Having said that I don't think the issue is coming from the code you have linked on `acero/hash_join_node.cc::CollectFilterColumns`, from my understanding this is the expected behavior. This isn't checking whether there are repeated fields on both schemas, is checking whether the filter field is in both schemas in order to avoid ambiguous filter expressions. cc @zanmato1984 @pitrou which have more knowledge around this and can help understand it better and can validate whether the test is related to a possible bug on right outer join. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
xingyu-long commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2915131235
Thanks for the suggestion @AlenkaF @raulcd I just updated the code.
btw, I observed two things while I am writing tests for this matter
1. it seems filter cannot apply for both side, i.e same field for both
table/schema, this was implemented in c++ side
https://github.com/apache/arrow/blob/0c1896854c060afb1754f2817f5a8821efd4ebcb/cpp/src/arrow/acero/hash_join_node.cc#L449-L451
is this intended behavior?
for example, let's assume that we have two tables which have same schema
(`id` and `name`), and we'd like to join them by id and then filter name with
certain pattern. so without exposing this API to python, we probably need to
maintain a big intermediate state of temp join and then apply the filter on top
of it.
but if we can apply the filter on both tables first before we joining two
tables, it would be more efficient? that's why I'd like to confirm what's the
expected behavior for this filter in c++ implementation.
2. I tried to exercise join with filter, I saw following surprise. (assuming
we use filter only on one side)
```python3
In [54]: import pandas as pd
...: import pyarrow as pa
...: df1 = pd.DataFrame({'id': [1, 2, 3],
...: 'year': [2020, 2022, 2019]})
...: df2 = pd.DataFrame({'id': [3, 4],
...: 'n_legs': [5, 100],
...: 'animal': ["Brittle stars", "Centipede"]})
...: t1 = pa.Table.from_pandas(df1)
...: t2 = pa.Table.from_pandas(df2)
In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
Out[55]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
year: [[2019,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]
# and then we apply filter expression with intended mismatch here
In [56]: t1.join(t2, 'id', join_type="right outer",
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[56]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
year: [[null,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]
```
it seems we didn't return empty, instead, we return the `right outer`? it
seems the join type takes higher priority than filter operation for the final
result?
btw, it seems fine with inner join type.
```python3
In [57]: t1.join(t2, 'id', join_type="inner",
filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[57]:
pyarrow.Table
id: int64
year: int64
n_legs: int64
animal: string
id: []
year: []
n_legs: []
animal: []
```
this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd could
you provide some feedback on these two questions? Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]
AlenkaF commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2908783533 Thanks for opening this issue! I've marked the PR as a draft and updated the title. Regarding the call in `Table.join`: I would suggest placing the new keyword argument at the end of the list — this helps preserve consistency and avoids breaking any assumptions about argument order. Also, please make sure to connect it with `acero.py` and `_acero.pyx`. CC: @raulcd for any additional thoughts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
