Thank you very much; your reply was very helpful. I have one more question. 
Since our data is actually stored in S3, can we set the projection during the 
scan? My understanding is that the scan would then fetch only the columns we 
need from S3 instead of reading the entire file, which would greatly reduce 
network bandwidth usage. Or have I misunderstood, and a project applied after 
the scan also reads only the required columns?




1057445597
1057445...@qq.com

------------------ Original Message ------------------
From: "user" <weston.p...@gmail.com>
Sent: Sep 21, 2022, 9:01
To: "user" <user@arrow.apache.org>
Subject: Re: [c++][compute]Is there any other way to use Join besides Acero?

Thanks for the detailed reproducer. I've added a few notes on the JIRA 
that I hope will help.

On Tue, Sep 20, 2022, 5:10 AM 1057445597 <1057445...@qq.com> wrote:

I re-uploaded a copy of the code, which compiles and runs, as join_test.zip. 
It includes cmakelists.txt, the test data files, and the Python code that 
generated the test files. There is also Python code to view the data files. 
You will need to build Arrow 9.0 yourself.




1057445597
1057445...@qq.com

------------------ Original Message ------------------
From: "user" <1057445...@qq.com>
Sent: Sep 15, 2022, 10:27
To: "user" <user@arrow.apache.org>
Subject: Re: [c++][compute]Is there any other way to use Join besides Acero?

Here is the JIRA:

https://issues.apache.org/jira/browse/ARROW-17740


1057445597
1057445...@qq.com

------------------ Original Message ------------------
From: "user" <weston.p...@gmail.com>
Sent: Sep 15, 2022, 12:15
To: "user" <user@arrow.apache.org>
Subject: Re: [c++][compute]Is there any other way to use Join besides Acero?

Within Arrow C++ that is the only way I am aware of. You might be able to 
use DuckDB. It should be able to scan Parquet files.

Is this the same program that you shared before? Were you able to figure 
out threading? Can you create a JIRA with some sample input files and a 
reproducible example?


On Wed, Sep 14, 2022 at 5:14 PM 1057445597 <1057445...@qq.com> wrote:

Acero performs poorly, and core dumps occur frequently.


In the scenario I'm working on, I read one Parquet file and then several 
other Parquet files. All of these files share a column named UUID. I need to 
join the results of the two reads on UUID, project (drop UUID), and filter 
(some custom filtering). As far as I could find, Acero is the only way to do 
the join, but when I tested it, performance was very poor and it was very 
unstable, with frequent core dumps. Is there another way to do this, or at 
least another way to do the join?
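
For readers following the thread, the operation described above (inner join on 
UUID, drop the UUID column, apply a custom filter) can be sketched in plain 
standard-library C++. This is only an illustration of the technique, a hash 
join with a build phase and a probe phase. It is not Acero's API or 
implementation and it does not read Parquet. The row types, their fields, and 
the function name JoinProjectFilter are hypothetical names chosen for the 
example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row types standing in for the two inputs (not Arrow types).
struct LeftRow { std::string uuid; double value; };
struct RightRow { std::string uuid; std::string label; };

// Output row: the join key (uuid) has been projected away.
struct JoinedRow { double value; std::string label; };

// Inner hash join on uuid, followed by a projection that drops uuid and a
// caller-supplied filter. Rows without a match on the other side are dropped.
std::vector<JoinedRow> JoinProjectFilter(
    const std::vector<LeftRow>& left,
    const std::vector<RightRow>& right,
    const std::function<bool(const JoinedRow&)>& keep) {
  // Build phase: index the right-hand rows by uuid. A multimap is used so
  // that duplicate uuids on the right produce one output row per match.
  std::unordered_multimap<std::string, const RightRow*> index;
  index.reserve(right.size());
  for (const RightRow& r : right) {
    index.emplace(r.uuid, &r);
  }

  // Probe phase: look up each left-hand row, project, then filter.
  std::vector<JoinedRow> out;
  for (const LeftRow& l : left) {
    auto range = index.equal_range(l.uuid);
    for (auto it = range.first; it != range.second; ++it) {
      JoinedRow row{l.value, it->second->label};  // uuid is not copied
      if (keep(row)) {
        out.push_back(std::move(row));
      }
    }
  }
  return out;
}
```

The filter is passed in as a callable so that the "custom filtering" can be 
any predicate over the joined row, for example 
`[](const JoinedRow& r) { return r.value > 1.5; }`.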







1057445597
1057445...@qq.com
