Re: [Julia][Python] trouble with loading parquet files from interfaces in different languages

Kazunori Akiyama Wed, 22 Mar 2023 11:48:44 -0700

Thanks for your prompt response! I will explore more and consider making a 
request if it turns out to be ideal to have a more efficient loader.


- Kazu

> On Mar 22, 2023, at 12:36 PM, Bryce Mecum <[email protected]> wrote:
> 
> Yes, I think you're correct that there isn't another way to do the 
> conversion. A more efficient conversion may be in scope for the project so 
> you might consider opening a GitHub Issue [1] to discuss further. You may 
> also find this past discussion [2] interesting.
> 
> [1] https://github.com/apache/arrow-julia/issues
> [2] https://github.com/apache/arrow-julia/issues/227
> On Wed, Mar 22, 2023 at 6:11 AM Kazunori Akiyama <[email protected]> wrote:
> Hi Bryce,
> 
> This clarifies a lot — I was indeed confused regarding formats. The reference 
> [5] was really helpful to clarify my confusion. 
> 
> Let me ask one more question regarding Julia interfaces before closing the 
> thread. So does this mean that we don’t have a function that loads parquet 
> files into the Julia implementation of the Arrow in-memory format? Looks like 
> the only way is converting it to the IPC format using Parquet.jl and Arrow.jl 
> and reload it. Am I correct? 
> 
> Like:
> # convert a Parquet file into the Arrow IPC file format
> using Arrow, Parquet
> tab = read_parquet("blah.parquet")
> Arrow.write("blah.arrow", tab)
> 
> # reload it as an Arrow table
> tab2 = Arrow.Table("blah.arrow")
> 
> - Kazu
> 
>> On Mar 21, 2023, at 6:40 PM, Bryce Mecum <[email protected]> wrote:
>> 
>> Hi Kazu, from the description of what behavior you're seeing and the code 
>> you've provided, it looks like you may be mixing up the two file formats 
>> (Arrow IPC and Parquet) in your code. Your Julia code looks like it's using 
>> the Arrow IPC file format whereas your Python code looks like it's using the 
>> Parquet file format.
>> 
>> If you want to use Parquet to share data:
>> 
>> - In Julia: Use the Parquet package and its read_parquet and write_parquet 
>> functions [1]
>> - In Python: Use pyarrow.parquet module and its read_table and write_table 
>> methods [2]
>> 
>> If you want to use Arrow IPC to share data:
>> 
>> - In Julia: Use the Arrow package and its Arrow.Table and Arrow.write 
>> methods [3]
>> - In Python: Use the pyarrow package and the IPC readers and writers [4] 
>> 
>> Additionally, there is a FAQ [5] on the Apache Arrow website about formats 
>> that you may find relevant.
>> 
>> [1] https://github.com/JuliaIO/Parquet.jl
>> [2] https://arrow.apache.org/docs/python/parquet.html
>> [3] https://arrow.juliadata.org/dev/manual/#User-Manual
>> [4] https://arrow.apache.org/docs/python/ipc.html
>> [5] https://arrow.apache.org/faq/#what-about-arrow-files-then
>> On Tue, Mar 21, 2023 at 12:00 PM Kazunori Akiyama <[email protected]> wrote:
>> Hello,
>> 
>> I’m a radio astronomer working for the Event Horizon Telescope 
>> <https://eventhorizontelescope.org/> project. We are interested in Apache 
>> Arrow for our next-generation data format, as other radio astronomy groups 
>> have started to develop a new Arrow-based data format 
>> <https://github.com/ratt-ru/casa-arrow>. We are currently developing major 
>> software ecosystems in Julia and Python, and would like to test data I/O 
>> interfaces with Arrow.jl and pyarrow.
>> 
>> I’m writing this e-mail because I ran into some issues loading Arrow table 
>> data created in a different language. We did a very simple check: creating 
>> Arrow tables in Python and Julia, and loading them in the other language 
>> (i.e. Julia and Python respectively). While we confirmed that pyarrow and 
>> Arrow.jl can each read the parquet files they generate themselves, neither 
>> can load the parquet files written by the other. For instance, we found:
>> 
>> - pyarrow can’t read a table written by the Arrow.write method of Julia’s 
>>   Arrow.jl. It returns `ArrowInvalid: Could not open Parquet input source 
>>   'FILENAME': Parquet magic bytes not found in footer. Either the file is 
>>   corrupted or this is not a parquet file.`
>> - Arrow.jl can’t read a table from pyarrow. It doesn’t show any errors, but 
>>   the loaded table is completely empty and doesn’t have any rows or columns.
>> 
>> I have attached Julia and Python scripts that create parquet files of a very 
>> simple single-column table (juliadf.parquet from Julia, pandasdf.parquet 
>> from Python). pyarrow.parquet.read_table doesn’t work for juliadf.parquet, 
>> and Arrow.Table doesn’t work for pandasdf.parquet. I have also attached 
>> Python’s pip freeze file and Julia’s toml files in case you want to see 
>> my Python and Julia environments.
>> 
>> As this is a very primitive test, I’m pretty sure I made some simple 
>> mistake here. What am I missing? Let me know how I should handle parquet 
>> files from interfaces in different languages.
>> 
>> Thanks,
>> Kazu
>> 
>> 
>
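
The format mix-up diagnosed in this thread can be checked without either library, because both formats mark their files with magic bytes: Parquet files begin and end with the 4 bytes `PAR1`, and Arrow IPC files begin and end with the 6 bytes `ARROW1`. The sketch below is only an illustration under those two assumptions; the helper name `sniff_format` is hypothetical and is not part of pyarrow, Arrow.jl, or Parquet.jl. The Arrow IPC streaming format carries no magic bytes, so a stream is reported as "unknown".

```python
import os
import sys

PARQUET_MAGIC = b"PAR1"       # Parquet files start and end with these 4 bytes
ARROW_FILE_MAGIC = b"ARROW1"  # Arrow IPC files start and end with these 6 bytes


def sniff_format(path):
    """Guess a file's on-disk format from its magic bytes (hypothetical helper)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(6)              # first 6 bytes (fewer if the file is tiny)
        f.seek(max(size - 6, 0))
        tail = f.read(6)              # last 6 bytes
    if size >= 8 and head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if size >= 12 and head == ARROW_FILE_MAGIC and tail == ARROW_FILE_MAGIC:
        return "arrow-ipc-file"
    return "unknown"                  # includes Arrow IPC streams, which have no magic


if __name__ == "__main__":
    # Print the guessed format of each file named on the command line.
    for name in sys.argv[1:]:
        print(name, sniff_format(name))
```

Run against the two attachments from the original message, a check like this would presumably report juliadf.parquet as an Arrow IPC file despite its extension, which matches the "Parquet magic bytes not found in footer" error from pyarrow.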
