Re: [Julia][Python] trouble with loading parquet files from interfaces in different languages

Bryce Mecum Tue, 21 Mar 2023 15:40:43 -0700

Hi Kazu, from the description of what behavior you're seeing and the code
you've provided, it looks like you may be mixing up the two file formats
(Arrow IPC and Parquet) in your code. Your Julia code looks like it's using
the Arrow IPC file format whereas your Python code looks like it's using
the Parquet file format.
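
One quick way to check which format a file on disk actually uses is to look at its magic bytes. Below is a minimal sketch using only the Python standard library. It assumes that Parquet files begin and end with the 4 bytes "PAR1" and that Arrow IPC files (the "file" format, not the stream format) begin and end with "ARROW1". The function name sniff_format is made up for illustration.

```python
import os

PARQUET_MAGIC = b"PAR1"      # first and last 4 bytes of a Parquet file
ARROW_FILE_MAGIC = b"ARROW1"  # first and last 6 bytes of an Arrow IPC file


def sniff_format(path):
    """Guess the on-disk format from the magic bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(6)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if size >= 8 and head[:4] == PARQUET_MAGIC and tail[-4:] == PARQUET_MAGIC:
        return "parquet"
    if size >= 12 and head == ARROW_FILE_MAGIC and tail == ARROW_FILE_MAGIC:
        return "arrow-ipc-file"
    # The Arrow IPC *stream* format has no magic string, so it lands here too.
    return "unknown"
```

If this reports "arrow-ipc-file" for a file named with a .parquet extension, the extension is misleading and the Parquet reader will reject the file.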


If you want to use Parquet to share data:

- In Julia: Use the Parquet package and its read_parquet and write_parquet
functions [1]
- In Python: Use the pyarrow.parquet module and its read_table and
write_table functions [2]

If you want to use Arrow IPC to share data:

- In Julia: Use the Arrow package and its Arrow.Table and Arrow.write
functions [3]
- In Python: Use the pyarrow package and the IPC readers and writers [4]

Additionally, there is a FAQ [5] on the Apache Arrow website about formats
that you may find relevant.

[1] https://github.com/JuliaIO/Parquet.jl
[2] https://arrow.apache.org/docs/python/parquet.html
[3] https://arrow.juliadata.org/dev/manual/#User-Manual
[4] https://arrow.apache.org/docs/python/ipc.html
[5] https://arrow.apache.org/faq/#what-about-arrow-files-then

On Tue, Mar 21, 2023 at 12:00 PM Kazunori Akiyama <[email protected]> wrote:

> Hello,
>
> I’m a radio astronomer working for the Event Horizon Telescope
> <https://eventhorizontelescope.org/> project. We are interested in Apache
> Arrow for our next-generation data format as other radio astronomy groups
> started to develop a new Arrow-based data format
> <https://github.com/ratt-ru/casa-arrow>. We are currently developing
> major software ecosystems in Julia and Python, and would like to test data
> IO interfaces with Arrow.jl and pyarrow.
>
> I’m writing this e-mail because I faced some issues in loading Arrow table
> data created in a different language. We did a very simple check: creating
> Arrow tables in Python and Julia, and loading each in the other language.
> While we confirmed that pyarrow and Arrow.jl can each read the parquet
> files they generate themselves, neither can load parquet files written by
> the other. For instance, we found:
>
>
>    - pyarrow can’t read a table written by the Arrow.write method of Julia’s
>    Arrow.jl. It returns `ArrowInvalid: Could not open Parquet input source
>    'FILENAME': Parquet magic bytes not found in footer. Either the file is
>    corrupted or this is not a parquet file.`
>    - Arrow.jl can’t read a table from pyarrow. It doesn’t show any
>    errors, but the loaded table is completely empty and doesn’t have any rows
>    and cols.
>
>
> I have attached Julia and python scripts that create parquet files of a
> very simple single-column table (juliadf.parquet from julia,
> pandasdf.parquet from python). pyarrow.parquet.read_table doesn’t work for
> juliadf.parquet, and the Arrow.Table method doesn’t work for pandasdf.parquet.
> I also attached python’s pip freeze file and Julia’s toml files just in
> case you want to see my python and julia environments.
>
> As this is a very primitive test, I’m pretty sure I made some simple
> mistakes here. What am I missing? Let me know how I should handle parquet
> files from interfaces in different languages.
>
> Thanks,
> Kazu
>
>
>
