Yes, I think you're correct that there isn't another way to do the
conversion. A more efficient conversion may be in scope for the project, so
you might consider opening a GitHub Issue [1] to discuss it further. You may
also find this past discussion [2] interesting.
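
For what it's worth, here is a rough sketch of that round trip in Julia. It
assumes Parquet.jl's exported read_parquet is available in the version you
have, and it skips the intermediate file by writing the IPC stream to an
in-memory buffer (I haven't benchmarked whether that is noticeably faster):

using Parquet, Arrow

tab = read_parquet("blah.parquet")  # Tables.jl-compatible Parquet.Table
io = IOBuffer()
Arrow.write(io, tab)                # serialize to the Arrow IPC format in memory
tab2 = Arrow.Table(seekstart(io))   # Arrow.jl's in-memory representation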

[1] https://github.com/apache/arrow-julia/issues
[2] https://github.com/apache/arrow-julia/issues/227

On Wed, Mar 22, 2023 at 6:11 AM Kazunori Akiyama <[email protected]> wrote:

> Hi Bryce,
>
> This clarifies a lot; I was indeed confused about the formats, and the
> reference [5] was really helpful in resolving that confusion.
>
> Let me ask one more question regarding the Julia interfaces before closing
> the thread. Does this mean that we don’t have a function that loads parquet
> files directly into the Julia implementation of the Arrow in-memory format?
> It looks like the only way is to convert them to the IPC format using
> Parquet.jl and Arrow.jl and then reload them. Am I correct?
>
> Like:
> using Parquet, Arrow
>
> # convert a parquet file into the Arrow IPC format
> tab = read_parquet("blah.parquet")
> Arrow.write("blah.arrow", tab)
>
> # reload it into the Arrow in-memory format
> tab2 = Arrow.Table("blah.arrow")
>
> - Kazu
>
> On Mar 21, 2023, at 6:40 PM, Bryce Mecum <[email protected]> wrote:
>
> Hi Kazu, from the behavior you're seeing and the code you've provided, it
> looks like you may be mixing up the two file formats (Arrow IPC and
> Parquet): your Julia code is using the Arrow IPC file format, whereas your
> Python code is using the Parquet file format.
>
> If you want to use Parquet to share data:
>
> - In Julia: Use the Parquet package and its read_parquet and write_parquet
> methods [1] (a short Julia sketch follows this list)
> - In Python: Use pyarrow.parquet module and its read_table and write_table
> methods [2]
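>
> For example, a rough Julia sketch for the Parquet route, assuming
> Parquet.jl's exported read_parquet and write_parquet (adjust to the API of
> the Parquet.jl version you have installed):
>
> using Parquet
>
> tab = (a = [1, 2, 3], b = ["x", "y", "z"])  # any Tables.jl-compatible table
> write_parquet("blah.parquet", tab)          # write a Parquet file
> tab2 = read_parquet("blah.parquet")         # read it back as a Tables.jl table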
>
> If you want to use Arrow IPC to share data:
>
> - In Julia: Use the Arrow package and its Arrow.Table and Arrow.write
> methods [3] (a short Julia sketch follows this list)
> - In Python: Use the pyarrow package and the IPC readers and writers [4]
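>
> And a rough Julia sketch for the Arrow IPC route (the file name is just an
> example):
>
> using Arrow
>
> tab = (a = [1, 2, 3], b = ["x", "y", "z"])  # any Tables.jl-compatible table
> Arrow.write("blah.arrow", tab)              # write an Arrow IPC file
> tab2 = Arrow.Table("blah.arrow")            # read it back as an Arrow.Table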
>
> Additionally, there is a FAQ [5] on the Apache Arrow website about formats
> that you may find relevant.
>
> [1] https://github.com/JuliaIO/Parquet.jl
> [2] https://arrow.apache.org/docs/python/parquet.html
> [3] https://arrow.juliadata.org/dev/manual/#User-Manual
> [4] https://arrow.apache.org/docs/python/ipc.html
> [5] https://arrow.apache.org/faq/#what-about-arrow-files-then
>
> On Tue, Mar 21, 2023 at 12:00 PM Kazunori Akiyama <[email protected]>
> wrote:
>
>> Hello,
>>
>> I’m a radio astronomer working for the Event Horizon Telescope
>> <https://eventhorizontelescope.org/> project. We are interested in
>> Apache Arrow for our next-generation data format, as other radio astronomy
>> groups have started to develop a new Arrow-based data format
>> <https://github.com/ratt-ru/casa-arrow>. We are currently developing our
>> major software ecosystems in Julia and Python, and would like to test data
>> IO interfaces with Arrow.jl and pyarrow.
>>
>> I’m writing this e-mail because I ran into some issues loading Arrow
>> table data created in a different language. We did a very simple check:
>> creating Arrow tables in Python and Julia, and then loading them in the
>> other language (i.e. Julia and Python, respectively). While we confirmed
>> that pyarrow and Arrow.jl can each read the parquet files they generate
>> themselves, neither can load the parquet files written by the other
>> language. For instance, we found:
>>
>>
>>    - pyarrow can’t read a table written by the Arrow.write method of
>>    Julia’s Arrow.jl. It returns `ArrowInvalid: Could not open Parquet
>>    input source 'FILENAME': Parquet magic bytes not found in footer.
>>    Either the file is corrupted or this is not a parquet file.`
>>    - Arrow.jl can’t read a table from pyarrow. It doesn’t show any
>>    errors, but the loaded table is completely empty and doesn’t have any
>>    rows or columns.
>>
>>
>> I have attached Julia and Python scripts that create parquet files of a
>> very simple single-column table (juliadf.parquet from Julia,
>> pandasdf.parquet from Python). pyarrow.parquet.read_table doesn’t work for
>> juliadf.parquet, and the Arrow.Table method doesn’t work for
>> pandasdf.parquet. I have also attached Python’s pip freeze output and
>> Julia’s toml files in case you want to see my Python and Julia
>> environments.
>>
>> As this is a very primitive test, I’m pretty sure I made some simple
>> mistakes here. What am I missing? Let me know how I should handle parquet
>> files across interfaces in different languages.
>>
>> Thanks,
>> Kazu
>>
>>
>>
>
