Thank you, Micah! That makes sense. Do you have any thoughts on adding a logged warning when a user calls write_table() with uint32() in the given schema?
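A minimal user-side sketch of such a warning (the wrapper below is hypothetical, not an existing pyarrow API; it assumes a pyarrow release where the default Parquet format version still uses the v1 logical types):

    import warnings

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_table_warning_on_uint32(table, where, **kwargs):
        """Hypothetical wrapper around pq.write_table() that warns when a
        uint32 column is about to be silently widened to int64."""
        # Format versions "2.4" and "2.6" can annotate unsigned 32-bit
        # columns; the v1 default (at the time of this thread) cannot.
        if kwargs.get("version") not in ("2.4", "2.6"):
            for field in table.schema:
                if pa.types.is_uint32(field.type):
                    warnings.warn(
                        f"Column {field.name!r} is uint32 and will be stored "
                        "as int64 under Parquet logical types v1; pass "
                        "version='2.4' or version='2.6' to preserve the type."
                    )
        pq.write_table(table, where, **kwargs)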
On Fri, Jan 28, 2022 at 11:15 AM Micah Kornfield <[email protected]> wrote:

> Hi Grant,
> This is intended behavior: by default, parquet is written with version 1
> of the logical types. Version 1 does not support annotating fields as
> uint32, so to preserve the values across a round trip they are cast to
> int64. If you wish to maintain the type, setting the version kwarg to 2.4
> or 2.6 [1] should work.
>
> Cheers,
> Micah
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
>
> On Fri, Jan 28, 2022 at 9:04 AM Grant Williams <[email protected]> wrote:
>
>> Hello,
>>
>> I've found that if you write a file whose schema specifies column A as
>> uint32(), then read the file back and inspect the schema, column A is
>> shown as int64(). This issue appears to be unique to the uint32() type;
>> I was unable to produce a mismatch with any of the other integer or
>> float types.
>>
>> Here is a gist with a minimal code example and its output:
>> https://gist.github.com/grantmwilliams/1ceb490312c59e4fb6e4bc15b57e9707
>>
>> I'm not sure whether the physical datatype is actually being written as
>> int64 or whether only the file's metadata is wrong. Does anyone have an
>> idea what could be causing this, and whether it's just a metadata issue
>> or an actual physical type error?
>>
>> Thanks,
>> Grant W.
>> --
>> Grant Williams
>> Machine Learning Engineer
>> https://github.com/grantmwilliams/

--
Grant Williams
Machine Learning Engineer
https://github.com/grantmwilliams/
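For anyone who lands on this thread later, a minimal sketch of the round trip Micah describes (file names are arbitrary; the commented output assumes a pyarrow release where the default Parquet format version is still 1.0):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A one-column table whose schema explicitly declares uint32.
    table = pa.table(
        {"a": [1, 2, 3]},
        schema=pa.schema([pa.field("a", pa.uint32())]),
    )

    # Default (v1 logical types): uint32 is cast to int64 on write.
    pq.write_table(table, "default.parquet")
    print(pq.read_table("default.parquet").schema.field("a").type)  # int64

    # Opting in to the newer logical types preserves the unsigned type.
    pq.write_table(table, "v26.parquet", version="2.6")
    print(pq.read_table("v26.parquet").schema.field("a").type)  # uint32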
