Hello Micah and Joris,

Thanks for your answers. I understand that using the "TIME" fields of Parquet 
can be problematic in some instances.
But I still find it strange that this is the only case (I think) where 
exporting and then importing an Arrow table in a particular format (Feather, 
ORC, Parquet) changes the type of a field. There are other cases where the 
type is not supported at all, but those give a plain error during the export.
I will try to look up the Arrow schema in the Parquet file. Is there a 
particular step to perform when reading the Parquet file back so that the type 
of the DURATION field is correctly inferred?

Regards,
Louis C
________________________________
From: Micah Kornfield <[email protected]>
Sent: Thursday, 14 July 2022 08:33
To: [email protected] <[email protected]>
Subject: Re: Using Parquet adapter with type DURATION : field type loss

Hi Louis,
I would lean against doing this.  First, Parquet doesn't seem to be 
prescriptive, but I understand the Time type to have a max value of 1 day 
(i.e. 86400 seconds; this is how Arrow defines the type at least [1]).  
Durations can be larger, and that can lead to ambiguity in handling.  Second, 
the Arrow schema should be preserved by default when writing the Parquet file, 
so the type should be recoverable.  I understand this doesn't help for 
non-Arrow-based systems, but it potentially gives a workaround in some 
contexts.
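
To make the range concern concrete, here is a minimal sketch using only the 
Python standard library. It does not call Arrow or Parquet; the helper names 
(to_micros, fits_time_of_day) are hypothetical, and the [0, 1 day) bound is 
assumed from the Arrow definition referenced in [1].

```python
from datetime import timedelta

# One day expressed in microseconds (86400 seconds).
MICROS_PER_DAY = 86_400 * 1_000_000


def to_micros(d: timedelta) -> int:
    """Convert a timedelta to whole microseconds using exact integer division."""
    return d // timedelta(microseconds=1)


def fits_time_of_day(micros: int) -> bool:
    """Hypothetical check: is the value a valid time of day, i.e. in [0, 1 day)?"""
    return 0 <= micros < MICROS_PER_DAY


short = to_micros(timedelta(hours=2))        # fits in a time-of-day value
long_ = to_micros(timedelta(hours=36))       # longer than one day
negative = to_micros(timedelta(seconds=-5))  # durations may also be negative

for name, value in [("short", short), ("long", long_), ("negative", negative)]:
    print(name, value, fits_time_of_day(value))
```

A 36-hour or negative duration has no valid representation under that bound, 
so a writer would have to either reject it or wrap it, which is the ambiguity 
mentioned above.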

I think the more appropriate solution is to see if there is interest in 
extending Parquet's type system for this type OR figuring out conventions that 
are more universal for logical types that aren't in Parquet's type system.

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222

On Tue, Jul 12, 2022 at 8:02 AM Louis C <[email protected]> wrote:
Hello,

I integrated the Arrow library into a larger project, and was testing 
exports/imports of the same tables to see if it behaved well. Doing this, I 
became aware that Arrow DURATION types were exported as INT64 (as the 
corresponding number of µs, if I remember correctly) in the Parquet export, and 
then imported as INT64 types. So the Parquet export loses the type of the 
DURATION fields.
Would it not be better to export the DURATION type as the Parquet logical type 
"TIME_MICROS" (meaning TIME with microsecond precision, as TIME_MICROS seems 
to be somewhat deprecated, see 
https://apache.googlesource.com/parquet-format/+/refs/heads/bloom-filter/LogicalTypes.md), 
as MATLAB does (see 
https://fr.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html)?
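
The type loss described above can be illustrated with a small standard-library 
Python sketch. It does not use Arrow or Parquet: the function names and the 
"duration[us]" tag are made up for illustration, and the tag stands in for 
whatever schema information accompanies the stored integer.

```python
from datetime import timedelta
from typing import Optional, Union


def encode_duration_us(d: timedelta) -> int:
    """Store a duration as a plain 64-bit-style integer count of microseconds."""
    return d // timedelta(microseconds=1)


def decode(value: int, logical_type: Optional[str]) -> Union[int, timedelta]:
    """Rebuild a duration only when a type tag says the integer is one."""
    if logical_type == "duration[us]":  # hypothetical tag, for illustration only
        return timedelta(microseconds=value)
    return value  # without the tag, the reader can only return an integer


stored = encode_duration_us(timedelta(minutes=90))
plain = decode(stored, None)               # what a reader sees without schema info
restored = decode(stored, "duration[us]")  # what it sees with the original type

print(stored, type(plain).__name__, restored)
```

The integer alone carries no hint that it was ever a duration, so the original 
type can only come back if it is recorded somewhere alongside the data.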

Best regards,
Louis C
