Hi Louis,
I would lean against doing this, for two reasons.  First, Parquet doesn't
seem to be prescriptive here, but I understand the Time type to have a
maximum value of at most 1 day (i.e. 86400 seconds; this is how Arrow
defines the type, at least [1]).  Durations can be larger, and that can
lead to ambiguity in handling.  Second, the Arrow schema should be
preserved by default when writing the Parquet file, so the Duration type
should be recoverable on read.  I understand this doesn't help for
non-Arrow-based systems, but it potentially gives a workaround in some
contexts.
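
To illustrate the ambiguity point, here is a minimal Python sketch using
only the standard library.  It assumes a time-of-day value must lie in
[0, 86400 s), which is how I read the Arrow definition in [1].  The helper
names fits_time_of_day and wrap_to_time_of_day are hypothetical and not
part of Arrow or Parquet, and wrapping modulo one day is just one
convention a writer could pick.

```python
from datetime import timedelta

# Assumed upper bound (exclusive) for a time-of-day value, in microseconds.
MICROS_PER_DAY = 86_400 * 1_000_000


def to_micros(d: timedelta) -> int:
    """Convert a duration to a whole number of microseconds."""
    return d // timedelta(microseconds=1)


def fits_time_of_day(d: timedelta) -> bool:
    """Return True if the duration lies in [0, one day)."""
    return 0 <= to_micros(d) < MICROS_PER_DAY


def wrap_to_time_of_day(d: timedelta) -> int:
    """Hypothetical lossy convention: wrap the duration modulo one day."""
    return to_micros(d) % MICROS_PER_DAY


short = timedelta(hours=2)        # representable as a time of day
long = timedelta(hours=36)        # exceeds one day
negative = timedelta(seconds=-5)  # durations may also be negative

for d in (short, long, negative):
    print(d, fits_time_of_day(d), wrap_to_time_of_day(d))
```

With this wrapping convention, a 36-hour duration and a 12-hour duration
end up with the same stored value, so a reader cannot tell them apart.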

I think the more appropriate solution is to see if there is interest in
extending Parquet's type system with a duration type, or in agreeing on
conventions that are more universal for logical types that aren't in
Parquet's type system.

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222

On Tue, Jul 12, 2022 at 8:02 AM Louis C <[email protected]> wrote:

> Hello,
>
> I integrated the Arrow library into a larger project, and was testing
> exports/imports of the same tables to see if it behaved well. Doing this, I
> became aware that Arrow DURATION types were exported as INT64 (as the
> corresponding number of µs, if I remember correctly) in the Parquet export,
> and then imported as INT64 types. So the Parquet export loses the type for
> the DURATION fields.
> Would it not be better to export the DURATION type as the Parquet logical
> type "TIME_MICROS" (meaning TIME with precision micro, as TIME_MICROS seems
> to be somewhat deprecated (
> https://apache.googlesource.com/parquet-format/+/refs/heads/bloom-filter/LogicalTypes.md))
> as MATLAB does (see
> https://fr.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html)
> ?
>
> Best regards,
> Louis C
>
