On Mon, 18 Jul 2022 at 10:38, Louis C <[email protected]> wrote:

> Hello Micah and Joris,
>
> Thanks for your answer. I understand that using the "TIME" fields of
> Parquet can be problematic in some instances.
> But I still find it strange that this is the only case (I think) where
> exporting/importing an Arrow table in a particular format (Feather, ORC,
> Parquet) changes the type of the field (there are other cases where the
> type is not supported at all, but those give a plain error during the export).
> I will try to look up the Arrow schema in the Parquet file. Is there a
> particular step to take when reading back the Parquet file so that the
> type of the DURATION field is correctly inferred?
>

If you are using the Arrow C++ implementation or one of its bindings (R
arrow, pyarrow, ...), this should be done automatically.


>
> Regards,
> Louis C
> ------------------------------
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Thursday, 14 July 2022 08:33
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Using Parquet adapter with type DURATION : field type loss
>
> Hi Louis,
> I would lean against doing this.  Parquet doesn't seem to be prescriptive,
> but I understand Time type to have a max value of at most 1 day (i.e. 86400
> seconds, this is how Arrow defines the type at least [1]).  Durations can
> be larger and that can lead to ambiguity in handling.  Second, the Arrow
> schema should be preserved by default when writing the Parquet file, so it
> should be recoverable. I understand this doesn't help for non-Arrow-based
> systems, but it potentially gives a workaround in some contexts.
>
> I think the more appropriate solution is to see if there is interest in
> extending Parquet's type system for this type OR figuring out conventions
> that are more universal for logical types that aren't in Parquet's type
> system.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222
>
> On Tue, Jul 12, 2022 at 8:02 AM Louis C <[email protected]> wrote:
>
> Hello,
>
> I integrated the Arrow library into a larger project, and was testing
> exports/imports of the same tables to see if it behaved well. Doing this, I
> became aware that Arrow DURATION types were exported as INT64 (as the
> corresponding number of µs, if I remember correctly) in the Parquet export,
> and then imported as INT64 types. So the Parquet export loses the type of
> the DURATION fields.
> Would it not be better to export the DURATION type as the Parquet logical
> type "TIME_MICROS" (meaning TIME with microsecond precision, as TIME_MICROS
> seems to be somewhat deprecated (
> https://apache.googlesource.com/parquet-format/+/refs/heads/bloom-filter/LogicalTypes.md))
> as MATLAB does (see
> https://fr.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html)
> ?
>
> Best regards,
> Louis C
>
>
