On Mon, 18 Jul 2022 at 10:38, Louis C <[email protected]> wrote:
> Hello Micah and Joris,
>
> Thanks for your answer. I understand that using the "TIME" fields of
> Parquet can be problematic in some instances.
> But I still find it strange that this is the only case (I think) in which
> exporting/importing an Arrow table in a particular format (Feather, ORC,
> Parquet) changes the type of the field (there are other cases where the
> type is not supported at all, but that gives a plain error during the export).
> I will try to look up the Arrow schema in the Parquet file. Is there a
> particular task to be done when reading back the Parquet file so that the
> type of the DURATION field is correctly inferred?
>
If you are using the Arrow C++ implementation or one of its bindings (R arrow, pyarrow, ...), this should be done automatically.
>
> Regards,
> Louis C
> ------------------------------
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Thursday, 14 July 2022 08:33
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Using Parquet adapter with type DURATION : field type loss
>
> Hi Louis,
> I would lean against doing this. Parquet doesn't seem to be prescriptive,
> but I understand the Time type to have a max value of at most 1 day (i.e. 86400
> seconds; this is how Arrow defines the type at least [1]). Durations can
> be larger, and that can lead to ambiguity in handling. Second, the Arrow
> schema should be preserved by default when writing the Parquet file, so it
> should be recoverable. I understand this doesn't help for non-Arrow-based
> systems, but it potentially gives a workaround in some contexts.
>
> I think the more appropriate solution is to see if there is interest in
> extending Parquet's type system for this type OR figuring out conventions
> that are more universal for logical types that aren't in Parquet's type
> system.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222
>
> On Tue, Jul 12, 2022 at 8:02 AM Louis C <[email protected]> wrote:
>
> Hello,
>
> I integrated the Arrow library into a larger project, and was testing
> exports/imports of the same tables to see if it behaved well. Doing this, I
> became aware that Arrow DURATION types were exported as INT64 (as the
> corresponding number of µs, if I remember correctly) in the Parquet export,
> and then imported as INT64 types. So the Parquet export loses the type for
> the DURATION fields.
> Would it not be better to export the DURATION type as the Parquet logical
> type "TIME_MICROS" (meaning TIME with microsecond precision, as TIME_MICROS seems
> to be somewhat deprecated (
> https://apache.googlesource.com/parquet-format/+/refs/heads/bloom-filter/LogicalTypes.md))
> as MATLAB does (see
> https://fr.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html)
> ?
>
> Best regards,
> Louis C
>
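To make the range concern with the TIME proposal quoted above a bit more concrete, here is a rough sketch using only the Python standard library. It assumes, as Micah describes for Arrow's time types, that a time-of-day value must stay below one day (86400 seconds). The helper names `duration_to_micros` and `fits_time_of_day` are made up for illustration and are not part of Arrow or Parquet.

```python
from datetime import timedelta

# Assumption taken from the thread: a TIME value is a time of day,
# so it must be non-negative and strictly less than one day.
MICROS_PER_DAY = 86_400 * 1_000_000


def duration_to_micros(d: timedelta) -> int:
    """Convert a duration to whole microseconds, the way an INT64 column would hold it."""
    return d // timedelta(microseconds=1)


def fits_time_of_day(micros: int) -> bool:
    """Hypothetical check: could this value be read back as a valid time of day?"""
    return 0 <= micros < MICROS_PER_DAY


short_d = duration_to_micros(timedelta(hours=2))
long_d = duration_to_micros(timedelta(hours=36))
negative_d = duration_to_micros(timedelta(seconds=-5))

for label, value in [("2 h", short_d), ("36 h", long_d), ("-5 s", negative_d)]:
    print(f"{label}: {value} us, fits TIME range: {fits_time_of_day(value)}")
```

Under that assumption only the 2 hour duration would survive as a TIME value; the 36 hour and negative durations fall outside the range, which is the kind of ambiguity a plain INT64 export avoids.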
