[jira] [Created] (PARQUET-1620) Schema creation from another schema will not be possible - deprecated

2019-07-10 Thread Werner Daehn (JIRA)
Werner Daehn created PARQUET-1620:
-

 Summary: Schema creation from another schema will not be possible 
- deprecated
 Key: PARQUET-1620
 URL: https://issues.apache.org/jira/browse/PARQUET-1620
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Werner Daehn


Imagine I have an existing schema and want to create a projection schema from 
it. One option is the org.apache.parquet.schema.Types.*Builder classes, but the 
more direct approach would be to clone the schema itself without its children.

{code:java}
List<Type> l = new ArrayList<>();
for (String c : childmappings.keySet()) {
  Mapping m = childmappings.get(c);
  l.add(m.getProjectionSchema());
}
GroupType gt = new GroupType(schema.getRepetition(), schema.getName(),
    schema.getOriginalType(), l);
{code}

 

The last line, the new GroupType(..) constructor, is deprecated. We should use 
the version that takes a LogicalTypeAnnotation instead. Fine. But how do you 
get the LogicalTypeAnnotation from an existing schema?

I feel you should not deprecate these methods, and if you do, you should 
provide an extra method to create a Type column from an existing Type column 
(the column alone, without children; otherwise the projection would include all 
child columns).
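For what it's worth, a possible answer under the 1.11.0 API is 
Type.getLogicalTypeAnnotation() combined with the Types builder. A minimal, 
untested sketch (the helper name and the projectedChildren parameter are 
illustrative, not part of parquet-mr):

```java
// Sketch: clone a GroupType without the deprecated constructor, carrying the
// logical type over from the existing schema. Assumes parquet-mr 1.11.0+,
// where Type.getLogicalTypeAnnotation() is available.
import java.util.List;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class ProjectionSketch {
  // projectedChildren stands in for the fields gathered from the mappings.
  static GroupType project(GroupType schema, List<Type> projectedChildren) {
    return Types.buildGroup(schema.getRepetition())
        // may be null for plain groups; whether .as(null) is accepted would
        // need checking against the builder implementation
        .as(schema.getLogicalTypeAnnotation())
        .addFields(projectedChildren.toArray(new Type[0]))
        .named(schema.getName());
  }
}
```

This does not remove the need for a non-deprecated "clone without children" 
convenience method, but it shows the annotation is at least retrievable.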

 

Do you agree?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread TP Boudreau
Sorry for the quick self-reply, but after brief reflection I think two
changes to my alternative proposal are required:

1.  The proposed new field should be a parameter to the TimestampType, not
FileMetaData -- file level adds unnecessary complication / opportunities
for mischief.
2.  Although reported vs. inferred is the logical distinction, practically
this change is about whether or not the TimestampType was built from a
TIMESTAMP converted type, so the name should reflect that.

After these amendments, this option boils down to: add a new boolean
parameter to the TimestampType named (something like) fromConvertedType.
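A minimal, self-contained model of that amended proposal (the classes here 
are hypothetical stand-ins for illustration, not the real parquet-format or 
parquet-mr API):

```java
// Model of the proposal: keep isAdjustedToUTC as a boolean, and record
// separately whether the type was built from a legacy TIMESTAMP_* converted
// type (i.e. the adjustment was inferred rather than reported).
public class TimestampTypeSketch {
  final boolean isAdjustedToUTC;
  final boolean fromConvertedType;

  TimestampTypeSketch(boolean isAdjustedToUTC, boolean fromConvertedType) {
    this.isAdjustedToUTC = isAdjustedToUTC;
    this.fromConvertedType = fromConvertedType;
  }

  // New-style writers report the adjustment flag directly.
  static TimestampTypeSketch reported(boolean isAdjustedToUTC) {
    return new TimestampTypeSketch(isAdjustedToUTC, false);
  }

  // A naked TIMESTAMP_MILLIS/MICROS only lets the reader *infer* UTC
  // normalization; that inference is recorded explicitly here.
  static TimestampTypeSketch inferredFromConvertedType() {
    return new TimestampTypeSketch(true, true);
  }

  public static void main(String[] args) {
    TimestampTypeSketch legacy = inferredFromConvertedType();
    // An interested reader can now tell inferred from reported adjustment.
    System.out.println(legacy.isAdjustedToUTC && legacy.fromConvertedType);
  }
}
```

Readers that care about the old/new distinction inspect fromConvertedType; 
everyone else keeps using isAdjustedToUTC unchanged.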

--TPB

On Wed, Jul 10, 2019 at 8:56 AM TP Boudreau  wrote:

> Hi Zoltan,
>
> Thank you for the helpful clarification of the community's understanding
> of the TIMESTAMP annotation.
>
> The core of the problem (IMHO) is that there is no way to distinguish in the
> new LogicalType TimestampType between the case where UTC-normalization has
> been directly reported (via a user supplied TimestampType boolean
> parameter) or merely inferred (from a naked TIMESTAMP converted type).  So
> perhaps another alternative might be to retain the isAdjustedToUTC boolean
> as is and add another field that indicates whether the adjusted flag was
> REPORTED or INFERRED (could be boolean or short variant or some type).
> This would allow interested readers to differentiate between old and new
> timestamps, while allowing other readers to enjoy the default you believe
> is warranted.  It seems most straightforward that it would be an additional
> parameter on the TimestampType, but I suppose it could reside in the
> FileMetaData struct (on the assumption that the schema elements, having
> been written by the same writer, all uniformly use converted type or
> LogicalType).
>
> --Tim
>
>
> On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi 
> wrote:
>
>> Hi Tim,
>>
>> In my opinion the specification of the older timestamp types only allowed
>> UTC-normalized storage, since these types were defined as the number of
>> milli/microseconds elapsed since the Unix epoch. This clearly defines the
>> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
>> 1970-01-01 00:00:00 UTC. It does not say anything about how this value
>> must
>> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
>> typically it is displayed adjusted to the user's local timezone, for
>> example "1970-01-01 01:00:00" for a user in Paris. I don't think this
>> definition allows interpreting the numeric value 0 as "1970-01-01
>> 00:00:00"
>> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
>> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
>> 10^6 for _MICROS) instead.
>>
>> I realize that compatibility with real-life usage patterns is important
>> regardless of whether they comply with the specification or not, but I
>> can't think of any solution that would be useful in practice. The
>> suggestion to turn the boolean into an enum would certainly allow Parquet
>> to have timestamps with unknown semantics, but I don't know what value
>> that
>> would bring to applications and how they would use it. I'm also afraid
>> that
>> the undefined semantics would get misused/overused by developers who are
>> not sure about the difference between the two semantics and we would end
>> up
>> with a lot of meaningless timestamps.
>>
>> Even with the problems I listed, your suggestion may still be better than
>> the current solution, but before making a community decision I would like
>> to continue this discussion focusing on the following questions:
>>
>>- What are the implications of this change?
>>- How will unknown semantics be used in practice?
>>- Does it bring value?
>>- Can we do better?
>>- Can we even change the boolean to an enum? It has been specified like
>>that and released a long time ago. Although I am not aware of any
>> software
>>component that would have already implemented it, I was also unaware of
>>software components using TIMESTAMP_MILLIS and _MICROS for local
>> semantics.
>>
>> One alternative that comes to my mind is to default to the more common
>> UTC-normalized semantics but allow overriding it in the reader schema.
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau  wrote:
>>
>> > I'm not a long-time Parquet user, but I assisted in the expansion of the
>> > parquet-cpp library's LogicalType facility.
>> >
>> > My impression is that the original TIMESTAMP converted types were
>> silent on
>> > whether the annotated value was UTC adjusted and that (often arcane)
>> > out-of-band information had to be relied on by readers to decide the UTC
>> > adjustment status for timestamp columns.  It seemed to me that that
>> > perceived shortcoming was a primary motivator for adding the
>> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp
>> > LogicalType.  If that impression is accurate, 

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread TP Boudreau
Hi Zoltan,

Thank you for the helpful clarification of the community's understanding of
the TIMESTAMP annotation.

The core of the problem (IMHO) is that there is no way to distinguish in the
new LogicalType TimestampType between the case where UTC-normalization has
been directly reported (via a user supplied TimestampType boolean
parameter) or merely inferred (from a naked TIMESTAMP converted type).  So
perhaps another alternative might be to retain the isAdjustedToUTC boolean
as is and add another field that indicates whether the adjusted flag was
REPORTED or INFERRED (could be boolean or short variant or some type).
This would allow interested readers to differentiate between old and new
timestamps, while allowing other readers to enjoy the default you believe
is warranted.  It seems most straightforward that it would be an additional
parameter on the TimestampType, but I suppose it could reside in the
FileMetaData struct (on the assumption that the schema elements, having
been written by the same writer, all uniformly use converted type or
LogicalType).

--Tim


On Wed, Jul 10, 2019 at 6:48 AM Zoltan Ivanfi 
wrote:

> Hi Tim,
>
> In my opinion the specification of the older timestamp types only allowed
> UTC-normalized storage, since these types were defined as the number of
> milli/microseconds elapsed since the Unix epoch. This clearly defines the
> meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
> 1970-01-01 00:00:00 UTC. It does not say anything about how this value must
> be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
> typically it is displayed adjusted to the user's local timezone, for
> example "1970-01-01 01:00:00" for a user in Paris. I don't think this
> definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00"
> in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
> which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
> 10^6 for _MICROS) instead.
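Zoltan's epoch arithmetic above can be checked with plain java.time (a 
self-contained illustration only, not Parquet code):

```java
// Check of the epoch arithmetic: numeric value 0 is 1970-01-01 00:00:00 UTC,
// displayed as 01:00 for a user in Paris, whereas a *local* 1970-01-01 00:00
// in Paris is the distinct instant -3600 seconds.
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class EpochSemantics {
  public static void main(String[] args) {
    ZoneId paris = ZoneId.of("Europe/Paris");

    // Instant semantics: value 0 always means 1970-01-01 00:00:00 UTC...
    ZonedDateTime shownInParis = Instant.EPOCH.atZone(paris);
    System.out.println(shownInParis.toLocalDateTime()); // 1970-01-01T01:00

    // ...whereas the wall-clock time 1970-01-01 00:00:00 in Paris (UTC+1)
    // corresponds to -3600 seconds relative to the epoch.
    long epochSecond = LocalDateTime.of(1970, 1, 1, 0, 0)
        .atZone(paris).toEpochSecond();
    System.out.println(epochSecond); // -3600
  }
}
```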
>
> I realize that compatibility with real-life usage patterns is important
> regardless of whether they comply with the specification or not, but I
> can't think of any solution that would be useful in practice. The
> suggestion to turn the boolean into an enum would certainly allow Parquet
> to have timestamps with unknown semantics, but I don't know what value that
> would bring to applications and how they would use it. I'm also afraid that
> the undefined semantics would get misused/overused by developers who are
> not sure about the difference between the two semantics and we would end up
> with a lot of meaningless timestamps.
>
> Even with the problems I listed, your suggestion may still be better than
> the current solution, but before making a community decision I would like
> to continue this discussion focusing on the following questions:
>
>- What are the implications of this change?
>- How will unknown semantics be used in practice?
>- Does it bring value?
>- Can we do better?
>- Can we even change the boolean to an enum? It has been specified like
>that and released a long time ago. Although I am not aware of any
> software
>component that would have already implemented it, I was also unaware of
>software components using TIMESTAMP_MILLIS and _MICROS for local
> semantics.
>
> One alternative that comes to my mind is to default to the more common
> UTC-normalized semantics but allow overriding it in the reader schema.
>
> Thanks,
>
> Zoltan
>
> On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau  wrote:
>
> > I'm not a long-time Parquet user, but I assisted in the expansion of the
> > parquet-cpp library's LogicalType facility.
> >
> > My impression is that the original TIMESTAMP converted types were silent
> on
> > whether the annotated value was UTC adjusted and that (often arcane)
> > out-of-band information had to be relied on by readers to decide the UTC
> > adjustment status for timestamp columns.  It seemed to me that that
> > perceived shortcoming was a primary motivator for adding the
> > isAdjustedToUTC boolean parameter to the corresponding new Timestamp
> > LogicalType.  If that impression is accurate, then when reading TIMESTAMP
> > columns written by legacy (converted type only) writers, it seems
> > inappropriate for LogicalType aware readers to unconditionally assign
> > *either* "false" or "true" (as currently required) to a boolean UTC
> > adjusted parameter, as that requires the reader to infer a property that
> > wasn't implied by the writer.
> >
> > One possible approach to untangling this might be to amend the
> > parquet.thrift specification to change the isAdjustedToUTC boolean
> property
> > to an enum or union type (some enumerated list) named (for example)
> > UTCAdjustment with three possible values: Unknown, UTCAdjusted,
> > NotUTCAdjusted (I'm not married to the names).  Extant files with
> TIMESTAMP
> > converted types only would map for forward compatibility to Timestamp
> > 

[jira] [Updated] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2019-07-10 Thread Zoltan Ivanfi (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1222:
---
Description: 
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial 
ordering with strange behaviour in specific corner cases. For example, 
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN 
to anything always returns false. This ordering is not suitable for statistics. 
Additionally, the Java implementation already uses a different (total) ordering 
that handles these cases correctly but differently than the C++ 
implementations, which leads to interoperability problems.

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. This ordering should be 
effective and easy to implement in all programming languages.
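The partial-order corner cases, and the total order that Java's standard 
library already provides, can be seen with a short stdlib-only snippet 
(illustration only, not parquet-mr code):

```java
// Why IEEE-754 comparison is unsuitable for statistics, and the total
// ordering Double.compare already implements in Java.
public class FloatOrdering {
  public static void main(String[] args) {
    // Partial order: -0.0 and +0.0 compare equal, NaN compares false to all.
    System.out.println(-0.0 == 0.0);              // true
    System.out.println(Double.NaN < 1.0);         // false
    System.out.println(Double.NaN == Double.NaN); // false

    // Total order: -0.0 sorts below +0.0, NaN sorts above everything.
    System.out.println(Double.compare(-0.0, 0.0) < 0);
    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0);
  }
}
```

A C++ implementation would have to replicate this total order explicitly, 
since operator< on doubles gives only the partial order.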

  was:
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial 
ordering with strange behaviour in specific corner cases. For example, 
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to 
anything always returns false. This ordering is not suitable for statistics. 
Additionally, the Java implementation already uses a different (total) ordering 
that handles these cases correctly but differently than the C++ 
implementations, which leads to interoperability problems.

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. This ordering should be 
effective and easy to implement in all programming languages.


> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.





Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Wes McKinney
Correct

On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi  wrote:
>
> Hi Wes,
>
> Do you mean that the new logical types have already been released in 0.14.0
> and a 0.14.1 is needed ASAP to fix this regression?
>
> Thanks,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney  wrote:
>
> > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > issues and others we need to make a new release within the next 7-10
> > days. We can point you to nightly Python builds to make testing for
> > you easier so you don't have to build the project yourself.
> >
> > - Wes
> >
> > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi 
> > wrote:
> > >
> > > Hi,
> > >
> > > Oh, and one more thing: Before releasing the next Arrow version
> > > incorporating the new logical types, we should definitely test that their
> > > behaviour matches that of parquet-mr. When is the next release planned to
> > > come out?
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi  wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > Yes, I agree that we should do that, but then we have a problem of
> > what to
> > > > do in the other direction, i.e. when we use the new logical types API
> > to
> > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > > normalized flag? Tim has started a discussion about that, suggesting
> > to use
> > > > three states that I just answered.
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney 
> > wrote:
> > > >
> > > >> Thanks for the comments.
> > > >>
> > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > >> types to maintain forward compatibility and stay consistent with what
> > > >> we were doing in the C++ library prior to the introduction of the
> > > >> LogicalType metadata.
> > > >>
> > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi 
> > > >>  > >
> > > >> wrote:
> > > >> >
> > > >> > Hi Wes,
> > > >> >
> > > >> > Both of the semantics are deterministic in one aspect and
> > > >> indeterministic
> > > >> > in another. Timestamps of instant semantics will always refer to the
> > same
> > > >> > instant, but their user-facing representation (how they get
> > displayed)
> > > >> > depends on the user's time zone. Timestamps of local semantics
> > always
> > > >> have
> > > >> > the same user-facing representation but the instant they refer to is
> > > >> > undefined (or ambiguous, depending on your point of view).
> > > >> >
> > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > timestamps
> > > >> are
> > > >> > stored normalized to UTC and are displayed adjusted to the user's
> > local
> > > >> > time zone.
> > > >> >
> > > >> > Br,
> > > >> >
> > > >> > Zoltan
> > > >> >
> > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney 
> > > >> wrote:
> > > >> >
> > > >> > > Thanks Zoltan.
> > > >> > >
> > > >> > > This is definitely a tricky issue.
> > > >> > >
> > > >> > > Spark's application of localtime semantics to timestamp data has
> > been
> > > >> > > a source of issues for many people. Personally I don't find that
> > > >> > > behavior to be particularly helpful since depending on the session
> > > >> > > time zone, you will get different results on data not marked as
> > > >> > > UTC-normalized.
> > > >> > >
> > > >> > > In pandas, by contrast, we apply UTC semantics to
> > > >> > > naive/not-explicitly-normalized data so at least the code produces
> > > >> > > deterministic results on all environments. That seems strictly
> > better
> > > >> > > to me -- if you want a localized interpretation of naive data, you
> > > >> > > must opt into this rather than having it automatically selected
> > based
> > > >> > > on your locale. The instances of people shooting their toes off
> > due to
> > > >> > > time zones are practically non-existent, whereas I'm hearing about
> > > >> > > Spark gotchas all the time.
> > > >> > >
> > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> >  > > >> >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > Hi Wes,
> > > >> > > >
> > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > based on
> > > >> the
> > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > > >> been used
> > > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > > >> assumption was
> > > >> > > > supported by two sources:
> > > >> > > >
> > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > and
> > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> > since
> > > >> the
> > > >> > > Unix
> > > >> > > > epoch, an instant specified in UTC, from which it follows that
> > they
> > > >> have
> > > >> > > > instant semantics (because timestamps of local semantics do not
> > > >> > > correspond
> > > >> > > > to a single instant).
> > > >> > > >
> > > >> > > > 2. Anecdotal knowledge: We were not aware of any software

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Zoltan Ivanfi
Hi Wes,

Do you mean that the new logical types have already been released in 0.14.0
and a 0.14.1 is needed ASAP to fix this regression?

Thanks,

Zoltan

On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney  wrote:

> hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> issues and others we need to make a new release within the next 7-10
> days. We can point you to nightly Python builds to make testing for
> you easier so you don't have to build the project yourself.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi 
> wrote:
> >
> > Hi,
> >
> > Oh, and one more thing: Before releasing the next Arrow version
> > incorporating the new logical types, we should definitely test that their
> > behaviour matches that of parquet-mr. When is the next release planned to
> > come out?
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi  wrote:
> >
> > > Hi Wes,
> > >
> > > Yes, I agree that we should do that, but then we have a problem of
> what to
> > > do in the other direction, i.e. when we use the new logical types API
> to
> > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > normalized flag? Tim has started a discussion about that, suggesting
> to use
> > > three states that I just answered.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney 
> wrote:
> > >
> > >> Thanks for the comments.
> > >>
> > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > >> types to maintain forward compatibility and stay consistent with what
> > >> we were doing in the C++ library prior to the introduction of the
> > >> LogicalType metadata.
> > >>
> > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi  >
> > >> wrote:
> > >> >
> > >> > Hi Wes,
> > >> >
> > >> > Both of the semantics are deterministic in one aspect and
> > >> indeterministic
> > >> > in another. Timestamps of instant semantics will always refer to the
> same
> > >> > instant, but their user-facing representation (how they get
> displayed)
> > >> > depends on the user's time zone. Timestamps of local semantics
> always
> > >> have
> > >> > the same user-facing representation but the instant they refer to is
> > >> > undefined (or ambiguous, depending on your point of view).
> > >> >
> > >> > My understanding is that Spark uses instant semantics, i.e.,
> timestamps
> > >> are
> > >> > stored normalized to UTC and are displayed adjusted to the user's
> local
> > >> > time zone.
> > >> >
> > >> > Br,
> > >> >
> > >> > Zoltan
> > >> >
> > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney 
> > >> wrote:
> > >> >
> > >> > > Thanks Zoltan.
> > >> > >
> > >> > > This is definitely a tricky issue.
> > >> > >
> > >> > > Spark's application of localtime semantics to timestamp data has
> been
> > >> > > a source of issues for many people. Personally I don't find that
> > >> > > behavior to be particularly helpful since depending on the session
> > >> > > time zone, you will get different results on data not marked as
> > >> > > UTC-normalized.
> > >> > >
> > >> > > In pandas, by contrast, we apply UTC semantics to
> > >> > > naive/not-explicitly-normalized data so at least the code produces
> > >> > > deterministic results on all environments. That seems strictly
> better
> > >> > > to me -- if you want a localized interpretation of naive data, you
> > >> > > must opt into this rather than having it automatically selected
> based
> > >> > > on your locale. The instances of people shooting their toes off
> due to
> > >> > > time zones are practically non-existent, whereas I'm hearing about
> > >> > > Spark gotchas all the time.
> > >> > >
> > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
>  > >> >
> > >> > > wrote:
> > >> > > >
> > >> > > > Hi Wes,
> > >> > > >
> > >> > > > The rules for TIMESTAMP forward-compatibility were created
> based on
> > >> the
> > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > >> been used
> > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > >> assumption was
> > >> > > > supported by two sources:
> > >> > > >
> > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> and
> > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> since
> > >> the
> > >> > > Unix
> > >> > > > epoch, an instant specified in UTC, from which it follows that
> they
> > >> have
> > >> > > > instant semantics (because timestamps of local semantics do not
> > >> > > correspond
> > >> > > > to a single instant).
> > >> > > >
> > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> component
> > >> that
> > >> > > > used these types differently from the specification.
> > >> > > >
> > >> > > > Based on your e-mail, we were wrong on #2.
> > >> > > >
> > >> > > > From this false premise it followed that TIMESTAMPs with local
> > >> semantics
> > >> > > > were a new type and did not need to be annotated with the old
> types
> > >> to

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Wes McKinney
hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
issues and others we need to make a new release within the next 7-10
days. We can point you to nightly Python builds to make testing for
you easier so you don't have to build the project yourself.

- Wes

On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi  wrote:
>
> Hi,
>
> Oh, and one more thing: Before releasing the next Arrow version
> incorporating the new logical types, we should definitely test that their
> behaviour matches that of parquet-mr. When is the next release planned to
> come out?
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi  wrote:
>
> > Hi Wes,
> >
> > Yes, I agree that we should do that, but then we have a problem of what to
> > do in the other direction, i.e. when we use the new logical types API to
> > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > normalized flag? Tim has started a discussion about that, suggesting to use
> > three states that I just answered.
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney  wrote:
> >
> >> Thanks for the comments.
> >>
> >> So in summary I think that we need to set the TIMESTAMP_* converted
> >> types to maintain forward compatibility and stay consistent with what
> >> we were doing in the C++ library prior to the introduction of the
> >> LogicalType metadata.
> >>
> >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi 
> >> wrote:
> >> >
> >> > Hi Wes,
> >> >
> >> > Both of the semantics are deterministic in one aspect and
> >> indeterministic
> >> > in another. Timestamps of instant semantics will always refer to the same
> >> > instant, but their user-facing representation (how they get displayed)
> >> > depends on the user's time zone. Timestamps of local semantics always
> >> have
> >> > the same user-facing representation but the instant they refer to is
> >> > undefined (or ambiguous, depending on your point of view).
> >> >
> >> > My understanding is that Spark uses instant semantics, i.e., timestamps
> >> are
> >> > stored normalized to UTC and are displayed adjusted to the user's local
> >> > time zone.
> >> >
> >> > Br,
> >> >
> >> > Zoltan
> >> >
> >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney 
> >> wrote:
> >> >
> >> > > Thanks Zoltan.
> >> > >
> >> > > This is definitely a tricky issue.
> >> > >
> >> > > Spark's application of localtime semantics to timestamp data has been
> >> > > a source of issues for many people. Personally I don't find that
> >> > > behavior to be particularly helpful since depending on the session
> >> > > time zone, you will get different results on data not marked as
> >> > > UTC-normalized.
> >> > >
> >> > > In pandas, by contrast, we apply UTC semantics to
> >> > > naive/not-explicitly-normalized data so at least the code produces
> >> > > deterministic results on all environments. That seems strictly better
> >> > > to me -- if you want a localized interpretation of naive data, you
> >> > > must opt into this rather than having it automatically selected based
> >> > > on your locale. The instances of people shooting their toes off due to
> >> > > time zones are practically non-existent, whereas I'm hearing about
> >> > > Spark gotchas all the time.
> >> > >
> >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi 
> >> > >  >> >
> >> > > wrote:
> >> > > >
> >> > > > Hi Wes,
> >> > > >
> >> > > > The rules for TIMESTAMP forward-compatibility were created based on
> >> the
> >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> >> been used
> >> > > > in the instant aka. UTC-normalized semantics so far. This
> >> assumption was
> >> > > > supported by two sources:
> >> > > >
> >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
> >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since
> >> the
> >> > > Unix
> >> > > > epoch, an instant specified in UTC, from which it follows that they
> >> have
> >> > > > instant semantics (because timestamps of local semantics do not
> >> > > correspond
> >> > > > to a single instant).
> >> > > >
> >> > > > 2. Anecdotal knowledge: We were not aware of any software component
> >> that
> >> > > > used these types differently from the specification.
> >> > > >
> >> > > > Based on your e-mail, we were wrong on #2.
> >> > > >
> >> > > > From this false premise it followed that TIMESTAMPs with local
> >> semantics
> >> > > > were a new type and did not need to be annotated with the old types
> >> to
> >> > > > maintain compatibility. In fact, annotating them with the old types
> >> were
> >> > > > considered to be harmful, since it would have misled older readers
> >> into
> >> > > > thinking that they can read TIMESTAMPs with local semantics, when in
> >> > > > reality they would have misinterpreted them as TIMESTAMPs with
> >> instant
> >> > > > semantics. This would have led to a difference of several hours,
> >> > > > corresponding to the time zone offset.
> >> > > >
> 

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Zoltan Ivanfi
Hi,

Oh, and one more thing: Before releasing the next Arrow version
incorporating the new logical types, we should definitely test that their
behaviour matches that of parquet-mr. When is the next release planned to
come out?

Br,

Zoltan

On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi  wrote:

> Hi Wes,
>
> Yes, I agree that we should do that, but then we have a problem of what to
> do in the other direction, i.e. when we use the new logical types API to
> read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> normalized flag? Tim has started a discussion about that, suggesting to use
> three states that I just answered.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney  wrote:
>
>> Thanks for the comments.
>>
>> So in summary I think that we need to set the TIMESTAMP_* converted
>> types to maintain forward compatibility and stay consistent with what
>> we were doing in the C++ library prior to the introduction of the
>> LogicalType metadata.
>>
>> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi 
>> wrote:
>> >
>> > Hi Wes,
>> >
>> > Both of the semantics are deterministic in one aspect and
>> indeterministic
>> > in another. Timestamps of instant semantics will always refer to the same
>> > instant, but their user-facing representation (how they get displayed)
>> > depends on the user's time zone. Timestamps of local semantics always
>> have
>> > the same user-facing representation but the instant they refer to is
> >> > undefined (or ambiguous, depending on your point of view).
>> >
>> > My understanding is that Spark uses instant semantics, i.e., timestamps
>> are
>> > stored normalized to UTC and are displayed adjusted to the user's local
>> > time zone.
>> >
>> > Br,
>> >
>> > Zoltan
>> >
>> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney 
>> wrote:
>> >
>> > > Thanks Zoltan.
>> > >
>> > > This is definitely a tricky issue.
>> > >
>> > > Spark's application of localtime semantics to timestamp data has been
>> > > a source of issues for many people. Personally I don't find that
>> > > behavior to be particularly helpful since depending on the session
>> > > time zone, you will get different results on data not marked as
>> > > UTC-normalized.
>> > >
> >> > > In pandas, by contrast, we apply UTC semantics to
>> > > naive/not-explicitly-normalized data so at least the code produces
>> > > deterministic results on all environments. That seems strictly better
>> > > to me -- if you want a localized interpretation of naive data, you
>> > > must opt into this rather than having it automatically selected based
>> > > on your locale. The instances of people shooting their toes off due to
>> > > time zones are practically non-existent, whereas I'm hearing about
>> > > Spark gotchas all the time.
>> > >
>> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi > >
>> > > wrote:
>> > > >
>> > > > Hi Wes,
>> > > >
>> > > > The rules for TIMESTAMP forward-compatibility were created based on
>> the
>> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
>> been used
>> > > > in the instant aka. UTC-normalized semantics so far. This
>> assumption was
>> > > > supported by two sources:
>> > > >
>> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
>> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since
>> the
>> > > Unix
>> > > > epoch, an instant specified in UTC, from which it follows that they
>> have
>> > > > instant semantics (because timestamps of local semantics do not
>> > > correspond
>> > > > to a single instant).
>> > > >
>> > > > 2. Anecdotal knowledge: We were not aware of any software component
>> that
>> > > > used these types differently from the specification.
>> > > >
>> > > > Based on your e-mail, we were wrong on #2.
>> > > >
>> > > > From this false premise it followed that TIMESTAMPs with local
>> semantics
>> > > > were a new type and did not need to be annotated with the old types
>> to
>> > > > maintain compatibility. In fact, annotating them with the old types
>> was
>> > > > considered to be harmful, since it would have misled older readers
>> into
>> > > > thinking that they can read TIMESTAMPs with local semantics, when in
>> > > > reality they would have misinterpreted them as TIMESTAMPs with
>> instant
>> > > > semantics. This would have led to a difference of several hours,
>> > > > corresponding to the time zone offset.
>> > > >
>> > > > In the light of your e-mail, this misinterpretation of timestamps
>> may
>> > > > already be happening, since if Arrow annotates local timestamps with
>> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets
>> them
>> > > as
>> > > > timestamps with instant semantics, leading to a difference of
>> several
>> > > hours.
>> > > >
>> > > > Based on this, I think it would make sense from Arrow's point of
>> view to
>> > > > annotate both semantics with the old types, since that is its
>> historical
>> > > > behaviour and keeping it up is needed for 

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Zoltan Ivanfi
Hi Wes,

Yes, I agree that we should do that, but then we have a problem of what to
do in the other direction, i.e. when we use the new logical types API to
read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
normalized flag? Tim has started a discussion about that, suggesting to use
three states that I just answered.

Br,

Zoltan

On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney  wrote:

> Thanks for the comments.
>
> So in summary I think that we need to set the TIMESTAMP_* converted
> types to maintain forward compatibility and stay consistent with what
> we were doing in the C++ library prior to the introduction of the
> LogicalType metadata.
>
> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi 
> wrote:
> >
> > Hi Wes,
> >
> > Both of the semantics are deterministic in one aspect and indeterministic
> > in another. Timestamps of instant semantics will always refer to the same
> > instant, but their user-facing representation (how they get displayed)
> > depends on the user's time zone. Timestamps of local semantics always
> have
> > the same user-facing representation but the instant they refer to is
> > undefined (or ambiguous, depending on your point of view).
> >
> > My understanding is that Spark uses instant semantics, i.e., timestamps
> are
> > stored normalized to UTC and are displayed adjusted to the user's local
> > time zone.
> >
> > Br,
> >
> > Zoltan
> >
> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney  wrote:
> >
> > > Thanks Zoltan.
> > >
> > > This is definitely a tricky issue.
> > >
> > > Spark's application of localtime semantics to timestamp data has been
> > > a source of issues for many people. Personally I don't find that
> > > behavior to be particularly helpful since depending on the session
> > > time zone, you will get different results on data not marked as
> > > UTC-normalized.
> > >
> > > In pandas, by contrast, we apply UTC semantics to
> > > naive/not-explicitly-normalized data so at least the code produces
> > > deterministic results on all environments. That seems strictly better
> > > to me -- if you want a localized interpretation of naive data, you
> > > must opt into this rather than having it automatically selected based
> > > on your locale. The instances of people shooting their toes off due to
> > > time zones are practically non-existent, whereas I'm hearing about
> > > Spark gotchas all the time.
> > >
> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi  >
> > > wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > The rules for TIMESTAMP forward-compatibility were created based on
> the
> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been
> used
> > > > in the instant aka. UTC-normalized semantics so far. This assumption
> was
> > > > supported by two sources:
> > > >
> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since
> the
> > > Unix
> > > > epoch, an instant specified in UTC, from which it follows that they
> have
> > > > instant semantics (because timestamps of local semantics do not
> > > correspond
> > > > to a single instant).
> > > >
> > > > 2. Anecdotal knowledge: We were not aware of any software component
> that
> > > > used these types differently from the specification.
> > > >
> > > > Based on your e-mail, we were wrong on #2.
> > > >
> > > > From this false premise it followed that TIMESTAMPs with local
> semantics
> > > > were a new type and did not need to be annotated with the old types
> to
> > > > maintain compatibility. In fact, annotating them with the old types
> was
> > > > considered to be harmful, since it would have misled older readers
> into
> > > > thinking that they can read TIMESTAMPs with local semantics, when in
> > > > reality they would have misinterpreted them as TIMESTAMPs with
> instant
> > > > semantics. This would have led to a difference of several hours,
> > > > corresponding to the time zone offset.
> > > >
> > > > In the light of your e-mail, this misinterpretation of timestamps may
> > > > already be happening, since if Arrow annotates local timestamps with
> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets
> them
> > > as
> > > > timestamps with instant semantics, leading to a difference of several
> > > hours.
> > > >
> > > > Based on this, I think it would make sense from Arrow's point of
> view to
> > > > annotate both semantics with the old types, since that is its
> historical
> > > > behaviour and keeping it up is needed for maintaining compatibility.
> I'm
> > > not
> > > > so sure about the Java library though, since as far as I know, these
> > > types
> > > > were never used in the local sense there (although I may be wrong
> again).
> > > > Were we to decide that Arrow and parquet-mr should behave
> differently in
> > > > this aspect though, it may be tricky to convey this distinction in
> the
> > > > specification. I would be interested in hearing your and 

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Wes McKinney
Thanks for the comments.

So in summary I think that we need to set the TIMESTAMP_* converted
types to maintain forward compatibility and stay consistent with what
we were doing in the C++ library prior to the introduction of the
LogicalType metadata.
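A minimal sketch of the forward-compatibility mapping under discussion (illustrative Python; a hypothetical helper, not the parquet-cpp implementation):

```python
# Illustrative sketch of the forward-compatibility rule discussed here:
# which legacy ConvertedType (if any) a writer emits alongside the new
# Timestamp logical type. Hypothetical helper, not parquet-cpp code.

def converted_type_for_timestamp(is_adjusted_to_utc: bool, unit: str):
    """Return the legacy ConvertedType name, or None if there is no mapping."""
    if is_adjusted_to_utc and unit == "MILLIS":
        return "TIMESTAMP_MILLIS"
    if is_adjusted_to_utc and unit == "MICROS":
        return "TIMESTAMP_MICROS"
    # parquet.thrift defines no converted type for local-semantics
    # timestamps (or for NANOS), so pre-LogicalType readers see no
    # annotation at all -- the compatibility gap raised in this thread.
    return None
```

Under this reading, only the UTC-normalized MILLIS/MICROS variants stay visible to old readers; everything else is silently dropped from their view.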

On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi  wrote:
>
> Hi Wes,
>
> Both of the semantics are deterministic in one aspect and indeterministic
> in another. Timestamps of instant semantics will always refer to the same
> instant, but their user-facing representation (how they get displayed)
> depends on the user's time zone. Timestamps of local semantics always have
> the same user-facing representation but the instant they refer to is
> undefined (or ambiguous, depending on your point of view).
>
> My understanding is that Spark uses instant semantics, i.e., timestamps are
> stored normalized to UTC and are displayed adjusted to the user's local
> time zone.
>
> Br,
>
> Zoltan
>
> On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney  wrote:
>
> > Thanks Zoltan.
> >
> > This is definitely a tricky issue.
> >
> > Spark's application of localtime semantics to timestamp data has been
> > a source of issues for many people. Personally I don't find that
> > behavior to be particularly helpful since depending on the session
> > time zone, you will get different results on data not marked as
> > UTC-normalized.
> >
> > In pandas, by contrast, we apply UTC semantics to
> > naive/not-explicitly-normalized data so at least the code produces
> > deterministic results on all environments. That seems strictly better
> > to me -- if you want a localized interpretation of naive data, you
> > must opt into this rather than having it automatically selected based
> > on your locale. The instances of people shooting their toes off due to
> > time zones are practically non-existent, whereas I'm hearing about
> > Spark gotchas all the time.
> >
> > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi 
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > The rules for TIMESTAMP forward-compatibility were created based on the
> > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used
> > > in the instant aka. UTC-normalized semantics so far. This assumption was
> > > supported by two sources:
> > >
> > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
> > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the
> > Unix
> > > epoch, an instant specified in UTC, from which it follows that they have
> > > instant semantics (because timestamps of local semantics do not
> > correspond
> > > to a single instant).
> > >
> > > 2. Anecdotal knowledge: We were not aware of any software component that
> > > used these types differently from the specification.
> > >
> > > Based on your e-mail, we were wrong on #2.
> > >
> > > From this false premise it followed that TIMESTAMPs with local semantics
> > > were a new type and did not need to be annotated with the old types to
> > > maintain compatibility. In fact, annotating them with the old types was
> > > considered to be harmful, since it would have misled older readers into
> > > thinking that they can read TIMESTAMPs with local semantics, when in
> > > reality they would have misinterpreted them as TIMESTAMPs with instant
> > > semantics. This would have led to a difference of several hours,
> > > corresponding to the time zone offset.
> > >
> > > In the light of your e-mail, this misinterpretation of timestamps may
> > > already be happening, since if Arrow annotates local timestamps with
> > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets them
> > as
> > > timestamps with instant semantics, leading to a difference of several
> > hours.
> > >
> > > Based on this, I think it would make sense from Arrow's point of view to
> > > annotate both semantics with the old types, since that is its historical
> > > behaviour and keeping it up is needed for maintaining compatibility. I'm
> > not
> > > so sure about the Java library though, since as far as I know, these
> > types
> > > were never used in the local sense there (although I may be wrong again).
> > > Were we to decide that Arrow and parquet-mr should behave differently in
> > > this aspect though, it may be tricky to convey this distinction in the
> > > specification. I would be interested in hearing your and other
> > developers'
> > > opinions on this.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney  wrote:
> > >
> > > > hi folks,
> > > >
> > > > We have just recently implemented the new LogicalType unions in the
> > > > Parquet C++ library and we have run into a forward compatibility
> > > > problem with reader versions prior to this implementation.
> > > >
> > > > To recap the issue, prior to the introduction of LogicalType, the
> > > > Parquet format had no explicit notion of time zones or UTC
> > > > normalization. The new TimestampType provides a flag to indicate
> > > > 

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Zoltan Ivanfi
Hi Tim,

In my opinion the specification of the older timestamp types only allowed
UTC-normalized storage, since these types were defined as the number of
milli/microseconds elapsed since the Unix epoch. This clearly defines the
meaning of the numeric value 0 as 0 seconds after the Unix epoch, i.e.
1970-01-01 00:00:00 UTC. It does not say anything about how this value must
be displayed, i.e. it may be displayed as "1970-01-01 00:00:00 UTC", but
typically it is displayed adjusted to the user's local timezone, for
example "1970-01-01 01:00:00" for a user in Paris. I don't think this
definition allows interpreting the numeric value 0 as "1970-01-01 00:00:00"
in Paris, since the latter would correspond to 1969-12-31 23:00:00 UTC,
which must be stored as the numeric value -3600 (times 10^3 for _MILLIS or
10^6 for _MICROS) instead.
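The arithmetic above can be checked with a short sketch using only the Python standard library (the time zone is fixed at UTC+1, Paris in winter, for illustration):

```python
# A sketch of the epoch arithmetic above, standard library only.
from datetime import datetime, timezone, timedelta

paris = timezone(timedelta(hours=1))  # UTC+1, Paris in winter

# Numeric value 0 in TIMESTAMP_MILLIS/MICROS denotes the Unix epoch instant,
# 1970-01-01 00:00:00 UTC. A user in Paris sees it displayed as 01:00:00.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(epoch.astimezone(paris))  # 1970-01-01 01:00:00+01:00

# The local wall-clock time 1970-01-01 00:00:00 in Paris is a different
# instant, 1969-12-31 23:00:00 UTC, i.e. numeric value -3600 seconds
# (times 10^3 for _MILLIS or 10^6 for _MICROS).
local_midnight_paris = datetime(1970, 1, 1, tzinfo=paris)
print(int(local_midnight_paris.timestamp()))  # -3600
```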

I realize that compatibility with real-life usage patterns is important
regardless of whether they comply with the specification or not, but I
can't think of any solution that would be useful in practice. The
suggestion to turn the boolean into an enum would certainly allow Parquet
to have timestamps with unknown semantics, but I don't know what value that
would bring to applications and how they would use it. I'm also afraid that
the undefined semantics would get misused/overused by developers who are
not sure about the difference between the two semantics and we would end up
with a lot of meaningless timestamps.

Even with the problems I listed your suggestion may still be better than
the current solution, but before making a community decision I would like
to continue this discussion focusing on the following questions:

   - What are the implications of this change?
   - How will unknown semantics be used in practice?
   - Does it bring value?
   - Can we do better?
   - Can we even change the boolean to an enum? It has been specified like
   that and released a long time ago. Although I am not aware of any software
   component that would have already implemented it, I was also unaware of
   software components using TIMESTAMP_MILLIS and _MICROS for local semantics.

One alternative that comes to my mind is to default to the more common
UTC-normalized semantics but allow overriding it in the reader schema.

Thanks,

Zoltan

On Tue, Jul 9, 2019 at 9:52 PM TP Boudreau  wrote:

> I'm not a long-time Parquet user, but I assisted in the expansion of the
> parquet-cpp library's LogicalType facility.
>
> My impression is that the original TIMESTAMP converted types were silent on
> whether the annotated value was UTC adjusted and that (often arcane)
> out-of-band information had to be relied on by readers to decide the UTC
> adjustment status for timestamp columns.  It seemed to me that that
> perceived shortcoming was a primary motivator for adding the
> isAdjustedToUTC boolean parameter to the corresponding new Timestamp
> LogicalType.  If that impression is accurate, then when reading TIMESTAMP
> columns written by legacy (converted type only) writers, it seems
> inappropriate for LogicalType aware readers to unconditionally assign
> *either* "false" or "true" (as currently required) to a boolean UTC
> adjusted parameter, as that requires the reader to infer a property that
> wasn't implied by the writer.
>
> One possible approach to untangling this might be to amend the
> parquet.thrift specification to change the isAdjustedToUTC boolean property
> to an enum or union type (some enumerated list) named (for example)
> UTCAdjustment with three possible values: Unknown, UTCAdjusted,
> NotUTCAdjusted (I'm not married to the names).  Extant files with TIMESTAMP
> converted types only would map for forward compatibility to Timestamp
> LogicalTypes with UTCAdjustment:=Unknown .  New files with user supplied
> Timestamp LogicalTypes would always record the converted type as TIMESTAMP
> for backward compatibility regardless of the value of the new UTCAdjustment
> parameter (this would be lossy on a round-trip through a legacy library,
> but that's unavoidable -- and the legacy libraries would be no worse off
> than they are now).  The specification would normatively state that new
> user supplied Timestamp LogicalTypes SHOULD (or MUST?) use either
> UTCAdjusted or NotUTCAdjusted (discouraging the use of Unknown in new
> files).
>
> Thanks, Tim
>
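Tim's proposal, if adopted, might be sketched in parquet.thrift roughly as follows (hypothetical names and field ids, not part of any released specification):

```thrift
// Hypothetical sketch only -- names and field ids are illustrative.
enum UTCAdjustment {
  UNKNOWN = 0;          // inferred from a legacy TIMESTAMP_* converted type
  UTC_ADJUSTED = 1;     // instant semantics, normalized to UTC
  NOT_UTC_ADJUSTED = 2; // local semantics, no UTC normalization
}

struct TimestampType {
  1: required UTCAdjustment adjustment  // would replace "required bool isAdjustedToUTC"
  2: required TimeUnit unit
}
```

Note that replacing the already-released boolean field would be a breaking change to the Thrift struct, which is exactly the concern raised in the follow-up ("Can we even change the boolean to an enum?").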


Re: [VOTE] Parquet Bloom filter spec sign-off

2019-07-10 Thread 俊杰陈
I see, will resume this next week.  Thanks.



On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi  wrote:
>
> Hi Junjie,
>
> Since there are ongoing improvements addressing review comments, I would
> hold off with the vote for a few more days until the specification settles.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈  wrote:
>
> > Hi Parquet committers and developers
> >
> > We are waiting for your important ballot:)
> >
> > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈  wrote:
> > >
> > > Yes, there are some public benchmark results, such as the official
> > > benchmark from xxhash site (http://www.xxhash.com/) and published
> > > comparison from smhasher project
> > > (https://github.com/rurban/smhasher/).
> > >
> > >
> > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney  wrote:
> > > >
> > > > Do you have any benchmark data to support the choice of hash function?
> > > >
> > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈  wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > To simplify the voting, I'd like to update the voting content to the
> > spec
> > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > >
> > > > > Thanks for your participation.
> > > > >
> > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈  wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > Parquet Bloom filter has been developed for a while, per the
> > discussion on the mailing list, it's time to call a vote for the spec to move
> > forward. The current spec can be found at
> > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > There are some different options about the internal hash choice of Bloom
> > filter and the PR is for that concern.
> > > > > >
> > > > > > So I'd like to propose to vote the spec + hash option, for
> > example:
> > > > > >
> > > > > > +1 to spec and xxHash
> > > > > > +1 to spec and murmur3
> > > > > > ...
> > > > > >
> > > > > > Please help to vote, any feedback is also welcome in the
> > discussion thread.
> > > > > >
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >



--
Thanks & Best Regards


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

2019-07-10 Thread Zoltan Ivanfi
Hi Wes,

Both of the semantics are deterministic in one aspect and indeterministic
in another. Timestamps of instant semantics will always refer to the same
instant, but their user-facing representation (how they get displayed)
depends on the user's time zone. Timestamps of local semantics always have
the same user-facing representation but the instant they refer to is
undefined (or ambiguous, depending on your point of view).

My understanding is that Spark uses instant semantics, i.e., timestamps are
stored normalized to UTC and are displayed adjusted to the user's local
time zone.

Br,

Zoltan

On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney  wrote:

> Thanks Zoltan.
>
> This is definitely a tricky issue.
>
> Spark's application of localtime semantics to timestamp data has been
> a source of issues for many people. Personally I don't find that
> behavior to be particularly helpful since depending on the session
> time zone, you will get different results on data not marked as
> UTC-normalized.
>
> In pandas, by contrast, we apply UTC semantics to
> naive/not-explicitly-normalized data so at least the code produces
> deterministic results on all environments. That seems strictly better
> to me -- if you want a localized interpretation of naive data, you
> must opt into this rather than having it automatically selected based
> on your locale. The instances of people shooting their toes off due to
> time zones are practically non-existent, whereas I'm hearing about
> Spark gotchas all the time.
>
> On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi 
> wrote:
> >
> > Hi Wes,
> >
> > The rules for TIMESTAMP forward-compatibility were created based on the
> > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used
> > in the instant aka. UTC-normalized semantics so far. This assumption was
> > supported by two sources:
> >
> > 1. The specification: parquet-format defined TIMESTAMP_MILLIS and
> > TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the
> Unix
> > epoch, an instant specified in UTC, from which it follows that they have
> > instant semantics (because timestamps of local semantics do not
> correspond
> > to a single instant).
> >
> > 2. Anecdotal knowledge: We were not aware of any software component that
> > used these types differently from the specification.
> >
> > Based on your e-mail, we were wrong on #2.
> >
> > From this false premise it followed that TIMESTAMPs with local semantics
> > were a new type and did not need to be annotated with the old types to
> > maintain compatibility. In fact, annotating them with the old types was
> > considered to be harmful, since it would have misled older readers into
> > thinking that they can read TIMESTAMPs with local semantics, when in
> > reality they would have misinterpreted them as TIMESTAMPs with instant
> > semantics. This would have led to a difference of several hours,
> > corresponding to the time zone offset.
> >
> > In the light of your e-mail, this misinterpretation of timestamps may
> > already be happening, since if Arrow annotates local timestamps with
> > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets them
> as
> > timestamps with instant semantics, leading to a difference of several
> hours.
> >
> > Based on this, I think it would make sense from Arrow's point of view to
> > annotate both semantics with the old types, since that is its historical
> > behaviour and keeping it up is needed for maintaining compatibility. I'm
> not
> > so sure about the Java library though, since as far as I know, these
> types
> > were never used in the local sense there (although I may be wrong again).
> > Were we to decide that Arrow and parquet-mr should behave differently in
> > this aspect though, it may be tricky to convey this distinction in the
> > specification. I would be interested in hearing your and other
> developers'
> > opinions on this.
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney  wrote:
> >
> > > hi folks,
> > >
> > > We have just recently implemented the new LogicalType unions in the
> > > Parquet C++ library and we have run into a forward compatibility
> > > problem with reader versions prior to this implementation.
> > >
> > > To recap the issue, prior to the introduction of LogicalType, the
> > > Parquet format had no explicit notion of time zones or UTC
> > > normalization. The new TimestampType provides a flag to indicate
> > > UTC-normalization
> > >
> > > struct TimestampType {
> > > 1: required bool isAdjustedToUTC
> > > 2: required TimeUnit unit
> > > }
> > >
> > > When using this new type, the ConvertedType field must also be set for
> > > forward compatibility (so that old readers can still understand the
> > > data), but parquet.thrift says
> > >
> > > // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC =
> > > true, unit = MICROS)
> > > // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC =
> > > true, unit 

[jira] [Commented] (PARQUET-1609) support xxhash in bloom filter

2019-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881985#comment-16881985
 ] 

ASF GitHub Bot commented on PARQUET-1609:
-

jbapple commented on pull request #143: PARQUET-1609: Specify which xxhash 
carefully
URL: https://github.com/apache/parquet-format/pull/143
 
 
   The hash function "xxhash" is actually a number of different hash
   functions including xxHash, XXH64, XXH32, and XXH3. Additionally,
   these hash functions accept "seeds", as most modern hash functions do,
   including MurmurHash variants.
   
   This patch specifies that the BloomFilter hash function default is
   XXH64 with a seed of 0. It omits the confusing note about the ISA and
   different variants of xxHash, since XXH64 is apparently
   architecture-independent.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> support xxhash in bloom filter
> --
>
> Key: PARQUET-1609
> URL: https://issues.apache.org/jira/browse/PARQUET-1609
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.6.0
>Reporter: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1609) support xxhash in bloom filter

2019-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1609:

Labels: pull-request-available  (was: )

> support xxhash in bloom filter
> --
>
> Key: PARQUET-1609
> URL: https://issues.apache.org/jira/browse/PARQUET-1609
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.6.0
>Reporter: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Commented] (PARQUET-1617) Add more details to bloom filter spec

2019-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881940#comment-16881940
 ] 

ASF GitHub Bot commented on PARQUET-1617:
-

zivanfi commented on pull request #140: PARQUET-1617: Add more detail to Bloom 
filter spec
URL: https://github.com/apache/parquet-format/pull/140
 
 
   
 



> Add more details to bloom filter spec
> -
>
> Key: PARQUET-1617
> URL: https://issues.apache.org/jira/browse/PARQUET-1617
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> The current spec doesn't contain deep detail of some reference, which may 
> bring confusion.





[jira] [Commented] (PARQUET-1619) Merge crypto spec and structures to format master

2019-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881937#comment-16881937
 ] 

ASF GitHub Bot commented on PARQUET-1619:
-

ggershinsky commented on pull request #142: PARQUET-1619: Merge encryption in 
format master
URL: https://github.com/apache/parquet-format/pull/142
 
 
   
 



> Merge crypto spec and structures to format master
> -
>
> Key: PARQUET-1619
> URL: https://issues.apache.org/jira/browse/PARQUET-1619
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Updated] (PARQUET-1619) Merge crypto spec and structures to format master

2019-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1619:

Labels: pull-request-available  (was: )

> Merge crypto spec and structures to format master
> -
>
> Key: PARQUET-1619
> URL: https://issues.apache.org/jira/browse/PARQUET-1619
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1619) Merge crypto spec and structures to format master

2019-07-10 Thread Gidon Gershinsky (JIRA)
Gidon Gershinsky created PARQUET-1619:
-

 Summary: Merge crypto spec and structures to format master
 Key: PARQUET-1619
 URL: https://issues.apache.org/jira/browse/PARQUET-1619
 Project: Parquet
  Issue Type: Sub-task
  Components: parquet-format
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Parquet Bloom filter spec sign-off

2019-07-10 Thread Zoltan Ivanfi
Hi Junjie,

Since there are ongoing improvements addressing review comments, I would
hold off with the vote for a few more days until the specification settles.

Br,

Zoltan

On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈  wrote:

> Hi Parquet committers and developers
>
> We are waiting for your important ballot:)
>
> On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈  wrote:
> >
> > Yes, there are some public benchmark results, such as the official
> > benchmark from xxhash site (http://www.xxhash.com/) and published
> > comparison from smhasher project
> > (https://github.com/rurban/smhasher/).
> >
> >
> > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney  wrote:
> > >
> > > Do you have any benchmark data to support the choice of hash function?
> > >
> > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈  wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > To simplify the voting, I'd like to update the voting content to
> > > > the spec with the xxHash hash strategy. Now you can reply with +1
> > > > or -1.
> > > >
> > > > Thanks for your participation.
> > > >
> > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈  wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > The Parquet Bloom filter has been developed for a while; per the
> > > > > discussion on the mailing list, it's time to call a vote on the
> > > > > spec to move it forward. The current spec can be found at
> > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > There are different options for the internal hash choice of the
> > > > > Bloom filter, and the PR addresses that concern.
> > > > >
> > > > > So I'd like to propose voting on the spec + hash option, for
> > > > > example:
> > > > >
> > > > > +1 to spec and xxHash
> > > > > +1 to spec and murmur3
> > > > > ...
> > > > >
> > > > > Please help vote; any feedback is also welcome in the
> > > > > discussion thread.
> > > > >
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards
>

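For context on the hash choice under vote: a Bloom filter derives several bit positions per value from a fast hash. The spec selects xxHash (XXH64), but since the third-party `xxhash` package may not be installed everywhere, this self-contained sketch substitutes Python's stdlib `hashlib` to derive the two 64-bit seeds. The class name, parameters, and layout here are purely illustrative; this is NOT the split-block layout defined in BloomFilter.md.

```python
import hashlib


class SimpleBloomFilter:
    """Illustrative Bloom filter, not the Parquet split-block layout.

    The spec under vote uses XXH64; here two 64-bit values are taken
    from a stdlib SHA-256 digest purely for a self-contained demo, and
    combined via the Kirsch-Mitzenmacher double-hashing scheme.
    """

    def __init__(self, num_bits: int = 1024, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: bytes):
        # Derive two 64-bit hashes, then combine: pos_i = h1 + i * h2.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def insert(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))
```

A production implementation would replace the SHA-256 seed derivation with XXH64 and use the block layout from the spec; the false-positive/never-false-negative behavior shown here is the same in either case.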

[jira] [Commented] (PARQUET-1618) Update encryption spec for Bloom filter encryption

2019-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881851#comment-16881851
 ] 

ASF GitHub Bot commented on PARQUET-1618:
-

ggershinsky commented on pull request #141: PARQUET-1618: Update encryption 
spec for bloom filter encryption
URL: https://github.com/apache/parquet-format/pull/141
 
 
   
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update encryption spec for Bloom filter encryption
> --
>
> Key: PARQUET-1618
> URL: https://issues.apache.org/jira/browse/PARQUET-1618
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>
> update Encryption.md with the new module types for Bloom filter encryption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1618) Update encryption spec for Bloom filter encryption

2019-07-10 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1618:

Labels: pull-request-available  (was: )

> Update encryption spec for Bloom filter encryption
> --
>
> Key: PARQUET-1618
> URL: https://issues.apache.org/jira/browse/PARQUET-1618
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-format
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>  Labels: pull-request-available
>
> update Encryption.md with the new module types for Bloom filter encryption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1552) upgrade protoc-jar-maven-plugin to 3.8.0

2019-07-10 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PARQUET-1552.

Resolution: Fixed

> upgrade protoc-jar-maven-plugin to 3.8.0
> 
>
> Key: PARQUET-1552
> URL: https://issues.apache.org/jira/browse/PARQUET-1552
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> The current protoc-jar-maven-plugin has a problem when building the 
> project from behind a proxy. The latest 3.8.0 release fixes this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1552) upgrade protoc-jar-maven-plugin to 3.8.0

2019-07-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881827#comment-16881827
 ] 

ASF GitHub Bot commented on PARQUET-1552:
-

nandorKollar commented on pull request #659: PARQUET-1552: upgrade 
protoc-jar-maven-plugin to 3.8.0 to fix proxy issue
URL: https://github.com/apache/parquet-mr/pull/659
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> upgrade protoc-jar-maven-plugin to 3.8.0
> 
>
> Key: PARQUET-1552
> URL: https://issues.apache.org/jira/browse/PARQUET-1552
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Junjie Chen
>Assignee: Junjie Chen
>Priority: Minor
>  Labels: pull-request-available
>
> The current protoc-jar-maven-plugin has a problem when building the 
> project from behind a proxy. The latest 3.8.0 release fixes this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Parquet Bloom filter spec sign-off

2019-07-10 Thread 俊杰陈
Hi Parquet committers and developers

We are waiting for your important ballot:)

On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈  wrote:
>
> Yes, there are some public benchmark results, such as the official
> benchmark from xxhash site (http://www.xxhash.com/) and published
> comparison from smhasher project
> (https://github.com/rurban/smhasher/).
>
>
> On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney  wrote:
> >
> > Do you have any benchmark data to support the choice of hash function?
> >
> > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈  wrote:
> > >
> > > Dear Parquet developers
> > >
> > > To simplify the voting, I'd like to update the voting content to the
> > > spec with the xxHash hash strategy. Now you can reply with +1 or -1.
> > >
> > > Thanks for your participation.
> > >
> > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈  wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > The Parquet Bloom filter has been developed for a while; per the 
> > > > discussion on the mailing list, it's time to call a vote on the spec 
> > > > to move it forward. The current spec can be found at 
> > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md. 
> > > > There are different options for the internal hash choice of the 
> > > > Bloom filter, and the PR addresses that concern.
> > > >
> > > > So I'd like to propose voting on the spec + hash option, for example:
> > > >
> > > > +1 to spec and xxHash
> > > > +1 to spec and murmur3
> > > > ...
> > > >
> > > > Please help vote; any feedback is also welcome in the discussion 
> > > > thread.
> > > >
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards



-- 
Thanks & Best Regards


[jira] [Created] (PARQUET-1618) Update encryption spec for Bloom filter encryption

2019-07-10 Thread Gidon Gershinsky (JIRA)
Gidon Gershinsky created PARQUET-1618:
-

 Summary: Update encryption spec for Bloom filter encryption
 Key: PARQUET-1618
 URL: https://issues.apache.org/jira/browse/PARQUET-1618
 Project: Parquet
  Issue Type: Sub-task
  Components: parquet-format
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


update Encryption.md with the new module types for Bloom filter encryption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)