[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-02-13 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318547#comment-16318547
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 2/13/18 10:50 AM:
--

[~lv], we cannot change the ordering for the existing {{min}} and {{max}} 
fields for int96 timestamps, because statistics were already written for them 
according to the wrong byte order.

We do not want to define a new int96 ordering for the new {{min-value}} and 
{{max-value}} fields either, because:
 # We cannot distinguish between an int96 number and a timestamp stored in an 
int96, although they would require different endianness.
 # Introducing reverse-endian ordering would put an unnecessary burden on 
implementors for the sake of a legacy type that we would like to get rid of.


was (Author: zi):
[~lv], we cannot change the ordering for the existing {{min}} and {{max}} 
fields for int96 timestamps, because statistics were already written for them 
according to the wrong byte order.

We do not want to define a new int96 ordering for the new {{min-value}} and 
{{max-value}} fields either, because:

# We can distinguish between a timestamp stored in an int96 that requires 
little-endian ordering and an actual int96 that requires big-endian ordering.
# Introducing little-endian ordering would put an unnecessary burden on 
implementors for the sake of a legacy type that we would like to get rid of.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>Priority: Major
> Fix For: 1.10.0
>
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)
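
To illustrate the description above, here is a minimal Python sketch (not taken from parquet-mr) that computes min and max over three timestamps in three ways: chronologically, by unsigned numeric value of the stored bytes, and by signed numeric value. The helper names and the three sample values are assumptions chosen for illustration. The layout (8 little-endian bytes of nanoseconds followed by 4 little-endian bytes of Julian day) follows the byte layout discussed in this issue.

```python
import struct

def int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Hypothetical helper: 8 little-endian bytes of nanos, then 4 of Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)

def as_number(raw: bytes, signed: bool) -> int:
    """Read the 12 stored bytes as one big-endian integer."""
    return int.from_bytes(raw, byteorder="big", signed=signed)

# Sample values (assumed for illustration): (label, julian_day, nanos_of_day)
values = [
    ("2000-01-01 +1ns", 2451545, 1),
    ("2000-01-01 +255ns", 2451545, 255),
    ("2000-01-02 +0ns", 2451546, 0),
]
encoded = {label: int96_timestamp(jd, ns) for label, jd, ns in values}
chrono_key = {label: (jd, ns) for label, jd, ns in values}

def min_max(key):
    """Return the (min, max) labels under the given sort key."""
    return min(encoded, key=key), max(encoded, key=key)

chronological = min_max(lambda k: chrono_key[k])
unsigned = min_max(lambda k: as_number(encoded[k], signed=False))
signed = min_max(lambda k: as_number(encoded[k], signed=True))

print("chronological:", chronological)
print("unsigned:     ", unsigned)
print("signed:       ", signed)
```

With these inputs, neither numeric ordering agrees with the chronological one, so min/max statistics computed either way would be unusable for filtering.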



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-02-13 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 2/13/18 10:41 AM:
--

Unfortunately, since INT96 timestamps are stored in the opposite byte order 
from the one INT96 numbers are supposed to use, the most significant byte of 
the number interpretation will vary wildly, spanning the whole range between 
0x00 and 0xFF. As a result, when comparing the raw bytes, signed and unsigned 
comparison can lead to different results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store second precision only. In these cases, the 
least significant byte is always 0x00, so comparison signedness does not affect 
the order. The problem I described above is only present for sub-second 
precisions. (To be more exact, it affects precisions finer than 10^8 nanosec = 
100 msec, since 10^8 = 2^8 * 5^8 and the 2^8 factor makes the least significant 
byte zero.)
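
The argument above can be checked with a short Python sketch. It is only an illustration: the helper names are made up, and the layout (8 little-endian bytes of nanoseconds followed by 4 little-endian bytes of Julian day) is the one discussed in this issue. It verifies that whole seconds and multiples of 10^8 ns leave the first stored byte at zero, then compares two timestamps 10 ms apart, where the first stored byte of one of them is 0x80.

```python
import struct

def int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Hypothetical helper: 8 little-endian bytes of nanos, then 4 of Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)

def as_number(raw: bytes, signed: bool) -> int:
    """Read the 12 stored bytes as one big-endian integer."""
    return int.from_bytes(raw, byteorder="big", signed=signed)

DAY = 2451545  # Julian day number of 2000-01-01

# Whole seconds and multiples of 10^8 ns always end in a zero byte,
# because 10^8 = 2^8 * 5^8 is divisible by 256.
whole_seconds_ok = all((s * 10**9) & 0xFF == 0 for s in range(86400))
tenths_ok = all((k * 10**8) & 0xFF == 0 for k in range(10))

# 10 ms = 10^7 ns = 2^7 * 5^7, so its lowest byte is 0x80.
a = int96_timestamp(DAY, 10**7)      # 00:00:00.010
b = int96_timestamp(DAY, 2 * 10**7)  # 00:00:00.020

unsigned_a_before_b = as_number(a, signed=False) < as_number(b, signed=False)
signed_a_before_b = as_number(a, signed=True) < as_number(b, signed=True)

print(hex(a[0]), hex(b[0]), unsigned_a_before_b, signed_a_before_b)
```

Here the first stored byte of the earlier timestamp is 0x80, which a signed interpretation treats as negative, so the signed and unsigned comparisons give opposite answers.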


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store second precision only. In these cases, the 
least significant byte is always 0x00, so comparison signedness does not affect 
the order. The problem I described above is only present for sub-second 
precisions. (To be more exact, it affects precisions below 10^8 nanosec = 100 
msec, since 10^8 = 2^8 * 5^8 and the 2^8 part makes the least significant byte 
zero.)



[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318446#comment-16318446
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 1/9/18 1:56 PM:


I think it is worth going through an example: the timestamp '2000-01-01 
12:34:56' stored as an int96 and dumped with parquet-tools:

{noformat}
$ parquet-tools dump 
hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd_1937418586_data.0.parq
 | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320
{noformat}

Since 117253024523396126668760320 = 0x0060FD4B3229000059682500, the 12 bytes 
are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between 
the time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part. If we reverse the bytes, we get 
0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 
minutes + 56 seconds.

59 68 25 00 is the date part. If we reverse the bytes, we get 0x00256859 = 
2451545 as the Julian day number, which [corresponds 
to|http://aa.usno.navy.mil/jdconverter?ID=AA=2451545] 2000-01-01.

For correct ordering based purely on numerical value, the example above should 
not be interpreted in comparisons as 0x0060FD4B3229000059682500 = 
117253024523396126668760320, as it currently is, but as 
0x00256859000029324BFD6000 = 45223023200227578716446720 instead. But since we 
do not want to introduce a new little-endian comparison order, we should just 
deprecate the ordering for this type.
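
The decoding steps of this example can be reproduced with a few lines of Python. This is only a sketch for checking the arithmetic, not code from any Parquet implementation. The variable names are made up, and the constant 2440588 (the Julian day number of 1970-01-01) is an addition used here to turn the Julian day into a calendar date.

```python
import struct
from datetime import date, timedelta

# The 12 stored bytes from the example: time part, then date part.
raw = bytes.fromhex("0060FD4B32290000" "59682500")

# Both parts are little-endian: 8 bytes of nanoseconds, 4 bytes of Julian day.
nanos_of_day, julian_day = struct.unpack("<qi", raw)

# Julian day number of 1970-01-01 (assumed constant, not from the comment).
UNIX_EPOCH_JULIAN_DAY = 2440588
day = date(1970, 1, 1) + timedelta(days=julian_day - UNIX_EPOCH_JULIAN_DAY)

seconds = nanos_of_day // 10**9
time_of_day = (seconds // 3600, seconds % 3600 // 60, seconds % 60)

# What parquet-tools prints: the bytes read as one big-endian number.
as_dumped = int.from_bytes(raw, byteorder="big")

# What a chronologically correct numeric key would be: date part in the
# high bits, time part in the low bits (the fully byte-reversed reading).
chronological_key = (julian_day << 64) | nanos_of_day

print(day, time_of_day)
print(as_dumped)
print(chronological_key)
```

The last two printed values are the two interpretations contrasted in the comment: the first matches the parquet-tools dump, the second is the value that would sort correctly.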


was (Author: zi):
I think it is worth going through an example, the timestamp '2000-01-01 
12:34:56' stored as an int96 and dumped with parquet-tools:

$ parquet-tools dump 
hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd_1937418586_data.0.parq
 | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320

Since 117253024523396126668760320 = 0x0060FD4B3229000059682500, the 12 bytes are 
00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the 
time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part, if we reverse the bytes we get 
0x29324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 
minutes + 56 seconds.

59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 
2451545 as the Julian day number, which corresponds to 2000-01-01.

For correct ordering based purely on numerical value, in comparisons the 
example above should not be interpreted as 0x0060FD4B3229000059682500 = 
117253024523396126668760320 like it currently is, but as 
0x00256859000029324BFD6000 = 45223023200227578716446720 instead. But since we 
do not want to introduce a new little-endian comparison order, we should just 
deprecate the ordering for this type.



[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-17 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/17/17 9:21 AM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store second precision only. In these cases, the 
least significant byte is always 0x00, so comparison signedness does not affect 
the order. The problem I described above is only present for sub-second 
precisions. (To be more exact, it affects precisions below 10^8 nanosec = 100 
msec, since 10^8 = 2^8 * 5^8 and the 2^8 part makes the least significant byte 
zero.)


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.



[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:59 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.



[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:57 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is used to its full extent, 
which may be a negligible fraction of all use cases. Still, the possibility is 
there.



[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:56 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is used to its full extent, 
which may be a negligible fraction of all use cases. Still, the possibility is 
there.


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.
