[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-22 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617935417


   @wesm do you have any comment on this change?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617369335


   After the 1-day investigation, I knew the implementation looks a little 
complicated.   
   Regarding encoding, `TypedBufferBuilder` is used in some test cases, but it 
is not in some test cases. I am thinking about where we should make data 
little-endian.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617149858


   I agree with it. Once it is stable, it looks good. Under the development, 
developers want the reader and writer independently. At least, for me.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617113951


   Thank you for your suggestion. I think that these files are used only for 
read test now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-21 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-617087421


   I think that the current test cases for parquet writer do not have tests to 
verify the bit pattern of the generated parquet file. I will also create the 
test case in another PR since they are important on the different native endian 
platforms.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616759329


   Thank you for staring the discussion. I will watch at the thread.
   
   Yeah, `LETypedBufferBuilder` makes sense. It looks better than adding 
`AppendLE`.
   
   Regarding `Serialize`, it looks a good place where the class has both types 
of an element for Arrow and Parquet. But, encoding (i.e RLE and others) happens 
in `encoding.cc`. Let me check it tomorrow.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616623385


   I have been thinking about place candidates of the interface between the 
native endian and a PARQUET little-endian. 
   
   One of the good candidates is `Serialize()` in `parquet/column_writer.cc`.  
Another candidate is `TypedBufferBuilder` in `arrow/buffer_builder.h`.
   
   Regarding `Serialize()`, this is because there is [a conversion 
loop](https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1781-1783)
 for Decimal128 that uses BigEndian. For big-endian, `Serialize()` of other 
primitive types including int96 needs to have such as conversion loop to 
little-endian. This is the first step.
   
   While the above approach leads to additional overhead, it would be good to 
have new methods
   `AppendLE` and `UnsafeAppendLE` in `TypedBufferBuilder` in addition to 
[`Append()` and 
`UnsafeAppend`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer_builder.h#L204-L240].
 These new method ensures to write typed data in little-endian.
   
   I think that we can support big-endian in Parquet using a two-step approach. 
What do you think?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616540298


   Sure, at first, I derived this layout from [this 
method](https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.h#L576).
 
   
   Other resources:
   https://github.com/apache/arrow/issues/1920#issuecomment-541764279
   
https://stackoverflow.com/questions/53103762/cast-int96-timestamp-from-parquet-to-golang/53104516#53104516



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616507292


   At first, this PR address only test cases. This PR does not address the 
routines (like `column_writer.h`) yet. To update test cases can clarify the 
specification and implementation like this.
   
   In this change, int96, which is a pair of uint64 and uint32, is stored in 
memory as follows:
   ```
   For little-endian
   Int96 Int96_min = {{1024, 2048, 4096}};
   00 04 00 00 00 08 00 00 00 10 00 00
   ~~~ ~~~
   uint64 (LE) uint32 (LE)
   
   For big-endian,
   Int96 Int96_min = {{2048, 1024, 4096}};
   00 00 08 00 00 00 04 00 00 00 10 00
   ~~~ ~~~
   uint64 (BE) uint32 (BE)
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616462606


   Thank you for your clarification. Based on this, `Int96` structure in memory 
can be represented in native endian. When it will be written into a file, we 
carefully have to keep it in a little-endian.
   
   I just realized it that int32 in [int64, int32] is still a little-endian. 
After one or two hours, I will update this PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on issue #6981: PARQUET-1845: [C++] Add expected results of Int96 in big-endian

2020-04-20 Thread GitBox


kiszk commented on issue #6981:
URL: https://github.com/apache/arrow/pull/6981#issuecomment-616453579


   I understand your point. First, I implemented the approach ` entirely 
little-endian`.
   
   Then, I reconsidered it. Each primitive type should be represented in a 
little-endian as shown 
[here](https://github.com/apache/parquet-format/blob/master/Encodings.md) in 
**Parquet serialized form (e.g. file)**.  For example, int32 is kept using a 
native endian in a memory or register. It will be converted from/to a 
little-endian when it will be serialized. I extended this to Int96. This is a 
background of the current decision.
   
   Both approaches can work functionally correct. I understand that this is an 
implementation decision.
   
   It is ok with your approach. At that time, I have one question. How will we 
handle other primitive types (e.g. int32 or double)? Do we keep other 
primitives in entirely little-endian on both platforms (little-endian and 
big-endian CPUs)?
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org