[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 @wgtmac , @omalley , thanks for the feedback. I think I have addressed all your points with last two commits, could you take another look? Thanks ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 I have been testing the patch from Hive and everything seems to be working as expected. I have rebased the patch and merge both commits. Also, I had to extend my changes to the newly created ```WriterImplV2```. @omalley , @wgtmac , could you take a final look and merge if it is OK? Thanks ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 I have just updated the patch now that we have moved to the new storage-api version. I will run some tests with Hive locally asap and will get back confirming that everything is working as expected. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/249 @jcamachor You are right. WriterOptions/WriterContext are ideal places to set this kind of values. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 Pushed a new commit with the changes. We would still need a storage-api release for the ```TimestampColumnVector``` changes. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 @wgtmac , thanks for the feedback. Please bear with me for a bit, as it is first time I am touching ORC code base. OK, I think ```TypeDescription``` is not a problem then since we set the value at reader / writer, independently of the default that we use at creation time. For reader, everything seems easy. However, for the writer, it is a bit trickier since the stripe footer stores the information about the time zone, hence it should be set beforehand using, e.g., the context or options objects. Does that seem reasonable? ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/249 For reader, can we set useUTCTImestamp in the function TimestampTreeReader::nextVector? For writer, it is caller's responsibility to set useUTCTImestamp before calling TimestampTreeWriter::writeBatch. Does this help? @jcamachor ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 @omalley , I have been trying to add the Boolean ```useUTCTimestamp``` as suggested. Making it work with the reader/writer does not seem to be a problem, since I can pass the information through the context. However, we also create column vectors in the ```TypeDescription``` class, where we do not seem to have any context information, just the type string representation. It seems that unless we pass the information through that representation, we cannot know the value for the boolean when we create the column over there, and I do not think we want to go in that direction. Any ideas? If we do not go in that direction, I thought that I can change current patch to use a ```boolean``` instead of the ```TimeZone``` itself (but without storing it). Please, let me know what you think. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/249 @jcamachor Yes I was just meaning ORC doesn't have problems in dealing with timestamps itself. Definitely using UTC everywhere makes things way easier. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 @omalley , it seems like a good idea, let me explore it and refresh the PR. I will adapt HIVE-19226 to these new changes too. @wgtmac , I understand you are suggesting that this can be fixed only from Hive side? Problem is that existing ORC files should still be read properly, hence you would need to recognize old vs new ORC files. In addition, you will apply displacement twice when reading/writing, in Hive and in ORC. It seems to me the cleaner solution is just being able to point to ORC that timestamp data is in UTC from Java reader/writer. FWIW, change to stringify in TimestampColumnVector is needed indeed. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/249 Also note that the C++ reader already uses UTC for its TimestampColumnVector. :) ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user omalley commented on the issue: https://github.com/apache/orc/pull/249 @jcamachor I'd suggest a much simpler API: - Instead of passing in the reader timezone, make a boolean option to useUtcForTimestamp. - Extend TimestampColumnVector to have a boolean isUTC field. - The TimestampTreeWriter can use the isUTC in the ColumnVector to determine if it is in UTC. - The reader can set isUTC appropriately based on the option. Thoughts? ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/249 AFAIK, I don't think ORC has any issue in HIVE-12192. What ORC guarantees is that we should always get same wall clock time representation w/o timezone. Current Java implementation leverages java.sql.Timestamp which uses local timezone and that's why writer and reader always use timestamp values in local timezone. Unless we add a new TimestampColumnVector which enforces UTC timezone to adopt your change here. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user jcamachor commented on the issue: https://github.com/apache/orc/pull/249 @wgtmac , see discussion in https://issues.apache.org/jira/browse/HIVE-12192 for more context. ---
[GitHub] orc issue #249: [ORC-341] Support time zone as a parameter for Java reader a...
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/249 Why do we need this change? ---