[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715651#comment-16715651 ]

ASF GitHub Bot commented on ORC-445:


wgtmac closed pull request #346: ORC-445: [C++] Code improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/RLEV2Util.cc b/c++/src/RLEV2Util.cc
index 53d18a0bd1..12e2d057cd 100644
--- a/c++/src/RLEV2Util.cc
+++ b/c++/src/RLEV2Util.cc
@@ -21,9 +21,50 @@
 namespace orc {
 
   // Map FBS enum to bit width value.
-  const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
+  const uint8_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
     1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
     26, 28, 30, 32, 40, 48, 56, 64
   };
 
+  // Map bit length i to closest fixed bit width that can contain i bits.
+  const uint8_t ClosestFixedBitsMap[65] = {
+    1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
+    26, 26, 28, 28, 30, 30, 32, 32,
+    40, 40, 40, 40, 40, 40, 40, 40,
+    48, 48, 48, 48, 48, 48, 48, 48,
+    56, 56, 56, 56, 56, 56, 56, 56,
+    64, 64, 64, 64, 64, 64, 64, 64
+  };
+
+  // Map bit length i to closest aligned fixed bit width that can contain i bits.
+  const uint8_t ClosestAlignedFixedBitsMap[65] = {
+    1, 1, 2, 4, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24, 24, 24,
+    32, 32, 32, 32, 32, 32, 32, 32,
+    40, 40, 40, 40, 40, 40, 40, 40,
+    48, 48, 48, 48, 48, 48, 48, 48,
+    56, 56, 56, 56, 56, 56, 56, 56,
+    64, 64, 64, 64, 64, 64, 64, 64
+  };
+
+  // Map bit width to FBS enum.
+  const uint8_t BitWidthToFBSMap[65] = {
+    FixedBitSizes::ONE, FixedBitSizes::ONE, FixedBitSizes::TWO, FixedBitSizes::THREE, FixedBitSizes::FOUR,
+    FixedBitSizes::FIVE, FixedBitSizes::SIX, FixedBitSizes::SEVEN, FixedBitSizes::EIGHT,
+    FixedBitSizes::NINE, FixedBitSizes::TEN, FixedBitSizes::ELEVEN, FixedBitSizes::TWELVE,
+    FixedBitSizes::THIRTEEN, FixedBitSizes::FOURTEEN, FixedBitSizes::FIFTEEN, FixedBitSizes::SIXTEEN,
+    FixedBitSizes::SEVENTEEN, FixedBitSizes::EIGHTEEN, FixedBitSizes::NINETEEN, FixedBitSizes::TWENTY,
+    FixedBitSizes::TWENTYONE, FixedBitSizes::TWENTYTWO, FixedBitSizes::TWENTYTHREE, FixedBitSizes::TWENTYFOUR,
+    FixedBitSizes::TWENTYSIX, FixedBitSizes::TWENTYSIX,
+    FixedBitSizes::TWENTYEIGHT, FixedBitSizes::TWENTYEIGHT,
+    FixedBitSizes::THIRTY, FixedBitSizes::THIRTY,
+    FixedBitSizes::THIRTYTWO, FixedBitSizes::THIRTYTWO,
+    FixedBitSizes::FORTY, FixedBitSizes::FORTY, FixedBitSizes::FORTY, FixedBitSizes::FORTY,
+    FixedBitSizes::FORTY, FixedBitSizes::FORTY, FixedBitSizes::FORTY, FixedBitSizes::FORTY,
+    FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT,
+    FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT, FixedBitSizes::FORTYEIGHT,
+    FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX,
+    FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX, FixedBitSizes::FIFTYSIX,
+    FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR,
+    FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR, FixedBitSizes::SIXTYFOUR
+  };
 }
diff --git a/c++/src/RLEV2Util.hh b/c++/src/RLEV2Util.hh
index 794d5f62ab..95a6826eaa 100644
--- a/c++/src/RLEV2Util.hh
+++ b/c++/src/RLEV2Util.hh
@@ -22,83 +22,35 @@
 #include "RLEv2.hh"
 
 namespace orc {
-  extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint8_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint8_t ClosestFixedBitsMap[65];
+  extern const uint8_t ClosestAlignedFixedBitsMap[65];
+  extern const uint8_t BitWidthToFBSMap[65];
 
+  // The input n must be less than FixedBitSizes::SIZE.
   inline uint32_t decodeBitWidth(uint32_t n) {
     return FBSToBitWidthMap[n];
   }
 
   inline uint32_t getClosestFixedBits(uint32_t n) {
-    if (n == 0) {
-      return 1;
-    }
-
-    if (n >= 1 && n <= 24) {
-      return n;
-    } else if (n <= 26) {
-      return 26;
-    } else if (n <= 28) {
-      return 28;
-    } else if (n <= 30) {
-      return 30;
-    } else if (n <= 32) {
-      return 32;
-    } else if (n <= 40) {
-      return 40;
-    } else if (n <= 48) {
-      return 48;
-    } else if (n <= 56) {
-      return 56;
+    if (n <= 64) {
+      return ClosestFixedBitsMap[n];
     } else {
       return 64;
     }
   }
 
   inline uint32_t 

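For readers skimming the merged diff above, a minimal standalone sketch of the
lookup pattern it introduces. The table contents are copied from the arrays
above; the kFBSToBitWidth name and the main() driver are illustrative, not
part of the ORC sources:

    #include <cstdint>
    #include <iostream>

    // FBS enum value -> bit width, same contents as FBSToBitWidthMap above.
    static const uint8_t kFBSToBitWidth[32] = {
      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 26, 28, 30, 32, 40, 48, 56, 64
    };

    // Caller must guarantee n < 32 (a 5-bit width code), per the new comment.
    inline uint32_t decodeBitWidth(uint32_t n) { return kFBSToBitWidth[n]; }

    int main() {
      std::cout << decodeBitWidth(23) << "\n";  // 24: codes 0..23 map to n + 1
      std::cout << decodeBitWidth(27) << "\n";  // 32: codes 24..31 are sparse
    }
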
[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-05 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710741#comment-16710741 ]

ASF GitHub Bot commented on ORC-445:


fangzheng commented on issue #346: ORC-445: [C++] Code improvements in 
RLEV2Util.
URL: https://github.com/apache/orc/pull/346#issuecomment-444690006
 
 
   @wgtmac Hi Gang, 
   I put together a test program and some measurements here: 
https://github.com/fangzheng/RLEV2Util_performance
   
   The new functions are 20 to 40% faster than the original ones. 
   
   I've also committed some new changes:
   1. change the lookup arrays from uint32_t to uint8_t to reduce their sizes 
and get better cache behavior.
   2. add a unit test TestRLEV2Util.cc to verify that the new implementation 
produces the same results as the original one.
   3. simplify the code in encodeBitWidth() to avoid calling 
getClosestFixedBits().
   
   Please let me know if you have further suggestions. Thanks.
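   A sketch in the spirit of the equivalence test mentioned in point 2; both
   versions are reproduced locally for illustration, and this is not the
   actual TestRLEV2Util.cc:

    #include <cassert>
    #include <cstdint>

    // The original if-else ladder from RLEV2Util.hh.
    static uint32_t closestFixedBitsBranchy(uint32_t n) {
      if (n == 0) return 1;
      if (n <= 24) return n;
      if (n <= 26) return 26;
      if (n <= 28) return 28;
      if (n <= 30) return 30;
      if (n <= 32) return 32;
      if (n <= 40) return 40;
      if (n <= 48) return 48;
      if (n <= 56) return 56;
      return 64;
    }

    // The replacement table, as in the merged RLEV2Util.cc.
    static const uint8_t kClosestFixedBitsMap[65] = {
      1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
      20, 21, 22, 23, 24,
      26, 26, 28, 28, 30, 30, 32, 32,
      40, 40, 40, 40, 40, 40, 40, 40,
      48, 48, 48, 48, 48, 48, 48, 48,
      56, 56, 56, 56, 56, 56, 56, 56,
      64, 64, 64, 64, 64, 64, 64, 64
    };

    static uint32_t closestFixedBitsTable(uint32_t n) {
      return n <= 64 ? kClosestFixedBitsMap[n] : 64;  // guard out-of-range n
    }

    int main() {
      // Exhaustively compare the two versions, including inputs above 64.
      for (uint32_t n = 0; n <= 128; ++n) {
        assert(closestFixedBitsBranchy(n) == closestFixedBitsTable(n));
      }
      return 0;
    }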
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-05 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710732#comment-16710732 ]

ASF GitHub Bot commented on ORC-445:


fangzheng commented on a change in pull request #346: ORC-445: [C++] Code 
improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346#discussion_r239274880
 
 

 ##
 File path: c++/src/RLEV2Util.hh
 ##
 @@ -23,85 +23,32 @@
 
 namespace orc {
   extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint32_t ClosestFixedBitsMap[65];
+  extern const uint32_t ClosestAlignedFixedBitsMap[65];
+  extern const uint32_t BitWidthToFBSMap[65];
 
   inline uint32_t decodeBitWidth(uint32_t n) {
 return FBSToBitWidthMap[n];
 
 Review comment:
   Hi Gang, I thought this over and don't feel we need to add a boundary check 
here, for two reasons:
   1. In the current RLEv2 encoder and decoder code, all the callers of 
decodeBitWidth() pass in an integer that only has the lowest 5 bits set, so the 
input n is always less than FixedBitSizes::SIZE (32).
   2. Since this function is used to decode the 5-bit length code, it would be 
a programming error for a caller to pass in any value that is >= 32. In this 
case, returning 64 (as the original implementation does) is not helping. That 
would just silently mask the error (potentially file content corruption) and is 
likely to cause further errors when decoding with the wrong bit width.
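   A sketch of the call-site pattern behind point 1, assuming the RLEv2 header
   layout from the ORC spec (DIRECT, PATCHED_BASE and DELTA carry the width
   code in a 5-bit field of the first header byte); the function name is
   illustrative, not the ORC decoder's:

    #include <cstdint>

    // Masking the 5-bit field keeps the lookup index in [0, 31] by
    // construction, so no bounds check is needed inside decodeBitWidth().
    inline uint32_t widthCodeFromHeader(uint8_t firstByte) {
      return (firstByte >> 1) & 0x1F;
    }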



This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-05 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710466#comment-16710466 ]

ASF GitHub Bot commented on ORC-445:


fangzheng commented on issue #346: ORC-445: [C++] Code improvements in 
RLEV2Util.
URL: https://github.com/apache/orc/pull/346#issuecomment-444594456
 
 
   @wgtmac Hi Gang, thanks for the comments. I'll add performance measurements 
later this week.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-04 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709464#comment-16709464 ]

ASF GitHub Bot commented on ORC-445:


wgtmac commented on a change in pull request #346: ORC-445: [C++] Code 
improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346#discussion_r238895485
 
 

 ##
 File path: c++/src/RLEV2Util.hh
 ##
 @@ -23,85 +23,32 @@
 
 namespace orc {
   extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint32_t ClosestFixedBitsMap[65];
+  extern const uint32_t ClosestAlignedFixedBitsMap[65];
+  extern const uint32_t BitWidthToFBSMap[65];
 
   inline uint32_t decodeBitWidth(uint32_t n) {
 return FBSToBitWidthMap[n];
 
 Review comment:
   Make sure this won't be out of bounds.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-04 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709119#comment-16709119 ]

ASF GitHub Bot commented on ORC-445:


wgtmac commented on issue #346: ORC-445: [C++] Code improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346#issuecomment-444218437
 
 
   Can you please provide some perf data to confirm the refactoring is 
effective?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-04 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709049#comment-16709049 ]

ASF GitHub Bot commented on ORC-445:


fangzheng commented on a change in pull request #346: ORC-445: [C++] Code 
improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346#discussion_r238772598
 
 

 ##
 File path: c++/src/RLEV2Util.hh
 ##
 @@ -23,85 +23,24 @@
 
 namespace orc {
   extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint32_t ClosestFixedBitsMap[65];
+  extern const uint32_t ClosestAlignedFixedBitsMap[65];
+  extern const uint32_t BitWidthToFBSMap[65];
 
   inline uint32_t decodeBitWidth(uint32_t n) {
 return FBSToBitWidthMap[n];
   }
 
   inline uint32_t getClosestFixedBits(uint32_t n) {
-    if (n == 0) {
-      return 1;
-    }
-
-    if (n >= 1 && n <= 24) {
-      return n;
-    } else if (n <= 26) {
-      return 26;
-    } else if (n <= 28) {
-      return 28;
-    } else if (n <= 30) {
-      return 30;
-    } else if (n <= 32) {
-      return 32;
-    } else if (n <= 40) {
-      return 40;
-    } else if (n <= 48) {
-      return 48;
-    } else if (n <= 56) {
-      return 56;
-    } else {
-      return 64;
-    }
+    return ClosestFixedBitsMap[n];
 
 Review comment:
   @xndai Hi Xiening, thanks for the catch. When computing the patch width, it 
is possible for the input n to be larger than 64. I have added a boundary check 
in this function and in getClosestAlignedFixedBits(). 
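   A sketch of the guard described here, with a local copy of the aligned
   table from the merged diff; names are illustrative rather than the ORC
   source:

    #include <cstdint>

    // Local copy of ClosestAlignedFixedBitsMap from the merged RLEV2Util.cc.
    static const uint8_t kAligned[65] = {
      1, 1, 2, 4, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16,
      24, 24, 24, 24, 24, 24, 24, 24,
      32, 32, 32, 32, 32, 32, 32, 32,
      40, 40, 40, 40, 40, 40, 40, 40,
      48, 48, 48, 48, 48, 48, 48, 48,
      56, 56, 56, 56, 56, 56, 56, 56,
      64, 64, 64, 64, 64, 64, 64, 64
    };

    inline uint32_t getClosestAlignedFixedBits(uint32_t n) {
      // Patch widths can exceed 64, so clamp rather than index out of bounds.
      return n <= 64 ? kAligned[n] : 64;
    }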


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-04 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709018#comment-16709018 ]

ASF GitHub Bot commented on ORC-445:


xndai commented on a change in pull request #346: ORC-445: [C++] Code 
improvements in RLEV2Util.
URL: https://github.com/apache/orc/pull/346#discussion_r238764472
 
 

 ##
 File path: c++/src/RLEV2Util.hh
 ##
 @@ -23,85 +23,24 @@
 
 namespace orc {
   extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+  extern const uint32_t ClosestFixedBitsMap[65];
+  extern const uint32_t ClosestAlignedFixedBitsMap[65];
+  extern const uint32_t BitWidthToFBSMap[65];
 
   inline uint32_t decodeBitWidth(uint32_t n) {
 return FBSToBitWidthMap[n];
   }
 
   inline uint32_t getClosestFixedBits(uint32_t n) {
-    if (n == 0) {
-      return 1;
-    }
-
-    if (n >= 1 && n <= 24) {
-      return n;
-    } else if (n <= 26) {
-      return 26;
-    } else if (n <= 28) {
-      return 28;
-    } else if (n <= 30) {
-      return 30;
-    } else if (n <= 32) {
-      return 32;
-    } else if (n <= 40) {
-      return 40;
-    } else if (n <= 48) {
-      return 48;
-    } else if (n <= 56) {
-      return 56;
-    } else {
-      return 64;
-    }
+    return ClosestFixedBitsMap[n];
 
 Review comment:
   what if n > 64?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-445) [C++] Code improvements in RLEV2Util

2018-12-03 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708296#comment-16708296 ]

ASF GitHub Bot commented on ORC-445:


fangzheng opened a new pull request #346: ORC-445: [C++] Code improvements in 
RLEV2Util.
URL: https://github.com/apache/orc/pull/346
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in RLEV2Util
> 
>
> Key: ORC-445
> URL: https://issues.apache.org/jira/browse/ORC-445
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> This is a follow-up of ORC-444. The following functions in RLEV2Util.hh can 
> be optimized by replacing the if-else statements with direct array lookup:
>   inline uint32_t getClosestFixedBits(uint32_t n);
>   inline uint32_t getClosestAlignedFixedBits(uint32_t n);
>   inline uint32_t encodeBitWidth(uint32_t n);
>   inline uint32_t findClosestNumBits(int64_t value);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-444) Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.

2018-12-03 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707987#comment-16707987 ]

ASF GitHub Bot commented on ORC-444:


wgtmac closed pull request #345: ORC-444: Fix errors in RLE section in ORC spec 
and improve RLEV2 encoder code.
URL: https://github.com/apache/orc/pull/345
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/CMakeLists.txt b/c++/src/CMakeLists.txt
index 72d408d9b0..235ced856a 100644
--- a/c++/src/CMakeLists.txt
+++ b/c++/src/CMakeLists.txt
@@ -199,6 +199,7 @@ set(SOURCE_FILES
   OrcFile.cc
   Reader.cc
   RLEv1.cc
+  RLEV2Util.cc
   RleDecoderV2.cc
   RleEncoderV2.cc
   RLE.cc
diff --git a/c++/src/RLEV2Util.cc b/c++/src/RLEV2Util.cc
new file mode 100644
index 00..53d18a0bd1
--- /dev/null
+++ b/c++/src/RLEV2Util.cc
@@ -0,0 +1,29 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "RLEV2Util.hh"
+
+namespace orc {
+
+  // Map FBS enum to bit width value.
+  const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
+    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
+    26, 28, 30, 32, 40, 48, 56, 64
+  };
+
+}
diff --git a/c++/src/RLEV2Util.hh b/c++/src/RLEV2Util.hh
index a7bc5537ab..794d5f62ab 100644
--- a/c++/src/RLEV2Util.hh
+++ b/c++/src/RLEV2Util.hh
@@ -22,26 +22,10 @@
 #include "RLEv2.hh"
 
 namespace orc {
+  extern const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE];
+
   inline uint32_t decodeBitWidth(uint32_t n) {
-    if (n <= FixedBitSizes::TWENTYFOUR) {
-      return n + 1;
-    } else if (n == FixedBitSizes::TWENTYSIX) {
-      return 26;
-    } else if (n == FixedBitSizes::TWENTYEIGHT) {
-      return 28;
-    } else if (n == FixedBitSizes::THIRTY) {
-      return 30;
-    } else if (n == FixedBitSizes::THIRTYTWO) {
-      return 32;
-    } else if (n == FixedBitSizes::FORTY) {
-      return 40;
-    } else if (n == FixedBitSizes::FORTYEIGHT) {
-      return 48;
-    } else if (n == FixedBitSizes::FIFTYSIX) {
-      return 56;
-    } else {
-      return 64;
-    }
+    return FBSToBitWidthMap[n];
   }
 
   inline uint32_t getClosestFixedBits(uint32_t n) {
diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 613298a086..336896ed77 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -438,7 +438,7 @@ values.
 * Run - a sequence of at least 3 identical values
 * Literals - a sequence of non-identical values
 
-The first byte of each group of values is a header than determines
+The first byte of each group of values is a header that determines
 whether it is a run (value between 0 to 127) or literal list (value
 between -128 to -1). For runs, the control byte is the length of the
 run minus the length of the minimal run (3) and the control byte for
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index 78350800e7..b799adc6bd 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -444,7 +444,7 @@ values.
 * Run - a sequence of at least 3 identical values
 * Literals - a sequence of non-identical values
 
-The first byte of each group of values is a header than determines
+The first byte of each group of values is a header that determines
 whether it is a run (value between 0 to 127) or literal list (value
 between -128 to -1). For runs, the control byte is the length of the
 run minus the length of the minimal run (3) and the control byte for
@@ -622,7 +622,7 @@ if the series is increasing or decreasing.
   * 9 bits for run length (L) (1 to 512 values)
 * Base value - encoded as (signed or unsigned) varint
 * Delta base - encoded as signed varint
-* Delta values $W * (L - 2)$ bytes - encode each delta after the first
+* Delta values (W * (L - 2)) bytes - encode each delta after the first
   one. If the delta base is positive, the sequence is increasing and if it is
   negative the sequence is decreasing.
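A quick worked reading of the corrected formula: for a delta run of L = 10
values, the base value and the delta base cover the first two values, so only
the remaining L - 2 = 8 deltas are stored, taking W * (L - 2) = 8W bytes at a
delta width of W bytes.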
 

[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-12-03 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707953#comment-16707953 ]

ASF GitHub Bot commented on ORC-363:


wgtmac commented on issue #306: ORC-363: Enable zstd for java writer/reader
URL: https://github.com/apache/orc/pull/306#issuecomment-443905274
 
 
   @omalley Can you review this again when you have time? Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable zstd decompression in ORC Java reader
> 
>
> Key: ORC-363
> URL: https://issues.apache.org/jira/browse/ORC-363
> Project: ORC
>  Issue Type: Bug
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
>
> Update to the aircompressor lib 0.11 and enable zstd decompression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-444) Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.

2018-12-03 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707960#comment-16707960 ]

ASF GitHub Bot commented on ORC-444:


wgtmac commented on a change in pull request #345: ORC-444: Fix errors in RLE 
section in ORC spec and improve RLEV2 encoder code.
URL: https://github.com/apache/orc/pull/345#discussion_r238474364
 
 

 ##
 File path: c++/src/RLEV2Util.cc
 ##
 @@ -0,0 +1,29 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "RLEV2Util.hh"
+
+namespace orc {
+
+  // Map FBS enum to bit width value.
+  const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
 
 Review comment:
   After double-checking the definition of FixedBitSizes::FBS, I think you are 
right.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.
> -
>
> Key: ORC-444
> URL: https://issues.apache.org/jira/browse/ORC-444
> Project: ORC
>  Issue Type: Improvement
>  Components: C++, documentation
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fixed a few typos and format errors in the RLE section in ORC 
> specification documentation.
> 2. Improve decodeBitWidth() function in RLEV2Util.hh.
> Please see PR for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-12-03 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707641#comment-16707641 ]

ASF GitHub Bot commented on ORC-443:


wgtmac closed pull request #344: ORC-443: [C++] Code improvements in 
ColumnWriter.
URL: https://github.com/apache/orc/pull/344
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/ByteRLE.cc b/c++/src/ByteRLE.cc
index 2664019f67..4bf8b7a598 100644
--- a/c++/src/ByteRLE.cc
+++ b/c++/src/ByteRLE.cc
@@ -40,7 +40,7 @@ namespace orc {
 virtual ~ByteRleEncoderImpl() override;
 
 /**
- * Encode the next batch of values
+ * Encode the next batch of values.
  * @param data to be encoded
  * @param numValues the number of values to be encoded
  * @param notNull If the pointer is null, all values are read. If the
@@ -55,8 +55,8 @@ namespace orc {
 virtual uint64_t getBufferSize() const override;
 
 /**
- * Flushing underlying BufferedOutputStream
-*/
+ * Flush underlying BufferedOutputStream.
+ */
 virtual uint64_t flush() override;
 
 virtual void recordPosition(PositionRecorder* recorder) const override;
@@ -122,7 +122,7 @@ namespace orc {
       writeByte(
         static_cast<char>(numLiterals - static_cast<int>(MINIMUM_REPEAT)));
       writeByte(literals[0]);
- } else {
+      } else {
       writeByte(static_cast<char>(-numLiterals));
       for (int i = 0; i < numLiterals; ++i) {
         writeByte(literals[i]);
diff --git a/c++/src/ColumnWriter.cc b/c++/src/ColumnWriter.cc
index c6c30f3833..ef31c45cbd 100644
--- a/c++/src/ColumnWriter.cc
+++ b/c++/src/ColumnWriter.cc
@@ -274,13 +274,13 @@ namespace orc {
ColumnVectorBatch& rowBatch,
uint64_t offset,
uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
     const StructVectorBatch* structBatch =
       dynamic_cast<const StructVectorBatch*>(&rowBatch);
 if (structBatch == nullptr) {
   throw InvalidArgument("Failed to cast to StructVectorBatch");
 }
 
+ColumnWriter::add(rowBatch, offset, numValues);
 for (uint32_t i = 0; i < children.size(); ++i) {
   children[i]->add(*structBatch->fields[i], offset, numValues);
 }
@@ -289,15 +289,17 @@ namespace orc {
 if (!structBatch->hasNulls) {
   colIndexStatistics->increase(numValues);
 } else {
+  uint64_t count = 0;
   bool hasNull = false;
   const char* notNull = structBatch->notNull.data() + offset;
   for (uint64_t i = 0; i < numValues; ++i) {
 if (notNull[i]) {
-  colIndexStatistics->increase(1);
+  ++count;
 } else if (!hasNull) {
   hasNull = true;
 }
   }
+  colIndexStatistics->increase(count);
   if (hasNull) {
 colIndexStatistics->setHasNull(true);
   }
@@ -445,13 +447,18 @@ namespace orc {
 ColumnVectorBatch& rowBatch,
 uint64_t offset,
 uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
-
     const LongVectorBatch* longBatch =
       dynamic_cast<const LongVectorBatch*>(&rowBatch);
 if (longBatch == nullptr) {
   throw InvalidArgument("Failed to cast to LongVectorBatch");
 }
+    IntegerColumnStatisticsImpl* intStats =
+        dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
+if (intStats == nullptr) {
+  throw InvalidArgument("Failed to cast to IntegerColumnStatisticsImpl");
+}
+
+ColumnWriter::add(rowBatch, offset, numValues);
 
 const int64_t* data = longBatch->data.data() + offset;
 const char* notNull = longBatch->hasNulls ?
@@ -460,21 +467,17 @@ namespace orc {
 rleEncoder->add(data, numValues, notNull);
 
 // update stats
-    IntegerColumnStatisticsImpl* intStats =
-      dynamic_cast<IntegerColumnStatisticsImpl*>(colIndexStatistics.get());
-if (intStats == nullptr) {
-  throw InvalidArgument("Failed to cast to IntegerColumnStatisticsImpl");
-}
-
+uint64_t count = 0;
 bool hasNull = false;
 for (uint64_t i = 0; i < numValues; ++i) {
   if (notNull == nullptr || notNull[i]) {
-intStats->increase(1);
+++count;
 intStats->update(data[i], 1);
   } else if (!hasNull) {
 hasNull = true;
   }
 }
+intStats->increase(count);
 if (hasNull) {
   intStats->setHasNull(true);
 }
@@ -549,12 +552,17 @@ namespace orc {
   void ByteColumnWriter::add(ColumnVectorBatch& rowBatch,
  uint64_t offset,
  uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
-
     LongVectorBatch* byteBatch = dynamic_cast<LongVectorBatch*>(&rowBatch);
 if (byteBatch == 

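A minimal sketch of the batching pattern these hunks apply; the Stats and
addValues names are simplifications for illustration, not the ORC classes:

    #include <cstdint>

    struct Stats {
      virtual void increase(uint64_t count) = 0;  // virtual dispatch per call
      virtual ~Stats() = default;
    };

    void addValues(Stats& stats, const char* notNull, uint64_t numValues) {
      uint64_t count = 0;
      for (uint64_t i = 0; i < numValues; ++i) {
        if (notNull == nullptr || notNull[i]) {
          ++count;               // cheap local increment inside the loop
        }
      }
      stats.increase(count);     // one virtual call per batch, not per value
    }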
[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-12-02 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706527#comment-16706527 ]

ASF GitHub Bot commented on ORC-443:


fangzheng commented on issue #344: ORC-443: [C++] Code improvements in 
ColumnWriter.
URL: https://github.com/apache/orc/pull/344#issuecomment-443556429
 
 
   @majetideepak Hi Deepak, the conflicts have been resolved. Thanks.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Assignee: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. in add() function, re-order code to verify input parameters before 
> modifying any internal state.
> 2. in add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop. Many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-12-02 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706509#comment-16706509 ]

ASF GitHub Bot commented on ORC-443:


majetideepak commented on issue #344: ORC-443: [C++] Code improvements in 
ColumnWriter.
URL: https://github.com/apache/orc/pull/344#issuecomment-443551834
 
 
   @fangzheng can you please resolve the conflict? Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Assignee: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. in add() function, re-order code to verify input parameters before 
> modifying any internal state.
> 2. in add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop. Many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-442) [C++] Code improvements in Statistics and Writer

2018-12-02 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706507#comment-16706507 ]

ASF GitHub Bot commented on ORC-442:


majetideepak closed pull request #343: ORC-442: [C++] Code improvements in 
Statistics and Writer.
URL: https://github.com/apache/orc/pull/343
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/include/orc/Statistics.hh b/c++/include/orc/Statistics.hh
index 5b894789d8..654956d860 100644
--- a/c++/include/orc/Statistics.hh
+++ b/c++/include/orc/Statistics.hh
@@ -34,7 +34,7 @@ namespace orc {
 
 /**
  * Get the number of values in this column. It will differ from the number
- * of rows because of NULL values and repeated values.
+ * of rows because of NULL values.
  * @return the number of values
  */
 virtual uint64_t getNumberOfValues() const = 0;
diff --git a/c++/src/ColumnWriter.hh b/c++/src/ColumnWriter.hh
index 14fc54a295..cee4e01318 100644
--- a/c++/src/ColumnWriter.hh
+++ b/c++/src/ColumnWriter.hh
@@ -37,7 +37,7 @@ namespace orc {
 /**
  * Get the stream for the given column/kind in this stripe.
  * @param kind the kind of the stream
- * @return the buffer output stream
+ * @return the buffered output stream
  */
     virtual std::unique_ptr<BufferedOutputStream>
     createStream(proto::Stream_Kind kind) const = 0;
@@ -98,42 +98,45 @@ namespace orc {
  uint64_t offset,
  uint64_t numValues);
 /**
- * Flush column writer output steams
- * @param streams vector to store generated stream by flush()
+ * Flush column writer output streams.
+ * @param streams vector to store streams generated by flush()
  */
     virtual void flush(std::vector<proto::Stream>& streams);
 
 /**
- * Get estimated sized of buffer used
+ * Get estimated size of buffer used.
+ * @return estimated size of buffer used
  */
 virtual uint64_t getEstimatedSize() const;
 
 /**
  * Get the encoding used by the writer for this column.
- * ColumnEncoding info is pushed into the vector
+ * @param encodings vector to store the returned ColumnEncoding info
  */
     virtual void getColumnEncoding(
       std::vector<proto::ColumnEncoding>& encodings) const = 0;
 
 /**
- * Get the stripe statistics for this column
+ * Get the stripe statistics for this column.
+ * @param stats vector to store the returned stripe statistics
  */
     virtual void getStripeStatistics(
       std::vector<proto::ColumnStatistics>& stats) const;
 
 /**
- * Get the file statistics for this column
+ * Get the file statistics for this column.
+ * @param stats vector to store the returned file statistics
  */
     virtual void getFileStatistics(
       std::vector<proto::ColumnStatistics>& stats) const;
 
 /**
- * Merge index stats into stripe stats and reset index stats
+ * Merge index stats into stripe stats and reset index stats.
  */
 virtual void mergeRowGroupStatsIntoStripeStats();
 
 /**
- * Merge stripe stats into file stats and reset stripe stats
+ * Merge stripe stats into file stats and reset stripe stats.
  */
 virtual void mergeStripeStatsIntoFileStats();
 
@@ -146,29 +149,29 @@ namespace orc {
 virtual void createRowIndexEntry();
 
 /**
- * Write row index streams for this column
+ * Write row index streams for this column.
  * @param streams output list of ROW_INDEX streams
  */
     virtual void writeIndex(std::vector<proto::Stream>& streams) const;
 
 /**
- * Record positions for index
+ * Record positions for index.
  *
- * This function is called by createRowIndexEntry() and ColumnWrtier's
+ * This function is called by createRowIndexEntry() and ColumnWriter's
  * constructor. So base classes do not need to call inherited classes'
  * recordPosition() function.
  */
 virtual void recordPosition() const;
 
 /**
- * Reset positions for index
+ * Reset positions for index.
  */
 virtual void reset();
 
   protected:
 /**
  * Utility function to translate ColumnStatistics into protobuf form and
- * add it to output list
+ * add it to output list.
  * @param statsList output list for protobuf stats
  * @param stats ColumnStatistics to be transformed and added
  */
diff --git a/c++/src/Compression.hh b/c++/src/Compression.hh
index 18f7cfd5ee..ff79377d83 100644
--- a/c++/src/Compression.hh
+++ b/c++/src/Compression.hh
@@ -43,7 +43,7 @@ namespace orc {
* @param outStream the output stream that is the underlying target
* @param strategy compression strategy
* @param bufferCapacity compression stream buffer total capacity
-   * @param 

[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-12-02 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706502#comment-16706502 ]

ASF GitHub Bot commented on ORC-420:


majetideepak closed pull request #326: ORC-420: [C++] Implement string 
dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/include/orc/Writer.hh b/c++/include/orc/Writer.hh
index 1284c65d87..01e0128638 100644
--- a/c++/include/orc/Writer.hh
+++ b/c++/include/orc/Writer.hh
@@ -185,6 +185,12 @@ namespace orc {
  * @return if not set, the default is false
  */
 bool getEnableIndex() const;
+
+/**
+ * Get whether or not to enable dictionary encoding
+ * @return if not set, the default is false
+ */
+bool getEnableDictionary() const;
   };
 
   class Writer {
diff --git a/c++/src/ColumnWriter.cc b/c++/src/ColumnWriter.cc
index 9b128e3940..c6c30f3833 100644
--- a/c++/src/ColumnWriter.cc
+++ b/c++/src/ColumnWriter.cc
@@ -204,6 +204,10 @@ namespace orc {
 }
   }
 
+  void ColumnWriter::writeDictionary() {
+// PASS
+  }
+
   class StructColumnWriter : public ColumnWriter {
   public:
 StructColumnWriter(
@@ -237,6 +241,8 @@ namespace orc {
+    virtual void writeIndex(
+      std::vector<proto::Stream>& streams) const override;
 
+virtual void writeDictionary() override;
+
 virtual void reset() override;
 
   private:
@@ -382,6 +388,12 @@ namespace orc {
 }
   }
 
+  void StructColumnWriter::writeDictionary() {
+for (uint32_t i = 0; i < children.size(); ++i) {
+  children[i]->writeDictionary();
+}
+  }
+
   class IntegerColumnWriter : public ColumnWriter {
   public:
 IntegerColumnWriter(
@@ -831,6 +843,137 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+    void getEntriesInInsertionOrder(std::vector<const DictEntry*>&) const;
+
+// return count of entries
+size_t size() const;
+
+    // return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data,
+                         std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
+uint64_t totalLength;
+
+// use friend class here to avoid being bothered by const function calls
+friend class StringColumnWriter;
+friend class CharColumnWriter;
+friend class VarCharColumnWriter;
+// store indexes of insertion order in the dictionary for not-null rows
+    std::vector<int64_t> idxInDictBuffer;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+    auto ret = dict.insert({DictEntry(str, len), dict.size()});
+    if (ret.second) {
+      // make a copy to internal storage
+      data.push_back(std::vector<char>(len));
+      memcpy(data.back().data(), str, len);
+      // update dictionary entry to link pointer to internal storage
+      DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+      entry->data = data.back().data();
+      totalLength += len;
+    }
+    return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+    for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+      dataStream->write(it->first.data, it->first.length);
+      lengthEncoder->write(static_cast<int64_t>(it->first.length));
+    }
+  }
+
+  /**
+   * Reorder input index buffer from insertion order to dictionary order
+   *
+   * We require this function because string values are buffered by indexes
+   * in their insertion order. 

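A tiny standalone model of the insert() contract shown above, using an ordered
map from string to first-seen index; this is an illustration, not the ORC
implementation:

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <string>

    size_t insertOrder(std::map<std::string, size_t>& dict,
                       const std::string& s) {
      // insert() only succeeds for new keys; an existing key keeps the
      // insertion index it was first given.
      return dict.insert({s, dict.size()}).first->second;
    }

    int main() {
      std::map<std::string, size_t> dict;
      std::cout << insertOrder(dict, "apple") << "\n";   // 0
      std::cout << insertOrder(dict, "banana") << "\n";  // 1
      std::cout << insertOrder(dict, "apple") << "\n";   // 0, already present
      // Like StringDictionary, the map stays sorted by key while the stored
      // indexes remember insertion order for later reordering.
    }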
[jira] [Commented] (ORC-444) Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705595#comment-16705595 ]

ASF GitHub Bot commented on ORC-444:


fangzheng commented on a change in pull request #345: ORC-444: Fix errors in 
RLE section in ORC spec and improve RLEV2 encoder code.
URL: https://github.com/apache/orc/pull/345#discussion_r238050318
 
 

 ##
 File path: c++/src/RLEV2Util.cc
 ##
 @@ -0,0 +1,29 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "RLEV2Util.hh"
+
+namespace orc {
+
+  // Map FBS enum to bit width value.
+  const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
 
 Review comment:
   Hi Gang, 
   I checked this again and think the code is correct. Please see the mapping 
below:
   
   enum value  | int value | bit width
   ONE         | 0         | 1
   TWO         | 1         | 2
   ...
   TWENTYFOUR  | 23        | 24
   TWENTYSIX   | 24        | 26
   TWENTYEIGHT | 25        | 28
   THIRTY      | 26        | 30
   THIRTYTWO   | 27        | 32
   ...
   SIXTYFOUR   | 31        | 64
   
   So, for an enum value e that is n as an int, its bit width is indeed 
FBSToBitWidthMap[n]. For example, TWENTYFOUR is 23 as an int, and 
FBSToBitWidthMap[23] is 24. THIRTYTWO is 27 as an int, and FBSToBitWidthMap[27] 
is 32.
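   A quick spot check of this mapping, assuming the enum numeric values in the
   table above; a sketch, not an ORC test:

    #include <cassert>
    #include <cstdint>

    static const uint8_t kFBSToBitWidth[32] = {
      1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
      21, 22, 23, 24, 26, 28, 30, 32, 40, 48, 56, 64
    };

    int main() {
      assert(kFBSToBitWidth[23] == 24);  // TWENTYFOUR == 23 as an int
      assert(kFBSToBitWidth[27] == 32);  // THIRTYTWO == 27 as an int
      return 0;
    }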


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.
> -
>
> Key: ORC-444
> URL: https://issues.apache.org/jira/browse/ORC-444
> Project: ORC
>  Issue Type: Improvement
>  Components: C++, documentation
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fixed a few typos and format errors in the RLE section in ORC 
> specification documentation.
> 2. Improve decodeBitWidth() function in RLEV2Util.hh.
> Please see PR for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-444) Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705556#comment-16705556 ]

ASF GitHub Bot commented on ORC-444:


wgtmac commented on a change in pull request #345: ORC-444: Fix errors in RLE 
section in ORC spec and improve RLEV2 encoder code.
URL: https://github.com/apache/orc/pull/345#discussion_r238045567
 
 

 ##
 File path: c++/src/RLEV2Util.cc
 ##
 @@ -0,0 +1,29 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with option work for additional information
+ * regarding copyright ownership.  The ASF licenses option file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use option file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "RLEV2Util.hh"
+
+namespace orc {
+
+  // Map FBS enum to bit width value.
+  const uint32_t FBSToBitWidthMap[FixedBitSizes::SIZE] = {
 
 Review comment:
   I don't think this reflects the original mapping in the function 
decodeBitWidth. For instance, n = 24 or 32 will not get correct results.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.
> -
>
> Key: ORC-444
> URL: https://issues.apache.org/jira/browse/ORC-444
> Project: ORC
>  Issue Type: Improvement
>  Components: C++, documentation
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fixed a few typos and format errors in the RLE section in ORC 
> specification documentation.
> 2. Improve decodeBitWidth() function in RLEV2Util.hh.
> Please see PR for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-444) Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705540#comment-16705540 ]

ASF GitHub Bot commented on ORC-444:


fangzheng opened a new pull request #345: ORC-444: Fix errors in RLE section in 
ORC spec and improve RLEV2 encoder code.
URL: https://github.com/apache/orc/pull/345
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix errors in RLE section in ORC spec and improve RLEV2 encoder code.
> -
>
> Key: ORC-444
> URL: https://issues.apache.org/jira/browse/ORC-444
> Project: ORC
>  Issue Type: Improvement
>  Components: C++, documentation
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fixed a few typos and format errors in the RLE section in ORC 
> specification documentation.
> 2. Improve decodeBitWidth() function in RLEV2Util.hh.
> Please see PR for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705120#comment-16705120 ]

ASF GitHub Bot commented on ORC-443:


wgtmac commented on a change in pull request #344: ORC-443: [C++] Code 
improvements in ColumnWriter.
URL: https://github.com/apache/orc/pull/344#discussion_r237956620
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -1477,22 +1508,25 @@ namespace orc {
   void Decimal64ColumnWriter::add(ColumnVectorBatch& rowBatch,
   uint64_t offset,
   uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
     const Decimal64VectorBatch* decBatch =
       dynamic_cast<const Decimal64VectorBatch*>(&rowBatch);
 if (decBatch == nullptr) {
   throw InvalidArgument("Failed to cast to Decimal64VectorBatch");
 }
 
-const char* notNull = decBatch->hasNulls ?
-  decBatch->notNull.data() + offset : nullptr;
-const int64_t* values = decBatch->values.data() + offset;
     DecimalColumnStatisticsImpl* decStats =
-      dynamic_cast<DecimalColumnStatisticsImpl*>(colIndexStatistics.get());
+        dynamic_cast<DecimalColumnStatisticsImpl*>(colIndexStatistics.get());
 
 Review comment:
   same here and below


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. in add() function, re-order code to verify input parameters before 
> modifying any internal state.
> 2. in add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop. Many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705148#comment-16705148 ]

ASF GitHub Bot commented on ORC-443:


fangzheng commented on issue #344: ORC-443: [C++] Code improvements in 
ColumnWriter.
URL: https://github.com/apache/orc/pull/344#issuecomment-443297986
 
 
   Thanks, Gang! All the code is fixed now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. In the add() function, reorder the code to verify input parameters before 
> modifying any internal state.
> 2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop, since many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705145#comment-16705145
 ] 

ASF GitHub Bot commented on ORC-443:


fangzheng commented on a change in pull request #344: ORC-443: [C++] Code 
improvements in ColumnWriter.
URL: https://github.com/apache/orc/pull/344#discussion_r237959728
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -762,25 +770,26 @@ namespace orc {
   void DoubleColumnWriter::add(ColumnVectorBatch& rowBatch,
uint64_t offset,
uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
 const DoubleVectorBatch* dblBatch =
  dynamic_cast<const DoubleVectorBatch*>(&rowBatch);
 if (dblBatch == nullptr) {
   throw InvalidArgument("Failed to cast to DoubleVectorBatch");
 }
-
-const double* doubleData = dblBatch->data.data() + offset;
-const char* notNull = dblBatch->hasNulls ?
-  dblBatch->notNull.data() + offset : nullptr;
-
 DoubleColumnStatisticsImpl* doubleStats =
-  dynamic_cast<DoubleColumnStatisticsImpl*>(colIndexStatistics.get());
+dynamic_cast<DoubleColumnStatisticsImpl*>(colIndexStatistics.get());
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. In the add() function, reorder the code to verify input parameters before 
> modifying any internal state.
> 2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop, since many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705144#comment-16705144
 ] 

ASF GitHub Bot commented on ORC-443:


fangzheng commented on a change in pull request #344: ORC-443: [C++] Code 
improvements in ColumnWriter.
URL: https://github.com/apache/orc/pull/344#discussion_r237959714
 
 

 ##
 File path: c++/src/ByteRLE.cc
 ##
 @@ -55,7 +55,7 @@ namespace orc {
 virtual uint64_t getBufferSize() const override;
 
 /**
- * Flushing underlying BufferedOutputStream
+ * Flush underlying BufferedOutputStream.
 */
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. In the add() function, reorder the code to verify input parameters before 
> modifying any internal state.
> 2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop, since many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705119#comment-16705119
 ] 

ASF GitHub Bot commented on ORC-443:


wgtmac commented on a change in pull request #344: ORC-443: [C++] Code 
improvements in ColumnWriter.
URL: https://github.com/apache/orc/pull/344#discussion_r237956027
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -762,25 +770,26 @@ namespace orc {
   void DoubleColumnWriter::add(ColumnVectorBatch& rowBatch,
uint64_t offset,
uint64_t numValues) {
-ColumnWriter::add(rowBatch, offset, numValues);
 const DoubleVectorBatch* dblBatch =
  dynamic_cast<const DoubleVectorBatch*>(&rowBatch);
 if (dblBatch == nullptr) {
   throw InvalidArgument("Failed to cast to DoubleVectorBatch");
 }
-
-const double* doubleData = dblBatch->data.data() + offset;
-const char* notNull = dblBatch->hasNulls ?
-  dblBatch->notNull.data() + offset : nullptr;
-
 DoubleColumnStatisticsImpl* doubleStats =
-  dynamic_cast<DoubleColumnStatisticsImpl*>(colIndexStatistics.get());
+dynamic_cast<DoubleColumnStatisticsImpl*>(colIndexStatistics.get());
 
 Review comment:
   please use same indent


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. In the add() function, reorder the code to verify input parameters before 
> modifying any internal state.
> 2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop, since many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-443) [C++] Code improvements in ColumnWriter

2018-11-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705121#comment-16705121
 ] 

ASF GitHub Bot commented on ORC-443:


wgtmac commented on a change in pull request #344: ORC-443: [C++] Code 
improvements in ColumnWriter.
URL: https://github.com/apache/orc/pull/344#discussion_r237947011
 
 

 ##
 File path: c++/src/ByteRLE.cc
 ##
 @@ -55,7 +55,7 @@ namespace orc {
 virtual uint64_t getBufferSize() const override;
 
 /**
- * Flushing underlying BufferedOutputStream
+ * Flush underlying BufferedOutputStream.
 */
 
 Review comment:
   please also align this line


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in ColumnWriter
> ---
>
> Key: ORC-443
> URL: https://issues.apache.org/jira/browse/ORC-443
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few changes to ColumnWriter and its derived classes:
> 1. In the add() function, reorder the code to verify input parameters before 
> modifying any internal state.
> 2. In the add() function, move the calls to colIndexStatistics->increase(1) out 
> of the loop, since many of those are virtual function calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-442) [C++] Code improvements in Statistics and Writer

2018-11-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703932#comment-16703932
 ] 

ASF GitHub Bot commented on ORC-442:


fangzheng opened a new pull request #343: ORC-442: [C++] Code improvements in 
Statistics and Writer.
URL: https://github.com/apache/orc/pull/343
 
 
   A few code changes in Statistics and Writer classes:
   
   1. Change StatisticsImpl to use vector instead of list for storing 
ColumnStatistics. The required operations are push_back() in the ctor, iteration 
in the dtor, and random element access in getColumnStatistics(); since list does 
not support constant-time random access, vector is the better fit (see the 
sketch below).
   
   2. InternalBooleanStatistics is currently typedef-ed as 
InternalStatisticsImpl. Since min/max/sum does not apply to 
BooleanColumnStatistics, we should define InternalBooleanStatistics to be 
InternalStatisticsImpl to save 21 bytes per instance.
   
   3. Misc. changes to ColumnWriter.hh, Writer.cc, Compression.hh, and 
Statistics.hh to fix typos in Doxygen and reduce object copies.
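
The container trade-off in item 1 is the usual std::vector / std::list
distinction; the same asymptotics are easy to demonstrate in Java terms with
ArrayList and LinkedList (an illustration, not ORC code):

{code:java}
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Indexed access is O(1) on an array-backed list and O(n) on a linked
// list, while appends are cheap on both, matching the reasoning above.
public class ContainerTradeoff {
  public static void main(String[] args) {
    List<Integer> arrayBacked = new ArrayList<>();
    List<Integer> linked = new LinkedList<>();
    for (int i = 0; i < 1_000_000; i++) {
      arrayBacked.add(i);  // push_back-style append
      linked.add(i);
    }
    long t0 = System.nanoTime();
    arrayBacked.get(999_999);  // constant time
    long t1 = System.nanoTime();
    linked.get(999_999);       // walks the whole chain
    long t2 = System.nanoTime();
    System.out.printf("ArrayList get: %d ns, LinkedList get: %d ns%n",
        t1 - t0, t2 - t1);
  }
}
{code}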


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Code improvements in Statistics and Writer
> 
>
> Key: ORC-442
> URL: https://issues.apache.org/jira/browse/ORC-442
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> A few code changes in Statistics and Writer classes:
> 1. Change StatisticsImpl to use vector instead of list for storing 
> ColumnStatistics. The required operations are push_back() in the ctor, 
> iteration in the dtor, and random element access in getColumnStatistics(); 
> since list does not support constant-time random access, vector is the 
> better fit.
> 2.  InternalBooleanStatistics is currently typedef-ed as 
> InternalStatisticsImpl. Since min/max/sum does not apply to 
> BooleanColumnStatistics, we should define InternalBooleanStatistics to be 
> InternalStatisticsImpl to save 21 bytes per instance.
> 3. Misc. changes to ColumnWriter.hh, Writer.cc, Compression.hh, and 
> Statistics.hh to fix typos in Doxygen and reduce object copies.
> Please see PR for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703525#comment-16703525
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on issue #326: ORC-420: [C++] Implement string 
dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#issuecomment-442916968
 
 
   @wgtmac I will take a look at this today/tomorrow. I missed the previous 
notification.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702332#comment-16702332
 ] 

ASF GitHub Bot commented on ORC-434:


wgtmac commented on issue #337: ORC-434: Fix incorrect documentation for 
Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#issuecomment-442580692
 
 
   Though we can still write the v0 format, the ORC-135 fix is not in it. IIUC, I 
believe it is OK not to mention this in the ORCv0 doc. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC (see the 
> example below).
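
As a quick worked example of the corrected statement, the day counts that Date
column statistics store can be checked with plain java.time (illustrative
code, not part of ORC):

{code:java}
import java.time.LocalDate;

// Date statistics are days since 1970-01-01 (UTC): the epoch itself is
// day 0, and 2015-01-01, the date the old text mistook for the epoch,
// is day 16436.
public class EpochDays {
  public static void main(String[] args) {
    System.out.println(LocalDate.of(1970, 1, 1).toEpochDay()); // 0
    System.out.println(LocalDate.of(2015, 1, 1).toEpochDay()); // 16436
  }
}
{code}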



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702199#comment-16702199
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on issue #326: ORC-420: [C++] Implement string dictionary 
encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#issuecomment-442541764
 
 
   @majetideepak Have you had a chance to review it again? Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-28 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702106#comment-16702106
 ] 

ASF GitHub Bot commented on ORC-434:


omalley commented on issue #337: ORC-434: Fix incorrect documentation for 
Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#issuecomment-442510469
 
 
   Ok, I committed this. I misremembered exactly what we did in ORC-135. It was 
always from the UNIX epoch, but we weren't adjusting the values into UTC before 
ORC-135. Please look at the text I put in to see if it is clear. ORCv2 won't 
include the pre-ORC-135 behavior, so I removed that part. Now that I think about it, 
ORCv0 probably should include the ORC-135 discussion, because we can still 
write the v0 format.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701193#comment-16701193
 ] 

ASF GitHub Bot commented on ORC-434:


wgtmac commented on a change in pull request #337: ORC-434: Fix incorrect 
documentation for Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#discussion_r236895567
 
 

 ##
 File path: site/specification/ORCv0.md
 ##
 @@ -308,7 +308,7 @@ message DateStatistics {
 ```
 
 Timestamp columns record the minimum and maximum values as the number of
-milliseconds since the epoch (1/1/2015).
+milliseconds since the UNIX epoch (1/1/1970 00:00:00 in UTC).
 
 Review comment:
   Thanks for pointing this out! I think ORC-135 belongs to ORCv1 only; 
therefore I will only fix ORCv1.md and leave ORCv0.md as is.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701184#comment-16701184
 ] 

ASF GitHub Bot commented on ORC-434:


omalley commented on a change in pull request #337: ORC-434: Fix incorrect 
documentation for Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#discussion_r236893331
 
 

 ##
 File path: site/specification/ORCv0.md
 ##
 @@ -297,7 +297,7 @@ message DecimalStatistics {
 ```
 
 Date columns record the minimum and maximum values as the number of
-days since the epoch (1/1/2015).
+days since the UNIX epoch (1/1/1970 in UTC).
 
 Review comment:
   This is correct.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701183#comment-16701183
 ] 

ASF GitHub Bot commented on ORC-434:


omalley commented on a change in pull request #337: ORC-434: Fix incorrect 
documentation for Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#discussion_r236893631
 
 

 ##
 File path: site/specification/ORCv0.md
 ##
 @@ -308,7 +308,7 @@ message DateStatistics {
 ```
 
 Timestamp columns record the minimum and maximum values as the number of
-milliseconds since the epoch (1/1/2015).
+milliseconds since the UNIX epoch (1/1/1970 00:00:00 in UTC).
 
 Review comment:
   This is wrong. Timestamps were millis since 1/1/2015 until we added 
minimumUtc and maximumUtc in ORC-135 (f2b8b799c).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701145#comment-16701145
 ] 

ASF GitHub Bot commented on ORC-435:


omalley closed pull request #338: ORC-435: Ability to read stripes that are 
greater than 2GB
URL: https://github.com/apache/orc/pull/338
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/OrcConf.java 
b/java/core/src/java/org/apache/orc/OrcConf.java
index bdf8c0d2db..23e301e96e 100644
--- a/java/core/src/java/org/apache/orc/OrcConf.java
+++ b/java/core/src/java/org/apache/orc/OrcConf.java
@@ -158,6 +158,9 @@
   + "HDFS blocks."),
   DIRECT_ENCODING_COLUMNS("orc.column.encoding.direct", 
"orc.column.encoding.direct", "",
   "Comma-separated list of columns for which dictionary encoding is to be 
skipped."),
+  // some JVMs don't allow array creation of size Integer.MAX_VALUE, so the 
chunk size is slightly less than max int
+  ORC_MAX_DISK_RANGE_CHUNK_LIMIT("orc.max.disk.range.chunk.limit", 
"hive.exec.orc.max.disk.range.chunk.limit",
+Integer.MAX_VALUE - 1024, "When reading stripes >2GB, specify max limit 
for the chunk size.")
   ;
 
   private final String attribute;
@@ -205,6 +208,22 @@ private String lookupValue(Properties tbl, Configuration 
conf) {
 return result;
   }
 
+  public int getInt(Properties tbl, Configuration conf) {
+String value = lookupValue(tbl, conf);
+if (value != null) {
+  return Integer.parseInt(value);
+}
+return ((Number) defaultValue).intValue();
+  }
+
+  public int getInt(Configuration conf) {
+return getInt(null, conf);
+  }
+
+  public void setInt(Configuration conf, int value) {
+conf.setInt(attribute, value);
+  }
+
   public long getLong(Properties tbl, Configuration conf) {
 String value = lookupValue(tbl, conf);
 if (value != null) {
diff --git a/java/core/src/java/org/apache/orc/impl/DataReaderProperties.java 
b/java/core/src/java/org/apache/orc/impl/DataReaderProperties.java
index fbdc145ce1..8420149a98 100644
--- a/java/core/src/java/org/apache/orc/impl/DataReaderProperties.java
+++ b/java/core/src/java/org/apache/orc/impl/DataReaderProperties.java
@@ -20,6 +20,7 @@
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcConf;
 
 public final class DataReaderProperties {
 
@@ -29,6 +30,7 @@
   private final boolean zeroCopy;
   private final int typeCount;
   private final int bufferSize;
+  private final int maxDiskRangeChunkLimit;
 
   private DataReaderProperties(Builder builder) {
 this.fileSystem = builder.fileSystem;
@@ -37,6 +39,7 @@ private DataReaderProperties(Builder builder) {
 this.zeroCopy = builder.zeroCopy;
 this.typeCount = builder.typeCount;
 this.bufferSize = builder.bufferSize;
+this.maxDiskRangeChunkLimit = builder.maxDiskRangeChunkLimit;
   }
 
   public FileSystem getFileSystem() {
@@ -63,6 +66,10 @@ public int getBufferSize() {
 return bufferSize;
   }
 
+  public int getMaxDiskRangeChunkLimit() {
+return maxDiskRangeChunkLimit;
+  }
+
   public static Builder builder() {
 return new Builder();
   }
@@ -75,6 +82,7 @@ public static Builder builder() {
 private boolean zeroCopy;
 private int typeCount;
 private int bufferSize;
+private int maxDiskRangeChunkLimit = (int) 
OrcConf.ORC_MAX_DISK_RANGE_CHUNK_LIMIT.getDefaultValue();
 
 private Builder() {
 
@@ -110,6 +118,11 @@ public Builder withBufferSize(int value) {
   return this;
 }
 
+public Builder withMaxDiskRangeChunkLimit(int value) {
+  this.maxDiskRangeChunkLimit = value;
+  return this;
+}
+
 public DataReaderProperties build() {
   if (fileSystem == null || path == null) {
 throw new NullPointerException("Filesystem = " + fileSystem +
diff --git a/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java 
b/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
index 2b8a6bf46a..3c4342a423 100644
--- a/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
+++ b/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java
@@ -99,6 +99,7 @@
   private final DataReader dataReader;
   private final boolean ignoreNonUtf8BloomFilter;
   private final OrcFile.WriterVersion writerVersion;
+  private final int maxDiskRangeChunkLimit;
 
   /**
* Given a list of column names, find the given column and return the index.
@@ -231,7 +232,7 @@ protected RecordReaderImpl(ReaderImpl fileReader,
 rows += stripe.getNumberOfRows();
   }
 }
-
+this.maxDiskRangeChunkLimit = 
OrcConf.ORC_MAX_DISK_RANGE_CHUNK_LIMIT.getInt(fileReader.conf);
 

[jira] [Commented] (ORC-440) Fix deserialization of the StringStatisticsImpl when truncated

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701044#comment-16701044
 ] 

ASF GitHub Bot commented on ORC-440:


omalley closed pull request #342: ORC-440. Fix deserialization of string column 
statistics with truncation
URL: https://github.com/apache/orc/pull/342
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
b/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
index 1dd0418923..e983f04c68 100644
--- a/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
+++ b/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
@@ -538,9 +538,15 @@ public int hashCode() {
   OrcProto.StringStatistics str = stats.getStringStatistics();
   if (str.hasMaximum()) {
 maximum = new Text(str.getMaximum());
+  } else if (str.hasUpperBound()) {
+maximum = new Text(str.getUpperBound());
+isUpperBoundSet = true;
   }
   if (str.hasMinimum()) {
 minimum = new Text(str.getMinimum());
+  } else if (str.hasLowerBound()) {
+minimum = new Text(str.getLowerBound());
+isLowerBoundSet = true;
   }
   if(str.hasSum()) {
 sum = str.getSum();
@@ -642,10 +648,16 @@ public void merge(ColumnStatisticsImpl other) {
   OrcProto.StringStatistics.Builder str =
 OrcProto.StringStatistics.newBuilder();
   if (getNumberOfValues() != 0) {
-/* getLowerBound() will ALWAYS return min value */
-str.setMinimum(getLowerBound());
-/* getUpperBound() will ALWAYS return max value */
-str.setMaximum(getUpperBound());
+if (isLowerBoundSet) {
+  str.setLowerBound(minimum.toString());
+} else {
+  str.setMinimum(minimum.toString());
+}
+if (isUpperBoundSet) {
+  str.setUpperBound(maximum.toString());
+} else {
+  str.setMaximum(maximum.toString());
+}
 str.setSum(sum);
   }
   result.setStringStatistics(str);
diff --git a/java/core/src/test/org/apache/orc/TestColumnStatistics.java 
b/java/core/src/test/org/apache/orc/TestColumnStatistics.java
index 5908675070..b1f4dea13f 100644
--- a/java/core/src/test/org/apache/orc/TestColumnStatistics.java
+++ b/java/core/src/test/org/apache/orc/TestColumnStatistics.java
@@ -357,14 +357,24 @@ public void testUpperBoundCodepointIncrement() {
 final ColumnStatisticsImpl stats1 = ColumnStatisticsImpl.create(schema);
 byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
 stats1.updateString(utf8, 0, utf8.length, 1);
-
+stats1.increment();
 final StringColumnStatistics typed = (StringColumnStatistics) stats1;
 
 assertEquals(354, typed.getUpperBound().length());
 assertEquals(354, typed.getLowerBound().length());
+assertEquals(1764L, typed.getSum());
 
 assertEquals(upperbound, typed.getUpperBound());
 assertEquals(lowerBound, typed.getLowerBound());
+OrcProto.ColumnStatistics serial = stats1.serialize().build();
+ColumnStatisticsImpl stats2 =
+ColumnStatisticsImpl.deserialize(schema, serial);
+StringColumnStatistics typed2 = (StringColumnStatistics) stats2;
+assertEquals(null, typed2.getMinimum());
+assertEquals(null, typed2.getMaximum());
+assertEquals(lowerBound, typed2.getLowerBound());
+assertEquals(upperbound, typed2.getUpperBound());
+assertEquals(1764L, typed2.getSum());
   }
 
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix deserialization of the StringStatisticsImpl when truncated
> --
>
> Key: ORC-440
> URL: https://issues.apache.org/jira/browse/ORC-440
> Project: ORC
>  Issue Type: Task
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> In ORC-203, we added support for truncating minimum and maximum strings based 
> on their length. The deserialization method doesn't look for the new values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701018#comment-16701018
 ] 

ASF GitHub Bot commented on ORC-434:


wgtmac commented on issue #337: ORC-434: Fix incorrect documentation for 
Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337#issuecomment-442213522
 
 
   @omalley Can you please help verify this doc change? Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of ORC specification, the following two statements are 
> wrong:
>  * Date columns record the minimum and maximum values as the number of days 
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of 
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 in UTC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-438) NPE in StringStatisticsImpl.merge()

2018-11-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699622#comment-16699622
 ] 

ASF GitHub Bot commented on ORC-438:


omalley closed pull request #341: ORC-438 - Fix possible NPEs in 
StringStatisticsImpl.merge()
URL: https://github.com/apache/orc/pull/341
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java 
b/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
index 62a756359a..7f0012b74e 100644
--- a/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
+++ b/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java
@@ -605,27 +605,15 @@ public void updateString(byte[] bytes, int offset, int 
length,
 public void merge(ColumnStatisticsImpl other) {
   if (other instanceof StringStatisticsImpl) {
 StringStatisticsImpl str = (StringStatisticsImpl) other;
-if (minimum == null) {
-  if (str.minimum != null) {
-maximum = new Text(str.getMaximum());
-minimum = new Text(str.getMinimum());
-  }
-  /* str.minimum == null when lower bound set */
-  else if (str.isLowerBoundSet) {
-minimum = new Text(str.getLowerBound());
+if (count == 0) {
+  if (str.count != 0) {
+minimum = new Text(str.minimum);
 isLowerBoundSet = str.isLowerBoundSet;
-
-/* check for upper bound before setting max */
-if (str.isUpperBoundSet) {
-  maximum = new Text(str.getUpperBound());
-  isUpperBoundSet = str.isUpperBoundSet;
-} else {
-  maximum = new Text(str.getMaximum());
-}
-  }
-  else {
+maximum = new Text(str.maximum);
+isUpperBoundSet = str.isUpperBoundSet;
+  } else {
   /* both are empty */
-maximum = minimum = null;
+  maximum = minimum = null;
   }
 } else if (str.minimum != null) {
   if (minimum.compareTo(str.minimum) > 0) {
@@ -660,8 +648,10 @@ else if (str.isLowerBoundSet) {
   OrcProto.StringStatistics.Builder str =
 OrcProto.StringStatistics.newBuilder();
   if (getNumberOfValues() != 0) {
-str.setMinimum(getMinimum());
-str.setMaximum(getMaximum());
+/* getLowerBound() will ALWAYS return min value */
+str.setMinimum(getLowerBound());
+/* getUpperBound() will ALWAYS return max value */
+str.setMaximum(getUpperBound());
 str.setSum(sum);
   }
   result.setStringStatistics(str);
@@ -1638,7 +1628,7 @@ public int hashCode() {
 }
   }
 
-  private long count = 0;
+  protected long count = 0;
   private boolean hasNull = false;
   private long bytesOnDisk = 0;
 
diff --git a/java/core/src/test/org/apache/orc/TestColumnStatistics.java 
b/java/core/src/test/org/apache/orc/TestColumnStatistics.java
index e6d6953815..5908675070 100644
--- a/java/core/src/test/org/apache/orc/TestColumnStatistics.java
+++ b/java/core/src/test/org/apache/orc/TestColumnStatistics.java
@@ -98,11 +98,14 @@ public void testStringMerge() throws Exception {
 
 ColumnStatisticsImpl stats1 = ColumnStatisticsImpl.create(schema);
 ColumnStatisticsImpl stats2 = ColumnStatisticsImpl.create(schema);
+stats1.increment(3);
 stats1.updateString(new Text("bob"));
 stats1.updateString(new Text("david"));
 stats1.updateString(new Text("charles"));
+stats2.increment(2);
 stats2.updateString(new Text("anne"));
 byte[] erin = new byte[]{0, 1, 2, 3, 4, 5, 101, 114, 105, 110};
+stats2.increment();
 stats2.updateString(erin, 6, 4, 5);
 assertEquals(24, ((StringColumnStatistics)stats2).getSum());
 stats1.merge(stats2);
@@ -111,6 +114,7 @@ public void testStringMerge() throws Exception {
 assertEquals("erin", typed.getMaximum());
 assertEquals(39, typed.getSum());
 stats1.reset();
+stats1.increment(2);
 stats1.updateString(new Text("aaa"));
 stats1.updateString(new Text("zzz"));
 stats1.merge(stats2);
@@ -131,6 +135,7 @@ public void testUpperAndLowerBounds() throws Exception {
 final ColumnStatisticsImpl stats2 = ColumnStatisticsImpl.create(schema);
 
 /* test a scenario for the first max string */
+stats1.increment();
 stats1.updateString(new Text(test));
 
 final StringColumnStatistics typed = (StringColumnStatistics) stats1;
@@ -146,6 +151,7 @@ public void testUpperAndLowerBounds() throws Exception {
 stats1.reset();
 
 /* test a scenario for the first max bytes */
+stats1.increment();
 

[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

2018-11-26 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699474#comment-16699474
 ] 

ASF GitHub Bot commented on ORC-435:


omalley commented on a change in pull request #338: ORC-435: Ability to read 
stripes that are greater than 2GB
URL: https://github.com/apache/orc/pull/338#discussion_r236397669
 
 

 ##
 File path: java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java
 ##
 @@ -531,39 +534,51 @@ static DiskRangeList readDiskRanges(FSDataInputStream 
file,
 range = range.next;
 continue;
   }
-  int len = (int) (range.getEnd() - range.getOffset());
+  boolean hasReplaced = false;
+  long len = range.getEnd() - range.getOffset();
   long off = range.getOffset();
-  if (zcr != null) {
-file.seek(base + off);
-boolean hasReplaced = false;
-while (len > 0) {
-  ByteBuffer partial = zcr.readBuffer(len, false);
-  BufferChunk bc = new BufferChunk(partial, off);
-  if (!hasReplaced) {
-range.replaceSelfWith(bc);
-hasReplaced = true;
+  while (len > 0) {
+BufferChunk bc;
+
+// Stripe could be too large to read fully into a single buffer and 
will need to be chunked
 
 Review comment:
   No, it is caused because Java limits arrays to 2^31 elements, even on 64-bit 
architectures.
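
That array limit is exactly what the chunking works around: a range longer
than one byte[] can hold must be split into int-sized pieces. A minimal sketch
of the splitting logic, with the cap mirroring the Integer.MAX_VALUE - 1024
default from the OrcConf change above (illustrative code, not the ORC reader):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Split a possibly >2GB byte range into chunk lengths that each fit in a
// single Java array, which is capped near Integer.MAX_VALUE elements.
public class ChunkedRange {
  static final int MAX_CHUNK = Integer.MAX_VALUE - 1024;

  static List<Integer> chunkLengths(long totalLen) {
    List<Integer> chunks = new ArrayList<>();
    long remaining = totalLen;
    while (remaining > 0) {
      int len = (int) Math.min(remaining, MAX_CHUNK);
      chunks.add(len);  // each chunk can back one byte[] / ByteBuffer
      remaining -= len;
    }
    return chunks;
  }

  public static void main(String[] args) {
    // A 5GB stripe needs three chunks.
    System.out.println(chunkLengths(5L * 1024 * 1024 * 1024));
  }
}
{code}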


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ability to read stripes that are greater than 2GB
> -
>
> Key: ORC-435
> URL: https://issues.apache.org/jira/browse/ORC-435
> Project: ORC
>  Issue Type: Bug
>  Components: Reader
>Affects Versions: 1.3.4, 1.4.4, 1.6.0, 1.5.3
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Fix For: 1.5.4, 1.6.0
>
>
> The ORC reader fails with NegativeArraySizeException if the stripe size is >2GB. 
> Even though the default stripe size is 64MB, there are cases where the stripe 
> size reaches >2GB before the memory manager can kick in to check memory size. 
> Say we are inserting 500KB (mostly unique) strings: by the time we reach 5000 
> rows, the stripe size is already over 2GB. The reader has to chunk the disk 
> range reads for such cases instead of reading the stripe as one whole blob. 
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.NegativeArraySizeException
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695175#comment-16695175
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta closed pull request #340: ORC-437: Make acid schema checks case 
insensitive
URL: https://github.com/apache/orc/pull/340
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java 
b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
index 480f13ab25..aa7ba9cb36 100644
--- a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
+++ b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
@@ -584,9 +584,15 @@ void buildIdentityConversion(TypeDescription readerType) {
   private static boolean checkAcidSchema(TypeDescription type) {
 if (type.getCategory().equals(TypeDescription.Category.STRUCT)) {
  List<String> rootFields = type.getFieldNames();
-  if (acidEventFieldNames.equals(rootFields)) {
-return true;
+  if (rootFields.size() != acidEventFieldNames.size()) {
+return false;
+  }
+  for (int i = 0; i < rootFields.size(); i++) {
+if (!acidEventFieldNames.get(i).equalsIgnoreCase(rootFields.get(i))) {
+  return false;
+}
   }
+  return true;
 }
 return false;
   }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch, ORC-437.2.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive (see the sketch after this message).
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.
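
A small illustration of why the exact-match comparison fails after a Hive-side
compaction; the field names follow the ACID event schema but should be treated
as illustrative here:

{code:java}
import java.util.Arrays;
import java.util.List;

// List.equals() rejects the lower-cased names Hive writes; the
// element-wise equalsIgnoreCase comparison from the patch accepts them.
public class AcidNameCheck {
  public static void main(String[] args) {
    List<String> expected = Arrays.asList("operation", "originalTransaction", "bucket");
    List<String> fromHive = Arrays.asList("operation", "originaltransaction", "bucket");

    System.out.println(expected.equals(fromHive)); // false

    boolean match = expected.size() == fromHive.size();
    for (int i = 0; match && i < expected.size(); i++) {
      match = expected.get(i).equalsIgnoreCase(fromHive.get(i));
    }
    System.out.println(match); // true
  }
}
{code}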



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695174#comment-16695174
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta commented on issue #340: ORC-437: Make acid schema checks case 
insensitive
URL: https://github.com/apache/orc/pull/340#issuecomment-440786311
 
 
   Pushed upstream


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch, ORC-437.2.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive.
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693877#comment-16693877
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta closed pull request #340: ORC-437: Make acid schema checks case 
insensitive
URL: https://github.com/apache/orc/pull/340
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java 
b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
index 480f13ab25..aa7ba9cb36 100644
--- a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
+++ b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
@@ -584,9 +584,15 @@ void buildIdentityConversion(TypeDescription readerType) {
   private static boolean checkAcidSchema(TypeDescription type) {
 if (type.getCategory().equals(TypeDescription.Category.STRUCT)) {
  List<String> rootFields = type.getFieldNames();
-  if (acidEventFieldNames.equals(rootFields)) {
-return true;
+  if (rootFields.size() != acidEventFieldNames.size()) {
+return false;
+  }
+  for (int i = 0; i < rootFields.size(); i++) {
+if (!acidEventFieldNames.get(i).equalsIgnoreCase(rootFields.get(i))) {
+  return false;
+}
   }
+  return true;
 }
 return false;
   }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch, ORC-437.2.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive.
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693878#comment-16693878
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta opened a new pull request #340: ORC-437: Make acid schema 
checks case insensitive
URL: https://github.com/apache/orc/pull/340
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch, ORC-437.2.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive.
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693798#comment-16693798
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta opened a new pull request #340: ORC-437: Make acid schema 
checks case insensitive
URL: https://github.com/apache/orc/pull/340
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive.
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-437) Make acid schema checks case insensitive

2018-11-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693790#comment-16693790
 ] 

ASF GitHub Bot commented on ORC-437:


vaibhavgumashta closed pull request #339: ORC-437: Make acid schema checks case 
insensitive
URL: https://github.com/apache/orc/pull/339
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java 
b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
index 480f13ab25..5218e721de 100644
--- a/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
+++ b/java/core/src/java/org/apache/orc/impl/SchemaEvolution.java
@@ -584,9 +584,12 @@ void buildIdentityConversion(TypeDescription readerType) {
   private static boolean checkAcidSchema(TypeDescription type) {
 if (type.getCategory().equals(TypeDescription.Category.STRUCT)) {
  List<String> rootFields = type.getFieldNames();
-  if (acidEventFieldNames.equals(rootFields)) {
-return true;
+  for (int i = 0; i < acidEventFieldNames.size(); i++) {
+if (!acidEventFieldNames.get(i).equalsIgnoreCase(rootFields.get(i))) {
+  return false;
+}
   }
+  return true;
 }
 return false;
   }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Make acid schema checks case insensitive
> 
>
> Key: ORC-437
> URL: https://issues.apache.org/jira/browse/ORC-437
> Project: ORC
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.5.3
>Reporter: Vaibhav Gumashta
>Assignee: Vaibhav Gumashta
>Priority: Major
> Attachments: ORC-437.1.patch
>
>
> When reading from an Orc file, SchemaEvolution tries to determine 
> whether this is an Acid-compliant format by comparing field names with Acid 
> event names in {{SchemaEvolution.checkAcidSchema}}. It would be good to make 
> this comparison case insensitive.
> This requirement comes from HIVE-20699, where a Hive query is used to 
> run compaction (and hence write the compacted data to the bucket files via a 
> HiveQL query). Since Hive converts all column names to lower case, the 
> compacted files end up with lower-case Acid schema columns. The change is 
> much simpler when made in Orc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

2018-11-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692422#comment-16692422
 ] 

ASF GitHub Bot commented on ORC-435:


xndai commented on a change in pull request #338: ORC-435: Ability to read 
stripes that are greater than 2GB
URL: https://github.com/apache/orc/pull/338#discussion_r234820639
 
 

 ##
 File path: java/core/src/java/org/apache/orc/impl/RecordReaderUtils.java
 ##
 @@ -531,39 +534,51 @@ static DiskRangeList readDiskRanges(FSDataInputStream 
file,
 range = range.next;
 continue;
   }
-  int len = (int) (range.getEnd() - range.getOffset());
+  boolean hasReplaced = false;
+  long len = range.getEnd() - range.getOffset();
   long off = range.getOffset();
-  if (zcr != null) {
-file.seek(base + off);
-boolean hasReplaced = false;
-while (len > 0) {
-  ByteBuffer partial = zcr.readBuffer(len, false);
-  BufferChunk bc = new BufferChunk(partial, off);
-  if (!hasReplaced) {
-range.replaceSelfWith(bc);
-hasReplaced = true;
+  while (len > 0) {
+BufferChunk bc;
+
+// Stripe could be too large to read fully into a single buffer and 
will need to be chunked
 
 Review comment:
   Just curious, under what conditions will the reader read a stripe fully? At 
least I don't think there's such behavior on the C++ side.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ability to read stripes that are greater than 2GB
> -
>
> Key: ORC-435
> URL: https://issues.apache.org/jira/browse/ORC-435
> Project: ORC
>  Issue Type: Bug
>  Components: Reader
>Affects Versions: 1.3.4, 1.4.4, 1.6.0, 1.5.3
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Fix For: 1.5.4, 1.6.0
>
>
> The ORC reader fails with NegativeArraySizeException if the stripe size is >2GB. 
> Even though the default stripe size is 64MB, there are cases where the stripe 
> size reaches >2GB before the memory manager can kick in to check memory size. 
> Say we are inserting 500KB (mostly unique) strings: by the time we reach 5000 
> rows, the stripe size is already over 2GB. The reader has to chunk the disk 
> range reads for such cases instead of reading the stripe as one whole blob. 
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.NegativeArraySizeException
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-435) Ability to read stripes that are greater than 2GB

2018-11-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682108#comment-16682108
 ] 

ASF GitHub Bot commented on ORC-435:


prasanthj opened a new pull request #338: ORC-435: Ability to read stripes that 
are greater than 2GB
URL: https://github.com/apache/orc/pull/338
 
 
   ORC reader fails with NegativeArraySizeException if the stripe size is >2GB.
Even though the default stripe size is 64MB, there are cases where the stripe
size will reach >2GB before the memory manager can kick in to check memory
usage. Say we are inserting 500KB strings (mostly unique): by the time we reach
5000 rows the stripe size is already over 2GB. The reader will have to chunk
the disk range reads for such cases instead of reading the stripe as one whole
blob.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ability to read stripes that are greater than 2GB
> -
>
> Key: ORC-435
> URL: https://issues.apache.org/jira/browse/ORC-435
> Project: ORC
>  Issue Type: Bug
>  Components: Reader
>Affects Versions: 1.3.4, 1.4.4, 1.6.0, 1.5.3
>Reporter: Prasanth Jayachandran
>Priority: Major
> Fix For: 1.5.4, 1.6.0
>
>
> ORC reader fails with NegativeArraySizeException if the stripe size is >2GB.
> Even though the default stripe size is 64MB, there are cases where the stripe
> size will reach >2GB before the memory manager can kick in to check memory
> usage. Say we are inserting 500KB strings (mostly unique): by the time we
> reach 5000 rows the stripe size is already over 2GB. The reader will have to
> chunk the disk range reads for such cases instead of reading the stripe as
> one whole blob.
> Exception thrown when reading such files:
> {code:java}
> 2018-10-12 21:43:58,833 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.NegativeArraySizeException
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:272)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1007)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:835)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1029)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1062)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1085){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-434) Fix incorrect documentation for Date/Timestamp ColumnStatistics

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680528#comment-16680528
 ] 

ASF GitHub Bot commented on ORC-434:


wgtmac opened a new pull request #337: ORC-434: Fix incorrect documentation for 
Date/Timestamp ColumnStatistics
URL: https://github.com/apache/orc/pull/337
 
 
   The old specification says Date and Timestamp ColumnStatistics store the
difference from the epoch 1/1/2015. This is incorrect and should be changed to
the UNIX epoch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix incorrect documentation for Date/Timestamp ColumnStatistics
> ---
>
> Key: ORC-434
> URL: https://issues.apache.org/jira/browse/ORC-434
> Project: ORC
>  Issue Type: Bug
>  Components: documentation
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> In the documentation of the ORC specification, the following two statements
> are wrong:
>  * Date columns record the minimum and maximum values as the number of days
> since the epoch (1/1/2015).
>  * Timestamp columns record the minimum and maximum values as the number of
> milliseconds since the epoch (1/1/2015).
> The correct epoch is the UNIX epoch, which is January 1, 1970 UTC.
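To make the corrected semantics concrete, a small sketch of the day count the statistics record under the fix: days since 1970-01-01. The formula is the standard civil-calendar conversion (Howard Hinnant's days_from_civil), used here purely for exposition; it is not code from the ORC spec.

{code:cpp}
#include <cstdint>

// Days from 1970-01-01 for a proleptic Gregorian date;
// daysSinceUnixEpoch(1970, 1, 1) == 0.
int64_t daysSinceUnixEpoch(int64_t y, unsigned m, unsigned d) {
  y -= m <= 2;
  const int64_t era = (y >= 0 ? y : y - 399) / 400;
  const unsigned yoe = static_cast<unsigned>(y - era * 400);            // [0, 399]
  const unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // [0, 365]
  const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // [0, 146096]
  return era * 146097 + static_cast<int64_t>(doe) - 719468;
}
{code}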



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680313#comment-16680313
 ] 

ASF GitHub Bot commented on ORC-420:


xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r231735751
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,137 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+void getEntriesInInsertionOrder(std::vector<const DictEntry *>&) const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+
+// use friend class here to avoid being bothered by const function calls
+friend class StringColumnWriter;
+friend class CharColumnWriter;
+friend class VarCharColumnWriter;
+// store indexes of insertion order in the dictionary for not-null rows
+std::vector<int64_t> idxInDictBuffer;
 
 Review comment:
   Does it have to be int64? Or is uint32 enough?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680414#comment-16680414
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r232066309
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
+  // update dictionary entry to link pointer to internal storage
+  DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+  entry->data = data.back().data();
+  totalLength += len;
+}
+return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+  dataStream->write(it->first.data, it->first.length);
+  lengthEncoder->write(static_cast<int64_t>(it->first.length));
+}
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   As we are not bumping the writer version, I think keeping it sorted is
better for now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680415#comment-16680415
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r232066028
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,137 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+void getEntriesInInsertionOrder(std::vector<const DictEntry *>&) const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+
+// use friend class here to avoid being bothered by const function calls
+friend class StringColumnWriter;
+friend class CharColumnWriter;
+friend class VarCharColumnWriter;
+// store indexes of insertion order in the dictionary for not-null rows
+std::vector<int64_t> idxInDictBuffer;
 
 Review comment:
   RleEncoder only has a streaming write interface for an int64_t* buffer.
Using int64_t here avoids copying the buffer for RleEncoder.
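To make the tradeoff concrete, a minimal sketch under stated assumptions. RleEncoderLike is a stand-in for the real RleEncoder interface (c++/src/RLE.hh); the point is that an int64_t index buffer streams to the encoder in one call, while uint32_t indexes would first need a widening copy into a temporary.

{code:cpp}
#include <cstdint>
#include <vector>

// Stand-in for the encoder interface under discussion; the real RleEncoder
// streams values from an int64_t* buffer.
struct RleEncoderLike {
  virtual void add(const int64_t* data, uint64_t numValues,
                   const char* notNull) = 0;
  virtual ~RleEncoderLike() = default;
};

// idxInDictBuffer is already std::vector<int64_t>, so it is handed to the
// encoder directly, with no per-element widening into a temporary buffer.
void flushIndexes(RleEncoderLike& encoder,
                  const std::vector<int64_t>& idxInDictBuffer) {
  encoder.add(idxInDictBuffer.data(), idxInDictBuffer.size(), nullptr);
}
{code}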


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680413#comment-16680413
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r232065012
 
 

 ##
 File path: c++/include/orc/Writer.hh
 ##
 @@ -185,6 +185,12 @@ namespace orc {
  * @return if not set, the default is false
  */
 bool getEnableIndex() const;
+
+/**
+ * Get whether or not to enable dictionary encoding
+ * @return if not set, the default is false
+ */
+bool getEnableDictionary() const;
 
 Review comment:
   Yup, we can do it in a separate change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680311#comment-16680311
 ] 

ASF GitHub Bot commented on ORC-420:


xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r232039365
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
+  // update dictionary entry to link pointer to internal storage
+  DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+  entry->data = data.back().data();
+  totalLength += len;
+}
+return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+  dataStream->write(it->first.data, it->first.length);
+  lengthEncoder->write(static_cast<int64_t>(it->first.length));
+}
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   This reordering really complicates things a lot. As discussed, we are moving 
forward to make sorted dictionary optional. If that's the case, I am not sure 
if we want to stick to a sorted dictionary for now. The current reader does not 
rely on, or take advantage of, the sorting property anyway.
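For readers following the thread, a minimal sketch of what the reorder step has to do, simplified to plain std::string keys instead of DictEntry (so this is not the PR's code): row indexes recorded in insertion order are remapped to the sorted order in which the dictionary is flushed.

{code:cpp}
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Remap row indexes from insertion order to sorted-dictionary order.
// dict maps each key to the order in which it was first inserted.
void reorder(const std::map<std::string, size_t>& dict,
             std::vector<int64_t>& idxBuffer) {
  // mapping[insertionOrder] = rank of the key in sorted order
  std::vector<size_t> mapping(dict.size());
  size_t sortedIdx = 0;
  for (const auto& entry : dict) {  // std::map iterates in sorted key order
    mapping[entry.second] = sortedIdx++;
  }
  for (int64_t& idx : idxBuffer) {
    idx = static_cast<int64_t>(mapping[static_cast<size_t>(idx)]);
  }
}
{code}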


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680314#comment-16680314
 ] 

ASF GitHub Bot commented on ORC-420:


xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r231720402
 
 

 ##
 File path: c++/include/orc/Writer.hh
 ##
 @@ -185,6 +185,12 @@ namespace orc {
  * @return if not set, the default is false
  */
 bool getEnableIndex() const;
+
+/**
+ * Get whether or not to enable dictionary encoding
+ * @return if not set, the default is false
+ */
+bool getEnableDictionary() const;
 
 Review comment:
   Do you want to support per-column dictionary encoding? I am fine if it's
done in a follow-up change.
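For orientation, a hedged sketch of how the option in this diff would be used. The setEnableDictionary setter is assumed as the counterpart of the getEnableDictionary getter shown above, and per-column control, as raised here, would need a richer API than a single boolean.

{code:cpp}
#include "orc/Writer.hh"  // header assumed from the diff above

orc::WriterOptions buildOptions() {
  orc::WriterOptions options;
  // Assumed setter matching the getEnableDictionary() getter in the diff;
  // a per-column variant might instead take a set of column ids.
  options.setEnableDictionary(true);
  return options;
}
{code}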


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680312#comment-16680312
 ] 

ASF GitHub Bot commented on ORC-420:


xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r231721737
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
 
 Review comment:
   std::string can contain null terminators. It has a constructor that allows
you to specify a pointer and a buffer length.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680310#comment-16680310
 ] 

ASF GitHub Bot commented on ORC-420:


xndai commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r232021548
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
 
 Review comment:
   Again I think this is to work around strings that contain null terminators
'\0'.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680227#comment-16680227
 ] 

ASF GitHub Bot commented on ORC-363:


wgtmac commented on issue #306: ORC-363: Enable zstd for java writer/reader
URL: https://github.com/apache/orc/pull/306#issuecomment-437125329
 
 
   I have successfully built hadoop v2.9.1 with zstd native support in an
Ubuntu 18.10 docker image and used the Java ORC tool to read an ORC file
compressed with ZSTD. The result is as below:
   
   > root@80d6bc18fa17:~/work/orc/java# java 
-Djava.library.path=$HADOOP_HOME/lib/native -cp 
$ORC_CLASSPATH:$HADOOP_CLASSPATH:$M2_HOME/io/airlift/aircompressor/0.10/aircompressor-0.10.jar
 org.apache.orc.tools.Driver meta /tmp/orc/2.orc
   > SLF4J: Class path contains multiple SLF4J bindings.
   > SLF4J: Found binding in 
[jar:file:/root/work/hadoop/hadoop-dist/target/hadoop-2.9.1/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   > SLF4J: Found binding in 
[jar:file:/root/work/hadoop/hadoop-dist/target/hadoop-2.9.1/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   > SLF4J: Found binding in 
[jar:file:/root/work/hadoop/hadoop-dist/target/hadoop-2.9.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
explanation.
   > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
   > Processing data file /tmp/orc/2.orc [length: 159282853]
   > 2018-11-08 18:34:25,739 INFO  [main] impl.OrcCodecPool 
(OrcCodecPool.java:getCodec(56)) - Got brand-new codec ZSTD
   > Structure for /tmp/orc/2.orc
   > File Version: 0.11 with FUTURE
   > 2018-11-08 18:34:25,852 INFO  [main] impl.ReaderImpl 
(ReaderImpl.java:rows(567)) - Reading ORC rows from /tmp/orc/2.orc with 
{include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: 
true}
   > 2018-11-08 18:34:25,868 INFO  [main] impl.RecordReaderImpl 
(RecordReaderImpl.java:<init>(184)) - Reader schema not provided -- using file 
schema 
struct
   > Rows: 215138
   > Compression: ZSTD
   > Compression size: 1048576
   > Type: 
struct
   > 
   > Stripe Statistics:
   > 2018-11-08 18:34:26,747 INFO  [main] impl.OrcCodecPool 
(OrcCodecPool.java:getCodec(56)) - Got brand-new codec ZSTD
   >   Stripe 1:
   > Column 0: count: 178024 hasNull: false
   > Column 1: count: 178024 hasNull: false
   > Column 2: count: 29945 hasNull: true
   > Column 3: count: 178024 hasNull: false
   > Column 4: count: 178024 hasNull: false
   > Column 5: count: 163643 hasNull: true
   >   Stripe 2:
   >  // remaining part has been omitted
   
   @omalley Can you take a look again? We expect this to ship in the 1.6
release. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable zstd decompression in ORC Java reader
> 
>
> Key: ORC-363
> URL: https://issues.apache.org/jira/browse/ORC-363
> Project: ORC
>  Issue Type: Bug
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
>
> Update to aircompress lib 0.11 and enable zstd decompression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-11-08 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680222#comment-16680222
 ] 

ASF GitHub Bot commented on ORC-363:


wgtmac removed a comment on issue #306: ORC-363: Enable zstd for java 
writer/reader
URL: https://github.com/apache/orc/pull/306#issuecomment-422546740
 
 
   @omalley I have changed this PR per your request. Please take a look again 
when you have time. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable zstd decompression in ORC Java reader
> 
>
> Key: ORC-363
> URL: https://issues.apache.org/jira/browse/ORC-363
> Project: ORC
>  Issue Type: Bug
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
>
> Update to aircompress lib 0.11 and enable zstd decompression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678850#comment-16678850
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r228390402
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
 
 Review comment:
   Yes, I can change std::vector<char> to std::string once your review is done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678836#comment-16678836
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r231691975
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
 
 Review comment:
   std::string does not work here because its constructor terminates copying
when '\0' is met. This is caught by my test cases, so I decided to keep the
old implementation.
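As a neutral illustration of the constructor difference at issue (a standalone sketch, not the PR's test case): the single-argument std::string constructor stops at the first '\0', while the (pointer, length) overload copies the full range, embedded nulls included.

{code:cpp}
#include <cassert>
#include <string>

int main() {
  const char buf[] = {'a', '\0', 'b'};

  std::string s1(buf);     // treated as a C string: stops at the '\0'
  assert(s1.size() == 1);

  std::string s2(buf, 3);  // (pointer, length): keeps the embedded '\0'
  assert(s2.size() == 3);
  return 0;
}
{code}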


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678847#comment-16678847
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on issue #326: ORC-420: [C++] Implement string dictionary 
encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#issuecomment-436794286
 
 
   @majetideepak Please review my latest commit. Thanks a lot!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-11-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677015#comment-16677015
 ] 

ASF GitHub Bot commented on ORC-363:


xndai edited a comment on issue #268: ORC-363 Enable zstd decompression in ORC 
Java reader
URL: https://github.com/apache/orc/pull/268#issuecomment-436329540
 
 
   Close this PR since @wgtmac is working on this through ORC-363.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable zstd decompression in ORC Java reader
> 
>
> Key: ORC-363
> URL: https://issues.apache.org/jira/browse/ORC-363
> Project: ORC
>  Issue Type: Bug
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
>
> Update to aircompress lib 0.11 and enable zstd decompression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-11-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677014#comment-16677014
 ] 

ASF GitHub Bot commented on ORC-363:


xndai closed pull request #268: ORC-363 Enable zstd decompression in ORC Java 
reader
URL: https://github.com/apache/orc/pull/268
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/core/src/java/org/apache/orc/CompressionKind.java 
b/java/core/src/java/org/apache/orc/CompressionKind.java
index 3cffe57ee9..4a1cd5c883 100644
--- a/java/core/src/java/org/apache/orc/CompressionKind.java
+++ b/java/core/src/java/org/apache/orc/CompressionKind.java
@@ -23,5 +23,5 @@
  * can be applied to ORC files.
  */
 public enum CompressionKind {
-  NONE, ZLIB, SNAPPY, LZO, LZ4
+  NONE, ZLIB, SNAPPY, LZO, LZ4, ZSTD
 }
diff --git a/java/core/src/java/org/apache/orc/OrcFile.java 
b/java/core/src/java/org/apache/orc/OrcFile.java
index b07355a970..6edb1e7c8d 100644
--- a/java/core/src/java/org/apache/orc/OrcFile.java
+++ b/java/core/src/java/org/apache/orc/OrcFile.java
@@ -423,6 +423,11 @@ protected WriterOptions(Properties tableProperties, 
Configuration conf) {
   compressValue =
   CompressionKind.valueOf(OrcConf.COMPRESS.getString(tableProperties,
   conf).toUpperCase());
+  // Zstd compression is not supported currently
+  // Will enable this once it's released through aircompressor lib
+  if (compressValue == CompressionKind.ZSTD) {
+  throw new IllegalArgumentException("Zstd compressor is not 
supported.");
+  }
   enforceBufferSize = 
OrcConf.ENFORCE_COMPRESSION_BUFFER_SIZE.getBoolean(tableProperties, conf);
   String versionName = OrcConf.WRITE_FORMAT.getString(tableProperties,
   conf);
@@ -581,6 +586,11 @@ public WriterOptions bloomFilterFpp(double fpp) {
  * Sets the generic compression that is used to compress the data.
  */
 public WriterOptions compress(CompressionKind value) {
+  // Zstd compression is not supported currently
+  // Will enable this once it's released through aircompressor lib
+  if (value == CompressionKind.ZSTD) {
+  throw new IllegalArgumentException("Zstd compressor is not 
supported.");
+  }
   compressValue = value;
   return this;
 }
diff --git a/java/core/src/java/org/apache/orc/impl/ReaderImpl.java 
b/java/core/src/java/org/apache/orc/impl/ReaderImpl.java
index bba580faee..2517daaaba 100644
--- a/java/core/src/java/org/apache/orc/impl/ReaderImpl.java
+++ b/java/core/src/java/org/apache/orc/impl/ReaderImpl.java
@@ -443,6 +443,7 @@ public ReaderImpl(Path path, OrcFile.ReaderOptions options) 
throws IOException {
   case SNAPPY:
   case LZO:
   case LZ4:
+  case ZSTD:
 break;
   default:
 throw new IllegalArgumentException("Unknown compression");
diff --git a/java/core/src/java/org/apache/orc/impl/WriterImpl.java 
b/java/core/src/java/org/apache/orc/impl/WriterImpl.java
index d6239f2f36..f1e530f8e3 100644
--- a/java/core/src/java/org/apache/orc/impl/WriterImpl.java
+++ b/java/core/src/java/org/apache/orc/impl/WriterImpl.java
@@ -31,6 +31,7 @@
 import io.airlift.compress.lz4.Lz4Decompressor;
 import io.airlift.compress.lzo.LzoCompressor;
 import io.airlift.compress.lzo.LzoDecompressor;
+import io.airlift.compress.zstd.ZstdDecompressor;
 import org.apache.orc.ColumnStatistics;
 import org.apache.orc.CompressionCodec;
 import org.apache.orc.CompressionKind;
@@ -241,6 +242,10 @@ public static CompressionCodec createCodec(CompressionKind 
kind) {
   case LZ4:
 return new AircompressorCodec(new Lz4Compressor(),
 new Lz4Decompressor());
+  case ZSTD:
+// Zstd compressor is not available currently
+// Will add it back after it's released
+return new AircompressorCodec(null, new ZstdDecompressor());
   default:
 throw new IllegalArgumentException("Unknown compression codec: " +
 kind);
diff --git a/java/core/src/test/org/apache/orc/TestVectorOrcFile.java 
b/java/core/src/test/org/apache/orc/TestVectorOrcFile.java
index 658c1cea71..67d01d141a 100644
--- a/java/core/src/test/org/apache/orc/TestVectorOrcFile.java
+++ b/java/core/src/test/org/apache/orc/TestVectorOrcFile.java
@@ -371,6 +371,49 @@ public void testReadFormat_0_11() throws Exception {
 rows.close();
   }
 
+  @Test
+  public void testReadZstd() throws Exception {
+Path filePath =
+new Path(getFileFromClasspath("orc-file-zstd.orc"));
+Reader reader = OrcFile.createReader(filePath,
+OrcFile.readerOptions(conf).filesystem(fs));
+
+int stripeCount = 0;
+int rowCount = 0;
+long currentOffset = -1;
+

[jira] [Commented] (ORC-363) Enable zstd decompression in ORC Java reader

2018-11-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677013#comment-16677013
 ] 

ASF GitHub Bot commented on ORC-363:


xndai commented on issue #268: ORC-363 Enable zstd decompression in ORC Java 
reader
URL: https://github.com/apache/orc/pull/268#issuecomment-436329540
 
 
   Close this PR since @wgtmac is working on this through ORC-136.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable zstd decompression in ORC Java reader
> 
>
> Key: ORC-363
> URL: https://issues.apache.org/jira/browse/ORC-363
> Project: ORC
>  Issue Type: Bug
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
>
> Update to aircompress lib 0.11 and enable zstd decompression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-395) [C++] Support ZSTD compression & decompression

2018-11-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676511#comment-16676511
 ] 

ASF GitHub Bot commented on ORC-395:


majetideepak commented on issue #301: ORC-395: Support ZSTD in C++ writer/reader
URL: https://github.com/apache/orc/pull/301#issuecomment-436202457
 
 
   +1 LGTM


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Support ZSTD compression & decompression
> --
>
> Key: ORC-395
> URL: https://issues.apache.org/jira/browse/ORC-395
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-395) [C++] Support ZSTD compression & decompression

2018-11-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675933#comment-16675933
 ] 

ASF GitHub Bot commented on ORC-395:


xndai commented on issue #301: ORC-395: Support ZSTD in C++ writer/reader
URL: https://github.com/apache/orc/pull/301#issuecomment-436089942
 
 
   Hi @majetideepak, do you have any other feedback? If not, I plan to commit
this change. Thx.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Support ZSTD compression & decompression
> --
>
> Key: ORC-395
> URL: https://issues.apache.org/jira/browse/ORC-395
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-433) Suppress the downloading messages from Maven that make travis CI crash

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673819#comment-16673819
 ] 

ASF GitHub Bot commented on ORC-433:


omalley closed pull request #336: ORC-433 Suppress the downloading messages 
from the maven build.
URL: https://github.com/apache/orc/pull/336
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/java/CMakeLists.txt b/java/CMakeLists.txt
index e82898cc4d..3cfbe50705 100644
--- a/java/CMakeLists.txt
+++ b/java/CMakeLists.txt
@@ -10,9 +10,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+set(NO_DOWNLOAD_MSG
+  --batch-mode
+  -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn)
+
 # set the version in the POM file to match the CMake version string
 execute_process(COMMAND mvn versions:set -DnewVersion=${ORC_VERSION}
  -DgenerateBackupPoms=false
+ ${NO_DOWNLOAD_MSG}
 WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
 
 set(ORC_JARS
@@ -30,7 +35,8 @@ endif()
 
 add_custom_command(
OUTPUT ${ORC_JARS}
-   COMMAND mvn ${JAVA_PROFILE} -Dbuild.dir=${CMAKE_CURRENT_BINARY_DIR} 
-DskipTests package
+   COMMAND mvn ${NO_DOWNLOAD_MSG} ${JAVA_PROFILE}
+ -Dbuild.dir=${CMAKE_CURRENT_BINARY_DIR} -DskipTests package
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
COMMENT "Build the java directory"
VERBATIM)
@@ -39,7 +45,8 @@ add_custom_target(java_build ALL DEPENDS ${ORC_JARS})
 
 add_test(
   NAME java-test
-  COMMAND mvn -Pcmake -Dbuild.dir=${CMAKE_CURRENT_BINARY_DIR} test
+  COMMAND mvn ${NO_DOWNLOAD_MSG} -Pcmake
+   -Dbuild.dir=${CMAKE_CURRENT_BINARY_DIR} test
   WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
 
 install(


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Suppress the downloading messages from Maven that make travis CI crash
> --
>
> Key: ORC-433
> URL: https://issues.apache.org/jira/browse/ORC-433
> Project: ORC
>  Issue Type: Bug
>  Components: build
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> Currently in the travis CI builds, we often have jobs killed for producing
> too much log output. A lot of that is consumed by the messages from Maven
> about downloading poms and jars.
> We can suppress those log messages from the CMake builds to resolve the
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673737#comment-16673737
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r230520702
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -850,27 +977,76 @@ namespace orc {
 
 virtual void recordPosition() const override;
 
+virtual void createRowIndexEntry() override;
+
+virtual void writeDictionary() override;
+
+virtual void reset() override;
+
+  private:
+/**
+ * dictionary related functions
+ */
+bool checkDictionaryKeyRatio();
+void createDirectStreams();
+void createDictStreams();
+void deleteDictStreams();
+void fallbackToDirectEncoding();
+
   protected:
-std::unique_ptr<RleEncoder> lengthEncoder;
-std::unique_ptr<AppendOnlyBufferedStream> dataStream;
 RleVersion rleVersion;
+bool useCompression;
+const StreamsFactory& streamsFactory;
+bool alignedBitPacking;
+
+// direct encoding streams
+std::unique_ptr<RleEncoder> directLengthEncoder;
+std::unique_ptr<AppendOnlyBufferedStream> directDataStream;
+
+// dictionary encoding streams
+std::unique_ptr<RleEncoder> dictDataEncoder;
+std::unique_ptr<RleEncoder> dictLengthEncoder;
+std::unique_ptr<AppendOnlyBufferedStream> dictStream;
+
+/**
+ * dictionary related variables
+ */
+StringDictionary dictionary;
+// whether or not dictionary checking is done
+bool doneDictionaryCheck;
+// whether or not it should be used
+bool useDictionary;
+// keys in the dictionary should not exceed this ratio
+double dictSizeThreshold;
+// record index of insertion order in the dictionary for not-null rows
+std::vector<int64_t> idxInDictBuffer;
 
 Review comment:
   Sure, I will try to move as many of these variables as I can into
StringDictionary.
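For context on the checkDictionaryKeyRatio and fallbackToDirectEncoding members listed above, a minimal sketch of the decision they implement. The names and the exact formula here are illustrative, not lifted from the PR: dictionary encoding is kept only while distinct keys stay below a threshold fraction of the non-null values.

{code:cpp}
#include <cstddef>

// Keep dictionary encoding only while distinct keys are a small enough
// fraction of the non-null values seen so far; otherwise fall back to
// direct encoding, since a near-unique dictionary saves nothing.
bool shouldUseDictionary(size_t distinctKeys, size_t nonNullValues,
                         double dictSizeThreshold) {
  if (nonNullValues == 0) {
    return true;  // nothing observed yet; keep the dictionary open
  }
  return static_cast<double>(distinctKeys) /
             static_cast<double>(nonNullValues) <= dictSizeThreshold;
}
{code}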


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673736#comment-16673736
 ] 

ASF GitHub Bot commented on ORC-420:


wgtmac commented on a change in pull request #326: ORC-420: [C++] Implement 
string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r230520977
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
+  // update dictionary entry to link pointer to internal storage
+  DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+  entry->data = data.back().data();
+  totalLength += len;
+}
+return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+  dataStream->write(it->first.data, it->first.length);
+  lengthEncoder->write(static_cast<int64_t>(it->first.length));
+}
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   Yes. I will add more comments here. Thanks for your review!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-432) openjdk 8 has a bug that prevents surefire from working

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673734#comment-16673734
 ] 

ASF GitHub Bot commented on ORC-432:


omalley closed pull request #335: ORC-432. Work around openjdk8 bug that causes 
failures in surefire plugin
URL: https://github.com/apache/orc/pull/335
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/docker/run-all.sh b/docker/run-all.sh
index c781e9bcfb..2dee97cf3a 100755
--- a/docker/run-all.sh
+++ b/docker/run-all.sh
@@ -39,10 +39,6 @@ for os in `cat os-list.txt`; do
   centos7|debian8|ubuntu14)
  OPTS="-DSNAPPY_HOME=/usr/local"
  ;;
-  debian9)
- # there is a bug in debian 9 that causes surefire to fail
- OPTS="-DBUILD_JAVA=OFF"
- ;;
   *)
  OPTS=""
  ;;
diff --git a/java/pom.xml b/java/pom.xml
index 204bc79a25..0f930b1039 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -156,6 +156,7 @@
 US/Pacific
 en_US.UTF-8
   
+  <useSystemClassLoader>false</useSystemClassLoader>
   false
   
 ${test.tmp.dir}


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> openjdk 8 has a bug that prevents surefire from working
> ---
>
> Key: ORC-432
> URL: https://issues.apache.org/jira/browse/ORC-432
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> It looks like the problem is 
> https://bugs.openjdk.java.net/browse/JDK-8030046. It looks like:
> {code:bash}
> [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> [ERROR] Command was /bin/sh -c cd /root/orc/java/shims && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m -jar 
> /root/orc/java/shims/target/surefire/surefirebooter4015168140687556977.jar 
> /root/orc/java/shims/target/surefire 2018-11-02T12-32-24_319-jvmRun1 
> surefire8218006047690391850tmp surefire_0416529152079754tmp
> {code}
> The surefire-reports/*.dumpstream looks like:
> {code:bash}
> Error: Could not find or load main class 
> org.apache.maven.surefire.booter.ForkedBooter
> {code}
>  and we can work around the problem by changing the surefire configuration:
> {code:bash}
> +  <useSystemClassLoader>false</useSystemClassLoader>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-433) Suppress the downloading messages from Maven that make travis CI crash

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673731#comment-16673731
 ] 

ASF GitHub Bot commented on ORC-433:


omalley opened a new pull request #336: ORC-433 Suppress the downloading 
messages from the maven build.
URL: https://github.com/apache/orc/pull/336
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Suppress the downloading messages from Maven that make travis CI crash
> --
>
> Key: ORC-433
> URL: https://issues.apache.org/jira/browse/ORC-433
> Project: ORC
>  Issue Type: Bug
>  Components: build
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> Currently in the travis CI builds, we often have jobs killed by producing too 
> much log. A lot of that is consumed by the messages from Maven about 
> downloading poms and jars.
> We can suppress those log messages from the CMake builds to resolve the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-432) openjdk 8 has a bug that prevents surefire from working

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673599#comment-16673599
 ] 

ASF GitHub Bot commented on ORC-432:


omalley opened a new pull request #335: ORC-432. Work around openjdk8 bug that 
causes failures in surefire plugin
URL: https://github.com/apache/orc/pull/335
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> openjdk 8 has a bug that prevents surefire from working
> ---
>
> Key: ORC-432
> URL: https://issues.apache.org/jira/browse/ORC-432
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> It looks like the problem is 
> https://bugs.openjdk.java.net/browse/JDK-8030046. It looks like:
> {code:bash}
> [ERROR] Caused by: 
> org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
> terminated without properly saying goodbye. VM crash or System.exit called?
> [ERROR] Command was /bin/sh -c cd /root/orc/java/shims && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m -jar 
> /root/orc/java/shims/target/surefire/surefirebooter4015168140687556977.jar 
> /root/orc/java/shims/target/surefire 2018-11-02T12-32-24_319-jvmRun1 
> surefire8218006047690391850tmp surefire_0416529152079754tmp
> {code}
> The surefire-reports/*.dumpstream looks like:
> {code:bash}
> Error: Could not find or load main class 
> org.apache.maven.surefire.booter.ForkedBooter
> {code}
>  and we can work around the problem by changing the surefire configuration:
> {code:bash}
> +  <useSystemClassLoader>false</useSystemClassLoader>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-429) Refactor code in TypeImpl.cc

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673570#comment-16673570
 ] 

ASF GitHub Bot commented on ORC-429:


fangzheng commented on issue #333: ORC-429: [C++] Refactor code in TypeImpl.cc
URL: https://github.com/apache/orc/pull/333#issuecomment-435480711
 
 
   Thanks, Xiening!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor code in TypeImpl.cc
> 
>
> Key: ORC-429
> URL: https://issues.apache.org/jira/browse/ORC-429
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> Propose to make two changes to the code in TypeImpl.cc
>  
> 1. In convertType() function: in the case of proto::Type_Kind_STRUCT, two 
> vectors are created but never used. They shall be removed.
> 2. In TypeImpl::parseType() function: the function calls input.substr() to 
> copy the substring before parsing it. This string copy can be avoided by 
> directly parsing on the input string. Please see pull request for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-431) Fix typo in exception message and simplify code logic

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673566#comment-16673566
 ] 

ASF GitHub Bot commented on ORC-431:


xndai closed pull request #334: ORC-431: [C++] Fix typo in exception message 
and simplify code logic.
URL: https://github.com/apache/orc/pull/334
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/Writer.cc b/c++/src/Writer.cc
index 621876927b..9a7a777cc2 100644
--- a/c++/src/Writer.cc
+++ b/c++/src/Writer.cc
@@ -135,7 +135,7 @@ namespace orc {
   privateBits->fileVersion = version;
   return *this;
 }
-throw std::logic_error("Unpoorted file version specified.");
+throw std::logic_error("Unsupported file version specified.");
   }
 
   FileVersion WriterOptions::getFileVersion() const {
@@ -571,7 +571,8 @@ namespace orc {
 *footer.add_types() = protoType;
 
 for (uint64_t i = 0; i < t.getSubtypeCount(); ++i) {
-  if (t.getKind() != LIST && t.getKind() != MAP && t.getKind() != UNION) {
+  // only add subtypes' field names if this type is STRUCT
+  if (t.getKind() == STRUCT) {
 footer.mutable_types(pos)->add_fieldnames(t.getFieldName(i));
   }
   footer.mutable_types(pos)->add_subtypes(++index);
diff --git a/c++/src/io/OutputStream.cc b/c++/src/io/OutputStream.cc
index be8ea4f85c..fd71c4b86f 100644
--- a/c++/src/io/OutputStream.cc
+++ b/c++/src/io/OutputStream.cc
@@ -51,7 +51,7 @@ namespace orc {
   newCapacity += dataBuffer->capacity();
 }
 dataBuffer->reserve(newCapacity);
-dataBuffer->resize(dataBuffer->size() + blockSize);
+dataBuffer->resize(newSize);
 *buffer = dataBuffer->data() + oldSize;
 return true;
   }
@@ -92,7 +92,7 @@ namespace orc {
 
   uint64_t BufferedOutputStream::flush() {
 uint64_t dataSize = dataBuffer->size();
-outputStream->write(dataBuffer->data(), dataBuffer->size());
+outputStream->write(dataBuffer->data(), dataSize);
 dataBuffer->resize(0);
 return dataSize;
   }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix typo in exception message and simplify code logic
> -
>
> Key: ORC-431
> URL: https://issues.apache.org/jira/browse/ORC-431
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fix typo in the exception message in WriterOptions::setFileVersion(): 
> "Unpoorted" should be "Unsupported".
> 2. Simplify some code in Writer.cc and OutputStream.cc to avoid 
> re-computation.
> 3. Simplify a condition check in WriterImpl::buildFooterType().
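
The "avoid re-computation" item is visible in the BufferedOutputStream::flush() hunk above: the size is read once into dataSize and reused for both the write and the return value. A standalone sketch of why the ordering matters (illustrative only, using a plain std::vector rather than the project's buffer class):

{code:cpp}
#include <cstdint>
#include <cstdio>
#include <vector>

static std::vector<char> dataBuffer;

uint64_t flushBuffer() {
  uint64_t dataSize = dataBuffer.size();               // captured once
  std::fwrite(dataBuffer.data(), 1, dataSize, stdout);
  dataBuffer.clear();                                  // size() is 0 from here,
  return dataSize;                                     // so the capture matters
}

int main() {
  dataBuffer = {'o', 'r', 'c', '\n'};
  return flushBuffer() == 4 ? 0 : 1;
}
{code}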



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-429) Refactor code in TypeImpl.cc

2018-11-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673552#comment-16673552
 ] 

ASF GitHub Bot commented on ORC-429:


xndai closed pull request #333: ORC-429: [C++] Refactor code in TypeImpl.cc
URL: https://github.com/apache/orc/pull/333
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/TypeImpl.cc b/c++/src/TypeImpl.cc
index 209ffcdfb9..d1d9c6203f 100644
--- a/c++/src/TypeImpl.cc
+++ b/c++/src/TypeImpl.cc
@@ -403,9 +403,6 @@ namespace orc {
 case proto::Type_Kind_STRUCT: {
   TypeImpl* result = new TypeImpl(STRUCT);
  std::unique_ptr<Type> return_value = std::unique_ptr<Type>(result);
-  uint64_t size = static_cast<uint64_t>(type.subtypes_size());
-  std::vector<Type*> typeList(size);
-  std::vector<std::string> fieldList(size);
   for(int i=0; i < type.subtypes_size(); ++i) {
 result->addStructField(type.fieldnames(i),
convertType(footer.types(static_cast
@@ -644,32 +641,31 @@ namespace orc {
const std::string &input,
size_t start,
size_t end) {
-std::string types = input.substr(start, end - start);
 std::vector<std::pair<std::string, std::unique_ptr<Type> > > res;
-size_t pos = 0;
+size_t pos = start;
 
-while (pos < types.size()) {
+while (pos < end) {
   size_t endPos = pos;
-  while (endPos < types.size() && (isalnum(types[endPos]) || types[endPos] 
== '_')) {
+  while (endPos < end && (isalnum(input[endPos]) || input[endPos] == '_')) 
{
 ++endPos;
   }
 
   std::string fieldName;
-  if (types[endPos] == ':') {
-fieldName = types.substr(pos, endPos - pos);
+  if (input[endPos] == ':') {
+fieldName = input.substr(pos, endPos - pos);
 pos = ++endPos;
-while (endPos < types.size() && isalpha(types[endPos])) {
+while (endPos < end && isalpha(input[endPos])) {
   ++endPos;
 }
   }
 
   size_t nextPos = endPos + 1;
-  if (types[endPos] == '<') {
+  if (input[endPos] == '<') {
 int count = 1;
-while (nextPos < types.size()) {
-  if (types[nextPos] == '<') {
+while (nextPos < end) {
+  if (input[nextPos] == '<') {
 ++count;
-  } else if (types[nextPos] == '>') {
+  } else if (input[nextPos] == '>') {
 --count;
   }
   if (count == 0) {
@@ -677,24 +673,24 @@ namespace orc {
   }
   ++nextPos;
 }
-if (nextPos == types.size()) {
+if (nextPos == end) {
   throw std::logic_error("Invalid type string. Cannot find closing >");
 }
-  } else if (types[endPos] == '(') {
-while (nextPos < types.size() && types[nextPos] != ')') {
+  } else if (input[endPos] == '(') {
+while (nextPos < end && input[nextPos] != ')') {
   ++nextPos;
 }
-if (nextPos == types.size()) {
+if (nextPos == end) {
   throw std::logic_error("Invalid type string. Cannot find closing )");
 }
-  } else if (types[endPos] != ',' && types[endPos] != '\0') {
+  } else if (input[endPos] != ',' && endPos != end) {
 throw std::logic_error("Unrecognized character.");
   }
 
-  std::string category = types.substr(pos, endPos - pos);
-  res.push_back(std::make_pair(fieldName, parseCategory(category, types, 
endPos + 1, nextPos)));
+  std::string category = input.substr(pos, endPos - pos);
+  res.push_back(std::make_pair(fieldName, parseCategory(category, input, 
endPos + 1, nextPos)));
 
-  if (nextPos < types.size() && (types[nextPos] == ')' || types[nextPos] 
== '>')) {
+  if (nextPos < end && (input[nextPos] == ')' || input[nextPos] == '>')) {
 pos = nextPos + 2;
   } else {
 pos = nextPos;


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor code in TypeImpl.cc
> 
>
> Key: ORC-429
> URL: https://issues.apache.org/jira/browse/ORC-429
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> Propose to make two changes to the code in TypeImpl.cc
>  
> 1. In convertType() function: in the case of proto::Type_Kind_STRUCT, 

[jira] [Commented] (ORC-431) Fix typo in exception message and simplify code logic

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672344#comment-16672344
 ] 

ASF GitHub Bot commented on ORC-431:


fangzheng opened a new pull request #334: ORC-431: [C++] Fix typo in exception 
message and simplify code logic.
URL: https://github.com/apache/orc/pull/334
 
 
   1. Fix typo in the exception message in WriterOptions::setFileVersion(): 
"Unpoorted" should be "Unsupported".
   2. Simplify some code in Writer.cc and OutputStream.cc to avoid 
re-computation.
   3. Simplify a condition check in WriterImpl::buildFooterType().


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix typo in exception message and simplify code logic
> -
>
> Key: ORC-431
> URL: https://issues.apache.org/jira/browse/ORC-431
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> 1. Fix typo in the exception message in WriterOptions::setFileVersion(): 
> "Unpoorted" should be "Unsupported".
> 2. Simplify some code in Writer.cc and OutputStream.cc to avoid 
> re-computation.
> 3. Simplify a condition check in WriterImpl::buildFooterType().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-429) Refactor code in TypeImpl.cc

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672314#comment-16672314
 ] 

ASF GitHub Bot commented on ORC-429:


xndai commented on issue #333: ORC-429: [C++] Refactor code in TypeImpl.cc
URL: https://github.com/apache/orc/pull/333#issuecomment-435214912
 
 
   Change looks good to me. I will commit this for you if I don't see any other 
feedback.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor code in TypeImpl.cc
> 
>
> Key: ORC-429
> URL: https://issues.apache.org/jira/browse/ORC-429
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> Propose to make two changes to the code in TypeImpl.cc
>  
> 1. In convertType() function: in the case of proto::Type_Kind_STRUCT, two 
> vectors are created but never used. They shall be removed.
> 2. In TypeImpl::parseType() function: the function calls input.substr() to 
> copy the substring before parsing it. This string copy can be avoided by 
> directly parsing on the input string. Please see pull request for details.
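
The parseType() change can be illustrated with a small standalone sketch (simplified; not the actual parser): instead of materializing input.substr(start, end - start) and walking the copy, the parser walks the original string between the [start, end) indexes, which also composes cleanly with recursive calls on nested types.

{code:cpp}
#include <cstddef>
#include <iostream>
#include <string>

// Count comma-separated fields at the top nesting level of input[start, end),
// without creating a temporary substring.
size_t countTopLevelFields(const std::string& input, size_t start, size_t end) {
  size_t fields = 1;
  int depth = 0;
  for (size_t pos = start; pos < end; ++pos) {
    if (input[pos] == '<') {
      ++depth;
    } else if (input[pos] == '>') {
      --depth;
    } else if (input[pos] == ',' && depth == 0) {
      ++fields;
    }
  }
  return fields;
}

int main() {
  std::string type = "struct<a:int,b:map<string,int>>";
  // Scan only the field list between the outer angle brackets.
  std::cout << countTopLevelFields(type, 7, type.size() - 1) << std::endl;  // 2
  return 0;
}
{code}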



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671826#comment-16671826
 ] 

ASF GitHub Bot commented on ORC-418:


omalley commented on a change in pull request #328: ORC-418. Fix broken docker 
builds.
URL: https://github.com/apache/orc/pull/328#discussion_r230104881
 
 

 ##
 File path: docker/centos6/Dockerfile
 ##
 @@ -42,6 +43,22 @@ WORKDIR /root
 RUN wget 
"https://www.apache.org/dyn/closer.lua?action=download=/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz;
 -O maven.tgz
 RUN tar xzf maven.tgz
 RUN ln -s /root/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
+# install a local build of protobuf
+RUN wget "https://github.com/protocolbuffers/protobuf/archive/v2.5.0.tar.gz" \
 
 Review comment:
   So we could use sed or something to do it, but I think it is less confusing 
to hard code it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-11-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671822#comment-16671822
 ] 

ASF GitHub Bot commented on ORC-418:


omalley commented on a change in pull request #328: ORC-418. Fix broken docker 
builds.
URL: https://github.com/apache/orc/pull/328#discussion_r230104600
 
 

 ##
 File path: docker/centos6/Dockerfile
 ##
 @@ -42,6 +43,22 @@ WORKDIR /root
 RUN wget 
"https://www.apache.org/dyn/closer.lua?action=download=/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz;
 -O maven.tgz
 RUN tar xzf maven.tgz
 RUN ln -s /root/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
+# install a local build of protobuf
+RUN wget "https://github.com/protocolbuffers/protobuf/archive/v2.5.0.tar.gz" \
 
 Review comment:
   Dockerfiles by design aren't parameterized. They are designed to be concrete 
steps to build the image. On the plus side, the current linux distros don't 
need the legacy versions of protobuf or snappy.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671040#comment-16671040
 ] 

ASF GitHub Bot commented on ORC-418:


majetideepak commented on issue #328: ORC-418. Fix broken docker builds.
URL: https://github.com/apache/orc/pull/328#issuecomment-434911104
 
 
   Rest of the changes look straightforward to me.
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671039#comment-16671039
 ] 

ASF GitHub Bot commented on ORC-418:


majetideepak commented on a change in pull request #328: ORC-418. Fix broken 
docker builds.
URL: https://github.com/apache/orc/pull/328#discussion_r229923490
 
 

 ##
 File path: docker/centos6/Dockerfile
 ##
 @@ -42,6 +43,22 @@ WORKDIR /root
 RUN wget 
"https://www.apache.org/dyn/closer.lua?action=download=/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz;
 -O maven.tgz
 RUN tar xzf maven.tgz
 RUN ln -s /root/apache-maven-3.3.9/bin/mvn /usr/bin/mvn
+# install a local build of protobuf
+RUN wget "https://github.com/protocolbuffers/protobuf/archive/v2.5.0.tar.gz" \
 
 Review comment:
   Does it make sense to make these version numbers parameters?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-429) Refactor code in TypeImpl.cc

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670876#comment-16670876
 ] 

ASF GitHub Bot commented on ORC-429:


fangzheng opened a new pull request #333: ORC-429: [C++] Refactor code in 
TypeImpl.cc
URL: https://github.com/apache/orc/pull/333
 
 
   1. In convertType() function: in the case of proto::Type_Kind_STRUCT, two 
vectors are created but never used. They shall be removed.
   
   2. In TypeImpl::parseType() function: the function calls input.substr() to 
copy the substring before parsing it. This string copy can be avoided by 
directly parsing on the input string. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor code in TypeImpl.cc
> 
>
> Key: ORC-429
> URL: https://issues.apache.org/jira/browse/ORC-429
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> Propose to make two changes to the code in TypeImpl.cc
>  
> 1. In convertType() function: in the case of proto::Type_Kind_STRUCT, two 
> vectors are created but never used. They shall be removed.
> 2. In TypeImpl::parseType() function: the function calls input.substr() to 
> copy the substring before parsing it. This string copy can be avoided by 
> directly parsing on the input string. Please see pull request for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-428) Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670838#comment-16670838
 ] 

ASF GitHub Bot commented on ORC-428:


fangzheng closed pull request #331: ORC-428: Use ORC_UNIQUE_PTR consistently in 
OrcFile, OrcHdfsFile and …
URL: https://github.com/apache/orc/pull/331
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/src/OrcFile.cc b/c++/src/OrcFile.cc
index 009a8ce971..96640159a6 100644
--- a/c++/src/OrcFile.cc
+++ b/c++/src/OrcFile.cc
@@ -104,8 +104,8 @@ namespace orc {
 #endif
   }
 
-  std::unique_ptr<InputStream> readLocalFile(const std::string& path) {
-  return std::unique_ptr<InputStream>(new FileInputStream(path));
+  ORC_UNIQUE_PTR<InputStream> readLocalFile(const std::string& path) {
+  return ORC_UNIQUE_PTR<InputStream>(new FileInputStream(path));
   }
 
   OutputStream::~OutputStream() {
@@ -176,7 +176,7 @@ namespace orc {
 }
   }
 
-  std::unique_ptr<OutputStream> writeLocalFile(const std::string& path) {
-return std::unique_ptr<OutputStream>(new FileOutputStream(path));
+  ORC_UNIQUE_PTR<OutputStream> writeLocalFile(const std::string& path) {
+return ORC_UNIQUE_PTR<OutputStream>(new FileOutputStream(path));
   }
 }
diff --git a/c++/src/OrcHdfsFile.cc b/c++/src/OrcHdfsFile.cc
index 58027c7a4b..ba093460f0 100644
--- a/c++/src/OrcHdfsFile.cc
+++ b/c++/src/OrcHdfsFile.cc
@@ -167,7 +167,7 @@ namespace orc {
   HdfsFileInputStream::~HdfsFileInputStream() {
   }
 
-  std::unique_ptr<InputStream> readHdfsFile(const std::string& path) {
-return std::unique_ptr<InputStream>(new HdfsFileInputStream(path));
+  ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path) {
+return ORC_UNIQUE_PTR<InputStream>(new HdfsFileInputStream(path));
   }
 }
diff --git a/c++/src/Writer.cc b/c++/src/Writer.cc
index 621876927b..7b9af6f28e 100644
--- a/c++/src/Writer.cc
+++ b/c++/src/Writer.cc
@@ -584,11 +584,11 @@ namespace orc {
 return static_cast(kind);
   }
 
-  std::unique_ptr<Writer> createWriter(
+ORC_UNIQUE_PTR<Writer> createWriter(
const Type& type,
OutputStream* stream,
const WriterOptions& options) {
-return std::unique_ptr<Writer>(
+return ORC_UNIQUE_PTR<Writer>(
new WriterImpl(
 type,
 stream,


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer
> ---
>
> Key: ORC-428
> URL: https://issues.apache.org/jira/browse/ORC-428
> Project: ORC
>  Issue Type: Bug
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> In OrcFile.hh, the declarations of readLocalFile() and four other functions 
> return ORC_UNIQUE_PTR:
> ORC_UNIQUE_PTR<InputStream> readLocalFile(const std::string& path);
> ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path);
> ORC_UNIQUE_PTR<Reader> createReader(ORC_UNIQUE_PTR<InputStream> stream,
>  const ReaderOptions& options);
> ORC_UNIQUE_PTR<Writer> createWriter(const Type& type, OutputStream* stream, 
> const WriterOptions& options);
> However, these functions' definitions all return std::unique_ptr. On a system 
> where ORC_UNIQUE_PTR is not defined as std::unique_ptr but std::auto_ptr, 
> there is inconsistency.
>  
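
For context, ORC_UNIQUE_PTR is a portability shim whose real definition is produced by the build system in a generated config header. A sketch of the usual pattern behind such a macro (an illustration of the mechanism, not the project's verbatim configuration):

{code:cpp}
#include <memory>

// Normally defined (or not) by the build system based on the toolchain.
#define ORC_CXX_HAS_UNIQUE_PTR 1

#ifdef ORC_CXX_HAS_UNIQUE_PTR
  #define ORC_UNIQUE_PTR std::unique_ptr
#else
  #define ORC_UNIQUE_PTR std::auto_ptr  // pre-C++11 fallback
#endif

// Using the macro in both declaration and definition keeps the two
// consistent even when the macro does not expand to std::unique_ptr.
ORC_UNIQUE_PTR<int> makeBox(int v) {
  return ORC_UNIQUE_PTR<int>(new int(v));
}

int main() {
  return *makeBox(7) == 7 ? 0 : 1;
}
{code}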



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-428) Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer

2018-10-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670837#comment-16670837
 ] 

ASF GitHub Bot commented on ORC-428:


fangzheng commented on issue #331: ORC-428: Use ORC_UNIQUE_PTR consistently in 
OrcFile, OrcHdfsFile and …
URL: https://github.com/apache/orc/pull/331#issuecomment-434867689
 
 
   I am cancelling this PR, as Gang and Owen explained how the macro and 
unique_ptr are handled in the dev mailing list 
(https://mail-archives.apache.org/mod_mbox/orc-dev/201810.mbox/%3cCAHfHakEanDzf=03fcpyxnqfanebr9xqz-5zfwod9vyk_vvb...@mail.gmail.com%3e).
 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer
> ---
>
> Key: ORC-428
> URL: https://issues.apache.org/jira/browse/ORC-428
> Project: ORC
>  Issue Type: Bug
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> In OrcFile.hh, the declarations of readLocalFile() and four other functions 
> return ORC_UNIQUE_PTR:
> ORC_UNIQUE_PTR<InputStream> readLocalFile(const std::string& path);
> ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path);
> ORC_UNIQUE_PTR<Reader> createReader(ORC_UNIQUE_PTR<InputStream> stream,
>  const ReaderOptions& options);
> ORC_UNIQUE_PTR<Writer> createWriter(const Type& type, OutputStream* stream, 
> const WriterOptions& options);
> However, these functions' definitions all return std::unique_ptr. On a system 
> where ORC_UNIQUE_PTR is not defined as std::unique_ptr but std::auto_ptr, 
> there is inconsistency.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-423) Add docker scripts and update website for debian 9 and ubuntu 18

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669420#comment-16669420
 ] 

ASF GitHub Bot commented on ORC-423:


omalley opened a new pull request #332: ORC-423 Add debian 9 and ubuntu 18 
docker files
URL: https://github.com/apache/orc/pull/332
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add docker scripts and update website for debian 9 and ubuntu 18
> 
>
> Key: ORC-423
> URL: https://issues.apache.org/jira/browse/ORC-423
> Project: ORC
>  Issue Type: Bug
>  Components: build, site
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We should add docker build scripts for the new versions of debian and ubuntu.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-428) Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669408#comment-16669408
 ] 

ASF GitHub Bot commented on ORC-428:


fangzheng opened a new pull request #331: ORC-428: Use ORC_UNIQUE_PTR 
consistently in OrcFile, OrcHdfsFile and …
URL: https://github.com/apache/orc/pull/331
 
 
   …createWriter().


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use ORC_UNIQUE_PTR consistently in OrcFile, OrcHdfsFile, and Writer
> ---
>
> Key: ORC-428
> URL: https://issues.apache.org/jira/browse/ORC-428
> Project: ORC
>  Issue Type: Bug
>  Components: C++
>Reporter: Fang Zheng
>Priority: Minor
>
> In OrcFile.hh, the declarations of readLocalFile() and four other functions 
> return ORC_UNIQUE_PTR:
> ORC_UNIQUE_PTR<InputStream> readLocalFile(const std::string& path);
> ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path);
> ORC_UNIQUE_PTR<Reader> createReader(ORC_UNIQUE_PTR<InputStream> stream,
>  const ReaderOptions& options);
> ORC_UNIQUE_PTR<Writer> createWriter(const Type& type, OutputStream* stream, 
> const WriterOptions& options);
> However, these functions' definitions all return std::unique_ptr. On a system 
> where ORC_UNIQUE_PTR is not defined as std::unique_ptr but std::auto_ptr, 
> there is inconsistency.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-427) Fix errors in ORC C++ public API Doxygen documentation

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669377#comment-16669377
 ] 

ASF GitHub Bot commented on ORC-427:


omalley closed pull request #330: ORC-427: Fix errors in ORC C++ public API 
Doxygen documentation.
URL: https://github.com/apache/orc/pull/330
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c++/include/orc/OrcFile.hh b/c++/include/orc/OrcFile.hh
index bd866c28b8..c64853168a 100644
--- a/c++/include/orc/OrcFile.hh
+++ b/c++/include/orc/OrcFile.hh
@@ -121,7 +121,7 @@ namespace orc {
  ORC_UNIQUE_PTR<InputStream> readHdfsFile(const std::string& path);
 
   /**
-   * Create a reader to the for the ORC file.
+   * Create a reader to read the ORC file.
* @param stream the stream to read
* @param options the options for reading the file
*/
diff --git a/c++/include/orc/Reader.hh b/c++/include/orc/Reader.hh
index b5e2e7ee32..5818b46a86 100644
--- a/c++/include/orc/Reader.hh
+++ b/c++/include/orc/Reader.hh
@@ -55,7 +55,7 @@ namespace orc {
 ReaderOptions& setErrorStream(std::ostream& stream);
 
 /**
- * Open the file used a serialized copy of the file tail.
+ * Set a serialized copy of the file tail to be used when opening the file.
  *
  * When one process opens the file and other processes need to read
  * the rows, we want to enable clients to just read the tail once.
@@ -236,7 +236,7 @@ namespace orc {
 
   /**
* The interface for reading ORC file meta-data and constructing RowReaders.
-   * This is an an abstract class that will subclassed as necessary.
+   * This is an an abstract class that will be subclassed as necessary.
*/
   class Reader {
   public:
@@ -257,7 +257,7 @@ namespace orc {
 
 /**
  * Get the user metadata keys.
- * @return the set of metadata keys
+ * @return the set of user metadata keys
  */
 virtual std::list<std::string> getMetadataKeys() const = 0;
 
@@ -306,7 +306,7 @@ namespace orc {
 virtual WriterVersion getWriterVersion() const = 0;
 
 /**
- * Get the number of rows per a entry in the row index.
+ * Get the number of rows per an entry in the row index.
  * @return the number of rows per an entry in the row index or 0 if there
  * is no row index.
  */
@@ -320,7 +320,7 @@ namespace orc {
 
 /**
  * Get the information about a stripe.
- * @param stripeIndex the stripe 0 to N-1 to get information about
+ * @param stripeIndex the index of the stripe (0 to N-1) to get 
information about
  * @return the information about that stripe
  */
 virtual ORC_UNIQUE_PTR<StripeInformation>
@@ -334,7 +334,7 @@ namespace orc {
 
 /**
  * Get the statistics about a stripe.
- * @param stripeIndex the stripe 0 to N-1 to get statistics about
+ * @param stripeIndex the index of the stripe (0 to N-1) to get statistics 
about
  * @return the statistics about that stripe
  */
 virtual ORC_UNIQUE_PTR<StripeStatistics>
@@ -347,19 +347,19 @@ namespace orc {
 virtual uint64_t getContentLength() const = 0;
 
 /**
- * Get the length of the file stripe statistics
+ * Get the length of the file stripe statistics.
  * @return the number of compressed bytes in the file stripe statistics
  */
 virtual uint64_t getStripeStatisticsLength() const = 0;
 
 /**
- * Get the length of the file footer
+ * Get the length of the file footer.
  * @return the number of compressed bytes in the file footer
  */
 virtual uint64_t getFileFooterLength() const = 0;
 
 /**
- * Get the length of the file postscript
+ * Get the length of the file postscript.
  * @return the number of bytes in the file postscript
  */
 virtual uint64_t getFilePostscriptLength() const = 0;
@@ -378,13 +378,14 @@ namespace orc {
 
 /**
  * Get the statistics about a single column in the file.
+ * @param columnId id of the column
  * @return the information about the column
  */
 virtual ORC_UNIQUE_PTR<ColumnStatistics>
 getColumnStatistics(uint32_t columnId) const = 0;
 
 /**
- * check file has correct column statistics
+ * Check if the file has correct column statistics.
  */
 virtual bool hasCorrectStatistics() const = 0;
 
@@ -443,17 +444,17 @@ namespace orc {
 virtual uint64_t getMemoryUseByFieldId(const std::list<uint64_t>& include, 
int stripeIx=-1) = 0;
 
 /**
+ * @param names Column Names
  * @param stripeIx index of the stripe to be read (if not specified,
  *all stripes are considered).
- * @param names Column Names
  * @return upper bound on memory use by selected columns
  */
 virtual uint64_t getMemoryUseByName(const std::list<std::string>& names, 
int stripeIx=-1) 

[jira] [Commented] (ORC-426) Errors in ORC Specification

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669373#comment-16669373
 ] 

ASF GitHub Bot commented on ORC-426:


omalley closed pull request #329: ORC-426: Fix errors in ORC specification.
URL: https://github.com/apache/orc/pull/329
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 32ce14a151..b4fea4e81b 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -725,7 +725,7 @@ DIRECT| PRESENT | Yes  | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding  | Stream Kind | Optional | Contents
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index fb90c8353c..5dbd3d027f 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -581,8 +581,6 @@ the index values and the additional value bits.
   bit is set, the entire value is negated.
 * Data values (W * L bits padded to the byte) - A sequence of W bit positive
   values that are added to the base value.
-* Data values (W * L bits padded to the byte) - A sequence of W bit positive
-  values that are added to the base value.
 * Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
   that didn't fit within W bits. Each entry in the list consists of a
   gap, which is the number of elements skipped from the previous
@@ -899,7 +897,7 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding  | Stream Kind | Optional | Contents
@@ -978,7 +976,7 @@ group (default to 10,000 rows) in a column. Only the row 
groups that
 satisfy min/max row index evaluation will be evaluated against the
 bloom filter index.
 
-Each BloomFilterEntry stores the number of hash functions ('k') used
+Each bloom filter entry stores the number of hash functions ('k') used
 and the bitset backing the bloom filter. The original encoding (pre
 ORC-101) of bloom filters used the bitset field encoded as a repeating
 sequence of longs in the bitset field with a little endian encoding
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 76ee571f0e..d91139c0fe 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -601,8 +601,6 @@ the index values and the additional value bits.
   bit is set, the entire value is negated.
 * Data values (W * L bits padded to the byte) - A sequence of W bit positive
   values that are added to the base value.
-* Data values (W * L bits padded to the byte) - A sequence of W bit positive
-  values that are added to the base value.
 * Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
   that didn't fit within W bits. Each entry in the list consists of a
   gap, which is the number of elements skipped from the previous
@@ -916,7 +914,7 @@ DIRECT_V2 | PRESENT | Yes  | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding  | Stream Kind | Optional | Contents
@@ -995,7 +993,7 @@ group (default to 10,000 rows) in a column. Only the row 
groups that
 satisfy min/max row index evaluation will be evaluated against the
 bloom filter index.
 
-Each BloomFilterEntry stores the number of hash functions ('k') used
+Each bloom filter entry stores the number of hash functions ('k') used
 and the bitset backing the bloom filter. The original encoding (pre
 ORC-101) of bloom filters used the bitset field encoded as a repeating
 sequence of longs in the bitset field with a little endian encoding


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Errors in ORC Specification
> ---
>
> Key: ORC-426
> URL: https://issues.apache.org/jira/browse/ORC-426

[jira] [Commented] (ORC-427) Fix errors in ORC C++ public API Doxygen documentation

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669338#comment-16669338
 ] 

ASF GitHub Bot commented on ORC-427:


fangzheng opened a new pull request #330: ORC-427: Fix errors in ORC C++ public 
API Doxygen documentation.
URL: https://github.com/apache/orc/pull/330
 
 
   Fix a few typos, grammar and formatting errors in C++ public API's Doxygen 
documentation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix errors in ORC C++ public API Doxygen documentation
> --
>
> Key: ORC-427
> URL: https://issues.apache.org/jira/browse/ORC-427
> Project: ORC
>  Issue Type: Bug
>  Components: C++, documentation
>Reporter: Fang Zheng
>Priority: Minor
>
> There are a few typos, grammar and formatting errors in the C++ public API 
> Doxygen documentation. For example, there is a broken sentence in 
> createReader() function in OrcFile.hh:
> "Create a reader to the for the ORC file."
> Please see the pull request for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669332#comment-16669332
 ] 

ASF GitHub Bot commented on ORC-418:


omalley commented on issue #328: ORC-418. Fix broken docker builds.
URL: https://github.com/apache/orc/pull/328#issuecomment-434468936
 
 
   The final patch makes some minor tweaks to the run-all script, so that the 
output is saved to a file rather than stdout. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668976#comment-16668976
 ] 

ASF GitHub Bot commented on ORC-418:


omalley commented on issue #328: ORC-418. Fix broken docker builds.
URL: https://github.com/apache/orc/pull/328#issuecomment-434368372
 
 
   Ok, that didn't work well either. I'm going to switch it to uint64_t, which 
works in other contexts.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> --
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the centos 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668957#comment-16668957
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229376031
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -850,27 +977,76 @@ namespace orc {
 
 virtual void recordPosition() const override;
 
+virtual void createRowIndexEntry() override;
+
+virtual void writeDictionary() override;
+
+virtual void reset() override;
+
+  private:
+/**
+ * dictionary related functions
+ */
+bool checkDictionaryKeyRatio();
+void createDirectStreams();
+void createDictStreams();
+void deleteDictStreams();
+void fallbackToDirectEncoding();
+
   protected:
-std::unique_ptr<RleEncoder> lengthEncoder;
-std::unique_ptr<AppendOnlyBufferedStream> dataStream;
 RleVersion rleVersion;
+bool useCompression;
+const StreamsFactory& streamsFactory;
+bool alignedBitPacking;
+
+// direct encoding streams
+std::unique_ptr<RleEncoder> directLengthEncoder;
+std::unique_ptr<AppendOnlyBufferedStream> directDataStream;
+
+// dictionary encoding streams
+std::unique_ptr<RleEncoder> dictDataEncoder;
+std::unique_ptr<RleEncoder> dictLengthEncoder;
+std::unique_ptr<AppendOnlyBufferedStream> dictStream;
+
+/**
+ * dictionary related variables
+ */
+StringDictionary dictionary;
+// whether or not dictionary checking is done
+bool doneDictionaryCheck;
+// whether or not it should be used
+bool useDictionary;
+// keys in the dictionary should not exceed this ratio
+double dictSizeThreshold;
+// record index of insertion order in the dictionary for not-null rows
+std::vector<int64_t> idxInDictBuffer;
 
 Review comment:
   In my opinion, the StringColumnWriter must not have any state corresponding 
to the dictionary encoding.
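
The checkDictionaryKeyRatio()/fallbackToDirectEncoding() pair listed above boils down to a ratio test: once too large a fraction of the values are distinct, the dictionary stops paying for itself. A standalone sketch of that decision (the function and parameter names are assumptions for illustration, not the PR's code):

{code:cpp}
#include <cstdint>
#include <iostream>

// Keep dictionary encoding only while distinct keys stay below the
// configured fraction of non-null values written so far.
bool shouldKeepDictionary(uint64_t distinctKeys, uint64_t nonNullValues,
                          double dictSizeThreshold) {
  if (nonNullValues == 0) {
    return true;  // nothing written yet, keep trying the dictionary
  }
  double ratio = static_cast<double>(distinctKeys) /
                 static_cast<double>(nonNullValues);
  return ratio <= dictSizeThreshold;
}

int main() {
  // 900 distinct keys out of 10000 values: well repeated, keep dictionary.
  std::cout << shouldKeepDictionary(900, 10000, 0.8) << std::endl;   // 1
  // 9900 distinct keys out of 10000: fall back to direct encoding.
  std::cout << shouldKeepDictionary(9900, 10000, 0.8) << std::endl;  // 0
  return 0;
}
{code}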


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668953#comment-16668953
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229362176
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
+  // update dictionary entry to link pointer to internal storage
+  DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+  entry->data = data.back().data();
+  totalLength += len;
+}
+return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+  dataStream->write(it->first.data, it->first.length);
+  lengthEncoder->write(static_cast<int64_t>(it->first.length));
+}
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   It took me a while to understand that we are actually calculating the 
dictionary indexes of the input values. Is this correct? Can we include this in 
the code comments?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> -
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to C++ 
> writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668951#comment-16668951
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229362176
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+struct DictEntry {
+  DictEntry(const char * str, size_t len):data(str),length(len) {}
+  const char * data;
+  size_t length;
+};
+
+StringDictionary():totalLength(0) {}
+
+// insert a new string into dictionary, return its insertion order
+size_t insert(const char * data, size_t len);
+
+// write dictionary data & length to output buffer
+void flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const;
+
+// reorder input index buffer from insertion order to dictionary order
+void reorder(std::vector<int64_t>& idxBuffer) const;
+
+// get dict entries in insertion order
+std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+// return count of entries
+size_t size() const;
+
+// return total length of strings in the dictionary
+uint64_t length() const;
+
+void clear();
+
+  private:
+struct LessThan {
+  bool operator()(const DictEntry& left, const DictEntry& right) const {
+int ret = memcmp(left.data, right.data, std::min(left.length, 
right.length));
+if (ret != 0) {
+  return ret < 0;
+}
+return left.length < right.length;
+  }
+};
+
+std::map<DictEntry, size_t, LessThan> dict;
+std::vector<std::vector<char>> data;
+uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+auto ret = dict.insert({DictEntry(str, len), dict.size()});
+if (ret.second) {
+  // make a copy to internal storage
+  data.push_back(std::vector<char>(len));
+  memcpy(data.back().data(), str, len);
+  // update dictionary entry to link pointer to internal storage
+  DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+  entry->data = data.back().data();
+  totalLength += len;
+}
+return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+   RleEncoder * lengthEncoder) const {
+for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+  dataStream->write(it->first.data, it->first.length);
+  lengthEncoder->write(static_cast<int64_t>(it->first.length));
+}
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   It took me a while to understand that we are actually calculating the 
dictionary indexes of the input values. Is this correct?
   Can we improve the comments about this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668942#comment-16668942
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229362176
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+    struct DictEntry {
+      DictEntry(const char * str, size_t len):data(str),length(len) {}
+      const char * data;
+      size_t length;
+    };
+
+    StringDictionary():totalLength(0) {}
+
+    // insert a new string into dictionary, return its insertion order
+    size_t insert(const char * data, size_t len);
+
+    // write dictionary data & length to output buffer
+    void flush(AppendOnlyBufferedStream * dataStream,
+               RleEncoder * lengthEncoder) const;
+
+    // reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+    // get dict entries in insertion order
+    std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+    // return count of entries
+    size_t size() const;
+
+    // return total length of strings in the dictionary
+    uint64_t length() const;
+
+    void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data, std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
+    uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+    auto ret = dict.insert({DictEntry(str, len), dict.size()});
+    if (ret.second) {
+      // make a copy to internal storage
+      data.push_back(std::vector<char>(len));
+      memcpy(data.back().data(), str, len);
+      // update dictionary entry to link pointer to internal storage
+      DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+      entry->data = data.back().data();
+      totalLength += len;
+    }
+    return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+                               RleEncoder * lengthEncoder) const {
+    for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+      dataStream->write(it->first.data, it->first.length);
+      lengthEncoder->write(static_cast<int64_t>(it->first.length));
+    }
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   It took me a while to understand that we are actually calculating the 
dictionary indexes of the input values. Is this correct?
   Can we name this `getIndices`?
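
A side note on the ordering that drives this sorted dictionary: `LessThan` compares the common prefix with `memcmp` and breaks ties on length, so a string sorts before any longer string it prefixes. A small self-contained check of that behavior (standalone names, assuming only the semantics quoted above):

```
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

struct Entry {
  const char* data;
  size_t length;
};

// Mirrors the quoted comparator: prefix bytes first, length as tiebreaker.
static bool lessThan(const Entry& l, const Entry& r) {
  int ret = std::memcmp(l.data, r.data, std::min(l.length, r.length));
  if (ret != 0) {
    return ret < 0;
  }
  return l.length < r.length;  // equal prefix: shorter string sorts first
}

int main() {
  Entry app{"app", 3}, apple{"apple", 5}, banana{"banana", 6};
  assert(lessThan(app, apple));     // shared prefix, shorter first
  assert(lessThan(apple, banana));  // 'a' < 'b' decides immediately
  assert(!lessThan(apple, apple));  // strict weak ordering: irreflexive
  return 0;
}
```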


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668947#comment-16668947
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229362176
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+    struct DictEntry {
+      DictEntry(const char * str, size_t len):data(str),length(len) {}
+      const char * data;
+      size_t length;
+    };
+
+    StringDictionary():totalLength(0) {}
+
+    // insert a new string into dictionary, return its insertion order
+    size_t insert(const char * data, size_t len);
+
+    // write dictionary data & length to output buffer
+    void flush(AppendOnlyBufferedStream * dataStream,
+               RleEncoder * lengthEncoder) const;
+
+    // reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+    // get dict entries in insertion order
+    std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+    // return count of entries
+    size_t size() const;
+
+    // return total length of strings in the dictionary
+    uint64_t length() const;
+
+    void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data, std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
+    uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+    auto ret = dict.insert({DictEntry(str, len), dict.size()});
+    if (ret.second) {
+      // make a copy to internal storage
+      data.push_back(std::vector<char>(len));
+      memcpy(data.back().data(), str, len);
+      // update dictionary entry to link pointer to internal storage
+      DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+      entry->data = data.back().data();
+      totalLength += len;
+    }
+    return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+                               RleEncoder * lengthEncoder) const {
+    for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+      dataStream->write(it->first.data, it->first.length);
+      lengthEncoder->write(static_cast<int64_t>(it->first.length));
+    }
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   It took me a while to understand that we are actually calculating the 
dictionary indexes of the input values. Is this correct?
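
To make the contract under review concrete: `insert` deduplicates, so a repeated string returns the id assigned at first insertion and leaves the dictionary unchanged. A toy sketch of that behavior with `std::string` keys (the PR itself keys on `DictEntry`):

```
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

int main() {
  std::map<std::string, size_t> dict;
  auto insert = [&dict](const std::string& s) -> size_t {
    // Mirrors dict.insert({entry, dict.size()}) in the quoted code:
    // if the key already exists, the existing id comes back untouched.
    return dict.insert({s, dict.size()}).first->second;
  };
  assert(insert("pear") == 0);
  assert(insert("apple") == 1);
  assert(insert("pear") == 0);  // duplicate: first insertion's id
  assert(dict.size() == 2);     // dictionary did not grow
  return 0;
}
```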


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668904#comment-16668904
 ] 

ASF GitHub Bot commented on ORC-418:


omalley commented on a change in pull request #328: ORC-418. Fix broken docker 
builds.
URL: https://github.com/apache/orc/pull/328#discussion_r229362650
 
 

 ##
 File path: c++/src/RLEv1.hh
 ##
 @@ -22,6 +22,7 @@
 #include "Adaptor.hh"
 #include "RLE.hh"
 
+#include 
 
 Review comment:
   Oh, thank you. I couldn't believe that uint wasn't built in. I couldn't 
figure out what to Google to find the answer, and I don't have a Windows box.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> ------------------------------
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the CentOS 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-420) [C++] Implement string dictionary encoding for C++ writer

2018-10-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668901#comment-16668901
 ] 

ASF GitHub Bot commented on ORC-420:


majetideepak commented on a change in pull request #326: ORC-420: [C++] 
Implement string dictionary encoding for C++ writer
URL: https://github.com/apache/orc/pull/326#discussion_r229362176
 
 

 ##
 File path: c++/src/ColumnWriter.cc
 ##
 @@ -831,6 +843,121 @@ namespace orc {
 dataStream->recordPosition(rowIndexPosition.get());
   }
 
+  /**
+   * Implementation of increasing sorted string dictionary
+   */
+  class StringDictionary {
+  public:
+    struct DictEntry {
+      DictEntry(const char * str, size_t len):data(str),length(len) {}
+      const char * data;
+      size_t length;
+    };
+
+    StringDictionary():totalLength(0) {}
+
+    // insert a new string into dictionary, return its insertion order
+    size_t insert(const char * data, size_t len);
+
+    // write dictionary data & length to output buffer
+    void flush(AppendOnlyBufferedStream * dataStream,
+               RleEncoder * lengthEncoder) const;
+
+    // reorder input index buffer from insertion order to dictionary order
+    void reorder(std::vector<int64_t>& idxBuffer) const;
+
+    // get dict entries in insertion order
+    std::vector<const DictEntry *> entriesInInsertionOrder() const;
+
+    // return count of entries
+    size_t size() const;
+
+    // return total length of strings in the dictionary
+    uint64_t length() const;
+
+    void clear();
+
+  private:
+    struct LessThan {
+      bool operator()(const DictEntry& left, const DictEntry& right) const {
+        int ret = memcmp(left.data, right.data, std::min(left.length, right.length));
+        if (ret != 0) {
+          return ret < 0;
+        }
+        return left.length < right.length;
+      }
+    };
+
+    std::map<DictEntry, size_t, LessThan> dict;
+    std::vector<std::vector<char>> data;
+    uint64_t totalLength;
+  };
+
+  // insert a new string into dictionary, return its insertion order
+  size_t StringDictionary::insert(const char * str, size_t len) {
+    auto ret = dict.insert({DictEntry(str, len), dict.size()});
+    if (ret.second) {
+      // make a copy to internal storage
+      data.push_back(std::vector<char>(len));
+      memcpy(data.back().data(), str, len);
+      // update dictionary entry to link pointer to internal storage
+      DictEntry * entry = const_cast<DictEntry *>(&(ret.first->first));
+      entry->data = data.back().data();
+      totalLength += len;
+    }
+    return ret.first->second;
+  }
+
+  // write dictionary data & length to output buffer
+  void StringDictionary::flush(AppendOnlyBufferedStream * dataStream,
+                               RleEncoder * lengthEncoder) const {
+    for (auto it = dict.cbegin(); it != dict.cend(); ++it) {
+      dataStream->write(it->first.data, it->first.length);
+      lengthEncoder->write(static_cast<int64_t>(it->first.length));
+    }
+  }
+
+  // reorder input index buffer from insertion order to dictionary order
+  void StringDictionary::reorder(std::vector<int64_t>& idxBuffer) const {
 
 Review comment:
   It took me a while to understand that we are actually calculating the 
dictionary indexes of the input values. Is this correct?
   Can we name this `getIndices`?
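
Whatever name the method ends up with, the flush step that follows is easy to picture: entries are walked in sorted order, their bytes concatenated into one DATA stream while their lengths go to a LENGTH stream. A toy sketch where plain containers stand in for `AppendOnlyBufferedStream` and `RleEncoder`:

```
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  std::map<std::string, size_t> dict = {
    {"pear", 0}, {"apple", 1}, {"banana", 2}};

  std::string dataBlob;         // stand-in for the DATA stream
  std::vector<size_t> lengths;  // stand-in for the RLE length encoder

  for (const auto& kv : dict) {  // sorted order: apple, banana, pear
    dataBlob += kv.first;
    lengths.push_back(kv.first.size());
  }

  std::cout << dataBlob << "\n";  // applebananapear
  for (size_t len : lengths) {
    std::cout << len << " ";      // 5 6 4
  }
  std::cout << "\n";
}
```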


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Implement string dictionary encoding for C++ writer
> ---------------------------------------------------------
>
> Key: ORC-420
> URL: https://issues.apache.org/jira/browse/ORC-420
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> The scope of this Jira is to add string dictionary encoding support to the C++ writer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-418) Fix broken centos6 compilation

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668114#comment-16668114
 ] 

ASF GitHub Bot commented on ORC-418:


majetideepak commented on a change in pull request #328: ORC-418. Fix broken 
docker builds.
URL: https://github.com/apache/orc/pull/328#discussion_r229172742
 
 

 ##
 File path: c++/src/RLEv1.hh
 ##
 @@ -22,6 +22,7 @@
 #include "Adaptor.hh"
 #include "RLE.hh"
 
+#include 
 
 Review comment:
   http://www.cplusplus.com/forum/beginner/128573/
   says you must include `windows.h`:
   ```
   #ifdef _WIN32
   #include <windows.h>
   #endif
   ```
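
The underlying portability point: `uint` is not a standard C++ type; it comes from POSIX headers or, on Windows, from `windows.h`. The portable route is the fixed-width aliases in `<cstdint>`; a minimal sketch of that alternative (illustrative only, not the actual patch in PR #328):

```
#include <cstdint>
#include <iostream>

int main() {
  // Fixed-width aliases are well-defined on every platform, unlike `uint`.
  std::uint32_t bitWidth = 24;
  std::uint64_t mask = (UINT64_C(1) << bitWidth) - 1;
  std::cout << std::hex << mask << "\n";  // ffffff
  return 0;
}
```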
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix broken centos6 compilation
> ------------------------------
>
> Key: ORC-418
> URL: https://issues.apache.org/jira/browse/ORC-418
> Project: ORC
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.6.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> We've accidentally broken the CentOS 6 C++ build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   5   6   7   8   9   10   >