[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16550


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95871812
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map dict) {
     return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+    if (b >= '0' && b <= '9') {
+      return b - '0';
+    }
+    throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative format, and convert it to
+   * positive format at the end, if this string is not started with '-'. This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is '2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+    if (numBytes == 0) {
+      throw new NumberFormatException("Empty string!");
--- End diff --

Could you remove the `!` from the exception message?





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95868053
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final long stopValue = Long.MIN_VALUE / radix;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * radix - digit;
+  // Since the previous result is less than or equal to 
stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we 
should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The 
fractional
+// part will not change the number, but we will verify that the 
fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
+   */
+  public int toInt() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final int stopValue = Integer.MIN_VALUE / radix;
+int result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However

[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95868295
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final long stopValue = Long.MIN_VALUE / radix;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * radix - digit;
+  // Since the previous result is less than or equal to 
stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we 
should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The 
fractional
+// part will not change the number, but we will verify that the 
fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
+   */
+  public int toInt() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final int stopValue = Integer.MIN_VALUE / radix;
+int result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However

[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95858028
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map dict) {
     return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+    if (b >= '0' && b <= '9') {
+      return b - '0';
+    }
+    throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative format, and convert it to
+   * positive format at the end, if this string is not started with '-'. This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is '2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
--- End diff --

`These codes are mostly copied ...` => `This code is mostly copied ...`





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95869209
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,185 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final long stopValue = Long.MIN_VALUE / 10;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * 10 - digit;
+  // Since the previous result is less than or equal to 
stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we 
should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The 
fractional
+// part will not change the number, but we will verify that the 
fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
+   */
+  public int toInt() {
--- End diff --

Can you add a comment about this? Is the only reason it would hurt performance the int conversion, where the result would have to be cast from long to int?
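
For readers of the archive, here is a minimal sketch of the alternative this question appears to weigh, assuming it is about why `toInt` repeats the parsing loop instead of delegating to `toLong`. The names `IntViaLong`, `parseLongStrict` and `parseIntViaLong` are hypothetical, and `Long.parseLong` stands in for the PR's `toLong()`; none of this is the PR's code.

```java
final class IntViaLong {
    // Hypothetical stand-in for a long parser. The PR's UTF8String.toLong()
    // is not used here; Long.parseLong only illustrates the shape.
    static long parseLongStrict(String s) {
        return Long.parseLong(s);
    }

    // Parse into a long first, then range-check and narrow to int.
    static int parseIntViaLong(String s) {
        long value = parseLongStrict(s);
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            // The text is a valid long but does not fit in an int.
            throw new NumberFormatException(s);
        }
        return (int) value; // safe narrowing after the range check
    }
}
```

In this shape, the int path pays for an extra range check and a narrowing cast on each call, which is the kind of cost the question refers to.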





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95868850
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final long stopValue = Long.MIN_VALUE / radix;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * radix - digit;
+  // Since the previous result is less than or equal to 
stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we 
should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The 
fractional
+// part will not change the number, but we will verify that the 
fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
--- End diff --

`These codes are mostly copied ...` => `This code is mostly copied ...`





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95850133
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/cast.sql ---
@@ -0,0 +1,43 @@
+-- cast string representing a valid fractional number to integral should truncate the number
+SELECT CAST('1.23' AS int);
+SELECT CAST('1.23' AS long);
+SELECT CAST('-4.56' AS int);
+SELECT CAST('-4.56' AS long);
+
+-- cast string which are not numbers to integral should return null
+SELECT CAST('abc' AS int);
+SELECT CAST('abc' AS long);
+
+-- cast string representing a very large number to integral should return null
+SELECT CAST('1234567890123' AS int);
+SELECT CAST('12345678901234567890123' AS long);
+
+-- cast empty string to integral should return null
+SELECT CAST('' AS int);
+SELECT CAST('' AS long);
+
+-- cast null to integral should return null
+SELECT CAST(NULL AS int);
+SELECT CAST(NULL AS long);
+
+-- cast invalid decimal string to integral should return null
+SELECT CAST('123.a' AS int);
+SELECT CAST('123.a' AS long);
+
+-- '-2147483648' is the smallest int value
+SELECT CAST('-2147483648' AS int);
+SELECT CAST('-2147483649' AS int);
+
+-- '2147483647' is the largest int value
+SELECT CAST('2147483647' AS int);
+SELECT CAST('2147483648' AS int);
+
+-- '-9223372036854775808' is the smallest long value
+SELECT CAST('-9223372036854775808' AS long);
+SELECT CAST('-9223372036854775809' AS long);
+
+-- '9223372036854775807' is the largest long value
+SELECT CAST('9223372036854775807' AS long);
+SELECT CAST('9223372036854775808' AS long);
+
+-- TODO: migrate all cast tests here.
--- End diff --

+1
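
The behaviour these test queries exercise can be sketched in plain Java. This is a rough illustration under stated assumptions, not Spark's cast implementation: `CastSketch` and `castToInt` are hypothetical names, the method works on a `String` rather than a `UTF8String`, and it returns `null` where the SQL comments say the cast should return null. It also assumes the out-of-range boundary cases are meant to yield null.

```java
final class CastSketch {
    // Returns the truncated int value, or null when the text is not a valid int.
    static Integer castToInt(String s) {
        if (s == null || s.isEmpty()) {
            return null;
        }
        boolean negative = s.charAt(0) == '-';
        int offset = (negative || s.charAt(0) == '+') ? 1 : 0;
        if (offset == s.length()) {
            return null; // a sign with no digits
        }
        final int radix = 10;
        final int stopValue = Integer.MIN_VALUE / radix;
        int result = 0; // accumulated as a negative number
        while (offset < s.length()) {
            char c = s.charAt(offset++);
            if (c == '.') {
                break; // integral part ends; the fraction is validated below
            }
            if (c < '0' || c > '9') {
                return null;
            }
            if (result < stopValue) {
                return null; // result * radix would overflow
            }
            result = result * radix - (c - '0');
            if (result > 0) {
                return null; // subtraction wrapped around
            }
        }
        // The fractional part does not change the value but must be all digits.
        while (offset < s.length()) {
            char c = s.charAt(offset++);
            if (c < '0' || c > '9') {
                return null;
            }
        }
        if (!negative) {
            result = -result;
            if (result < 0) {
                return null; // magnitude was Integer.MIN_VALUE, no positive form
            }
        }
        return result;
    }
}
```

Under these assumptions, `castToInt("1.23")` truncates to 1 and `castToInt("123.a")` yields null, matching the comments in the test file.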





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-12 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95850028
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,187 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final int radix = 10;
+final long stopValue = Long.MIN_VALUE / radix;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * radix - digit;
+  // Since the previous result is less than or equal to 
stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we 
should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The 
fractional
+// part will not change the number, but we will verify that the 
fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
--- End diff --

To the other reviewers: in the latest Hive master, it is `LazyInteger.parseInt`.





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95717360
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,185 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final long stopValue = Long.MIN_VALUE / 10;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that 
case.
+// Therefore we won't throw an exception here (checking the 
fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. 
However, before doing
+  // this, if the result is already smaller than the 
stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can 
stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * 10 - digit;
--- End diff --

Then, we can use `radix` here. 





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95717275
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,185 @@ public UTF8String translate(Map dict) {
     return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+    if (b >= '0' && b <= '9') {
+      return b - '0';
+    }
+    throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative format, and convert it to
+   * positive format at the end, if this string is not started with '-'. This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is '2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+    if (numBytes == 0) {
+      throw new NumberFormatException("Empty string!");
+    }
+
+    byte b = getByte(0);
+    final boolean negative = b == '-';
+    int offset = 0;
+    if (negative || b == '+') {
+      offset++;
+      if (numBytes == 1) {
+        throw new NumberFormatException(toString());
+      }
+    }
+
+    final byte separator = '.';
+    final long stopValue = Long.MIN_VALUE / 10;
--- End diff --

Could we define the constant `10` as a variable `radix`?
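
To illustrate the suggestion, here is a small standalone sketch in which the base is named once and reused for both the stop value and the accumulation step. It is an assumption-laden simplification, not the PR's method: `NegativeAccumulation` and `parseLong` are hypothetical names, the input is a `String` rather than a `UTF8String`, and the decimal-separator handling is left out.

```java
import java.nio.charset.StandardCharsets;

final class NegativeAccumulation {
    private static final int RADIX = 10; // the named constant the review asks for

    // Parses an optionally signed decimal integer, accumulating as a negative value.
    static long parseLong(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length == 0) {
            throw new NumberFormatException("Empty string");
        }
        boolean negative = bytes[0] == '-';
        int offset = (negative || bytes[0] == '+') ? 1 : 0;
        if (offset == bytes.length) {
            throw new NumberFormatException(s); // a sign with no digits
        }
        final long stopValue = Long.MIN_VALUE / RADIX;
        long result = 0;
        while (offset < bytes.length) {
            byte b = bytes[offset++];
            if (b < '0' || b > '9') {
                throw new NumberFormatException(s);
            }
            int digit = b - '0';
            // Below stopValue, result * RADIX cannot be represented.
            if (result < stopValue) {
                throw new NumberFormatException(s);
            }
            result = result * RADIX - digit;
            // A positive value here means the subtraction wrapped around.
            if (result > 0) {
                throw new NumberFormatException(s);
            }
        }
        if (!negative) {
            result = -result;
            // Negating Long.MIN_VALUE yields Long.MIN_VALUE again.
            if (result < 0) {
                throw new NumberFormatException(s);
            }
        }
        return result;
    }
}
```

Accumulating on the negative side lets "-9223372036854775808" parse without a special case, because Long.MIN_VALUE has no positive counterpart.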





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16550#discussion_r95588520
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -835,6 +835,185 @@ public UTF8String translate(Map 
dict) {
 return fromString(sb.toString());
   }
 
+  private int getDigit(byte b) {
+if (b >= '0' && b <= '9') {
+  return b - '0';
+}
+throw new NumberFormatException(toString());
+  }
+
+  /**
+   * Parses this UTF8String to long.
+   *
+   * Note that, in this method we accumulate the result in negative 
format, and convert it to
+   * positive format at the end, if this string is not started with '-'. 
This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is 
'2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyLong.parseLong in Hive.
+   */
+  public long toLong() {
+if (numBytes == 0) {
+  throw new NumberFormatException("Empty string!");
+}
+
+byte b = getByte(0);
+final boolean negative = b == '-';
+int offset = 0;
+if (negative || b == '+') {
+  offset++;
+  if (numBytes == 1) {
+throw new NumberFormatException(toString());
+  }
+}
+
+final byte separator = '.';
+final long stopValue = Long.MIN_VALUE / 10;
+long result = 0;
+
+while (offset < numBytes) {
+  b = getByte(offset);
+  offset++;
+  if (b == separator) {
+// We allow decimals and will return a truncated integral in that case.
+// Therefore we won't throw an exception here (checking the fractional
+// part happens below.)
+break;
+  }
+
+  int digit = getDigit(b);
+  // We are going to process the new digit and accumulate the result. However, before doing
+  // this, if the result is already smaller than the stopValue(Long.MIN_VALUE / 10), then
+  // result * 10 will definitely be smaller than minValue, and we can stop and throw exception.
+  if (result < stopValue) {
+throw new NumberFormatException(toString());
+  }
+
+  result = result * 10 - digit;
+  // Since the previous result is less than or equal to stopValue(Long.MIN_VALUE / 10), we can
+  // just use `result > 0` to check overflow. If result overflows, we should stop and throw
+  // exception.
+  if (result > 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+// This is the case when we've encountered a decimal separator. The fractional
+// part will not change the number, but we will verify that the fractional part
+// is well formed.
+while (offset < numBytes) {
+  if (getDigit(getByte(offset)) == -1) {
+throw new NumberFormatException(toString());
+  }
+  offset++;
+}
+
+if (!negative) {
+  result = -result;
+  if (result < 0) {
+throw new NumberFormatException(toString());
+  }
+}
+
+return result;
+  }
+
+  /**
+   * Parses this UTF8String to int.
+   *
+   * Note that, in this method we accumulate the result in negative format, and convert it to
+   * positive format at the end, if this string is not started with '-'. This is because min value
+   * is bigger than max value in digits, e.g. Integer.MAX_VALUE is '2147483647' and
+   * Integer.MIN_VALUE is '-2147483648'.
+   *
+   * These codes are mostly copied from LazyInt.parseInt in Hive.
+   */
+  public int toInt() {
--- End diff --

Hive also duplicates the code for parsing to long and to int. I'm not sure how
to remove the duplication without hurting performance.
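
To make the trade-off concrete, one obvious way to share the code would be to parse once as a long and then range-check before narrowing. The sketch below is only an assumption about what that could look like: `NarrowingSketch` and `toIntChecked` are hypothetical names, and the JDK's `Long.parseLong` stands in for the hand-written long parser. Whether the 64-bit arithmetic and the extra range check cost anything measurable on this path has not been benchmarked here.

```java
// Hypothetical sketch: Long.parseLong is a stand-in for the hand-written long parser.
final class NarrowingSketch {
    static int toIntChecked(String s) {
        long value = Long.parseLong(s);
        // Reject anything that does not fit in 32 bits instead of silently wrapping.
        if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
            throw new NumberFormatException(s);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(toIntChecked("2147483647"));
    }
}
```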





[GitHub] spark pull request #16550: [SPARK-19178][SQL] convert string of large number...

2017-01-11 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/16550

[SPARK-19178][SQL] convert string of large numbers to int should return null

## What changes were proposed in this pull request?

When we convert a string to an integral type, we first convert the string to
`decimal(20, 0)`, so that a string in decimal format is truncated to an
integral value, e.g. `CAST('1.2' AS int)` returns `1`.

However, this causes problems when the string holds a number too large for the
target type, e.g. `CAST('1234567890123' AS int)` returns `1912276171`,
while Hive returns null, as we expect.
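
The wrapped value can be reproduced with plain JDK types. The sketch below is not Spark's cast path; `CastWrapSketch` and `viaDecimal` are hypothetical names. It only shows that narrowing a decimal to an int keeps the low-order 32 bits, which is where `1912276171` comes from.

```java
import java.math.BigDecimal;

// Hypothetical illustration using JDK classes only; Spark's Decimal type is not involved.
final class CastWrapSketch {
    static int viaDecimal(String s) {
        // BigDecimal.intValue() drops the fraction and, if the value does not
        // fit in an int, returns only the low-order 32 bits.
        return new BigDecimal(s).intValue();
    }

    public static void main(String[] args) {
        System.out.println(viaDecimal("1.2"));            // prints 1
        System.out.println(viaDecimal("1234567890123"));  // prints 1912276171
    }
}
```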

This is a long-standing bug (it seems to have been there since the first day
Spark SQL was created). This PR fixes it by adding native support for
converting `UTF8String` to integral types.

## How was this patch tested?

new regression tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark string-to-int

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16550.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16550


commit afa7066b394217316789c06db467acf40c74cd28
Author: Wenchen Fan 
Date:   2017-01-11T08:14:44Z

native support for converting UTF8String to integral



