[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433962#comment-16433962
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380465437
 
 
   Fair enough.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433957#comment-16433957
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: ARROW-2224: [C++] Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380464846
 
 
   I took it from here:
   > isdigit and isxdigit are the only standard narrow character classification 
functions that are not affected by the currently installed C locale. although 
some implementations (e.g. Microsoft in 1252 codepage) may classify additional 
single-byte characters as digits. 
   
   http://en.cppreference.com/w/cpp/string/byte/isdigit
   
   Not sure how authoritative that page is.
   
   That said, `static_cast(std::isdigit(c))` is not very pretty.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433952#comment-16433952
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380464015
 
 
   > It's not like isdigit is more readable anyway
   
   Readability wasn't my original concern, reimplementing a builtin function 
was.
   
   > and apparently it risks being locale-dependent on Windows (which is a can 
of worms).
   
   Is there some documentation on this somewhere? I found the following lines 
in the [`setlocale` documentation for Visual Studio 
2015](https://msdn.microsoft.com/en-us/library/x99tb11d.aspx):
   
   > LC_CTYPE
   The character-handling functions (except isdigit, isxdigit, mbstowcs, and 
mbtowc, which are unaffected).
   
   That suggests `isdigit` is *not* affected by locale. Am I reading something 
wrong?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433951#comment-16433951
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: ARROW-2224: [C++] Remove 
boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180765500
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   I mean if the parse function is taking a `string::const_iterator`, it maybe 
won't accept a different kind of iterator. Or we need to make it a template 
function, piling more layers of abstraction without any concrete advantage.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433942#comment-16433942
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380460395
 
 
   @pitrou Sweet. Thanks for doing this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433941#comment-16433941
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: ARROW-2224: [C++] Remove 
boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180761749
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   I don't follow how a `view`/`span` class is less abstracted, since that 
would presumably implement the c++ iterator interface, like every 
implementation of it usually does. However, as I said, I don't think this is 
really worth spending too much time on.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433888#comment-16433888
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: ARROW-2224: [C++] Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380443938
 
 
   Also adding a Decimal::FromString benchmark. That benchmark is 60% faster 
with the PR (1.8M items/second up from 700k items/second here).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433867#comment-16433867
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: ARROW-2224: [C++] Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380439980
 
 
   This PR reduces conversion time for decimals from Python to Arrow by ~45%:
   * before:
   ```
   [100.00%] ··· Running convert_builtins.ConvertPyListToArray.time_convert 
  ok
   [100.00%]  
   =
  type  
   -
decimal  177±0.06ms 
   =
   ```
   
   * after:
   ```
   [100.00%] ··· Running convert_builtins.ConvertPyListToArray.time_convert 
  ok
   [100.00%]  
   =
  type  
   -
decimal  101±0.5ms  
   =
   ```
   
   (irrelevant lines removed)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-11 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433683#comment-16433683
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: ARROW-2224: [C++] Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380401339
 
 
   Reverted the `std::isdigit` change because of a silly performance warning 
(turned into an error) on MSVC. It's not like `isdigit` is more readable 
anyway, and apparently it risks being locale-dependent on Windows (which is a 
can of worms).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433182#comment-16433182
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380279472
 
 
   > I would think it's okay to remove them if they weren't there before the PR 
that introduced boost-regex.
   
   It seems the boost-regex references in the Python build files were 
introduced specifically for parquet: see cb5da9c1 and 8b1c8118.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433178#comment-16433178
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278645
 
 
   > By the way, I don't know if it's ok to remove boost-regex references from 
the Python build files, since parquet still requires boost-regex AFAIR.
   
   @xhochy Can you comment on this?
   
   I would think it's okay to remove them if they weren't there before the PR 
that introduced boost-regex.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433175#comment-16433175
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278461
 
 
   I think the current version supports that, does it not? I remember adding 
that.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433176#comment-16433176
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278524
 
 
   I may be misunderstand the usage of "should" here :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433174#comment-16433174
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278313
 
 
   Another note: the new implementation should support exponents without a 
sign, e.g. "1.23e45". The Python decimal module also does this. I need to add 
tests, though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433170#comment-16433170
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380277704
 
 
   By the way, I don't know if it's ok to remove boost-regex references from 
the Python build files, since parquet still requires boost-regex AFAIR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433160#comment-16433160
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180594466
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   By no means should you feel obliged to do that extra work :) If this works, 
then that's great. I thought maybe you had chosen not to use iterators for a 
particular reason.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433158#comment-16433158
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180594140
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, >whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, >fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, >exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents(s.data(), s.size(), )) {
+

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433138#comment-16433138
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592157
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
 Review comment:
   don't need parens here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433150#comment-16433150
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180593109
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, >whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, >fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, >exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents(s.data(), s.size(), )) {
+

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433147#comment-16433147
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592851
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
 Review comment:
   I was too lazy to look it up :-) Will do.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433145#comment-16433145
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592574
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, >whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, >fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, >exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents(s.data(), s.size(), )) {
+

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433136#comment-16433136
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180591991
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   Why not use iterators instead of a pointer plus size? It seems like we're 
subverting the language.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433117#comment-16433117
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou opened a new pull request #1880: [WIP] ARROW-2224: [C++] Remove 
boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880
 
 
   Use a hand-written decimal parser.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432996#comment-16432996
 ] 

Antoine Pitrou commented on ARROW-2224:
---

Note that the day where we really care about decimal performance, we should 
probably use Stefan Krah's libmpdec, which is also used by the Python _decimal 
module. It's a pure C library, is fast and would probably allow us to have 
tighter interoperation with Python decimals.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432995#comment-16432995
 ] 

Phillip Cloud commented on ARROW-2224:
--

The original implementation used C++ to implement a handwritten parser, but it 
quickly got complex once we needed to support integers with exponential 
notation. Maybe the complexity of dealing with boost is greater than the 
complexity of debugging a handwritten parser, in which case it would be worth 
it to reimplement it without a regular expression.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432987#comment-16432987
 ] 

Antoine Pitrou commented on ARROW-2224:
---

How important is it to have fast parsing of decimal types? If we reimplement 
our own parsing, we can tweak performance *and* remove the troublesome 
dependency.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432928#comment-16432928
 ] 

Phillip Cloud commented on ARROW-2224:
--

The "should" here isn't related to linking issues, rather, it's related to 
performance. See here for some benchmarks: 
http://lh3lh3.users.sourceforge.net/reb.shtml

Those benchmarks are now coming up on 8 years old, so we should revisit that.

Some of the linking issues might be mitigated by using {{}} on Windows 
and {{regcomp}} on UNIX platforms. I'm unaware of the perils of that solution 
(if any).

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432890#comment-16432890
 ] 

Antoine Pitrou commented on ARROW-2224:
---

Can you explain the "should"? AFAICS libre2 isn't header-only, so it may 
present the same linking and ABI issues as boost-regex.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)