[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-06-06 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Yep, should be doable without too much effort. On Sun, Jun 4, 2017 at 9:54 PM, Xiao Li <notificati...@github.com> wrote: > @NathanHowell <https://github.com/

[GitHub] spark issue #12217: [WIP][SPARK-14408][CORE] Changed RDD.treeAggregate to us...

2017-06-02 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/12217 Nothing looks obviously broken, their combiner looks fine. Rerunning the tests would help. On Jun 2, 2017 07:02, "Hyukjin Kwon" <notificati...@github.com> wrot

[GitHub] spark pull request #17255: [SPARK-19918][SQL] Use TextFileFormat in implemen...

2017-03-14 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/17255#discussion_r105942833 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala --- @@ -40,18 +40,11 @@ private[sql] object

[GitHub] spark issue #17255: [SPARK-19918][SQL] Use TextFileFormat in implementation ...

2017-03-12 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/17255 Would there be any additional benefit of replacing more (or all?) of the uses of `RDD` with the equivalent `Dataset` operations? --- If your project is set up for it, you can reply

[GitHub] spark pull request #17255: [SPARK-19918][SQL] Use TextFileFormat in implemen...

2017-03-12 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/17255#discussion_r105563532 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -23,24 +23,25 @@ import

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102668816 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102667812 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102665872 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102665619 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102663016 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -43,23 +37,26 @@ class CSVFileFormat

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102662637 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102662258 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16976: [SPARK-19610][SQL] Support parsing multiline CSV ...

2017-02-23 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16976#discussion_r102662330 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -0,0 +1,256 @@ +/* + * Licensed

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-16 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r101671453 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,117 @@ class JsonSuite extends

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-16 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @cloud-fan When implementing tests for the other modes I've uncovered an existing bug in schema inference in `DROPMALFORMED` mode: https://issues.apache.org/jira/browse/SPARK-19641. Since

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @cloud-fan I just pushed a few more changes to address some of your comments. I'll be back later next week to continue work.

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100653879 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100653835 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100653757 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100653580 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100652445 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100652259 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100652192 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100651990 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100651910 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,123 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100651706 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala --- @@ -79,7 +80,7 @@ private[sql] object

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100650450 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -394,36 +447,32 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100649836 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,110 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100649635 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -31,10 +31,17 @@ import

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100648524 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -394,36 +447,32 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100646497 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,98 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100640620 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,98 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100610662 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,98 @@ class JacksonParser

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-10 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Jenkins, retest this please.

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-09 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100474809 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,102 @@ class JacksonParser

[GitHub] spark issue #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in JSON is...

2017-02-09 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16199 @HyukjinKwon Good idea, I'll take another stab and try to revive the original pull request.

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-09 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100344879 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,102 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-09 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100344282 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -48,69 +47,102 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100219153 --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala --- @@ -194,5 +195,8 @@ class PortableDataStream

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Rebased again to pick up the build break hotfix in c618ccdbe9ac103dfa3182346e2a14a1e7fca91a

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 I rebased to master and hopefully addressed all of your comments @cloud-fan, please have another look.

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100104738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -298,22 +312,22 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100103739 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -227,66 +267,71 @@ class JacksonParser

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100101464 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -160,7 +164,17 @@ public void writeTo(OutputStream out

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100100641 --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala --- @@ -194,5 +195,8 @@ class PortableDataStream

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100099791 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala --- @@ -31,10 +31,17 @@ import

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100098008 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala --- @@ -1764,4 +1769,125 @@ class JsonSuite extends

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r100097749 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-01-23 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Any other comments?

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-01-10 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Jenkins, retest this please

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-01-10 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Can someone kick off the tests again? The last failure was in another module (Kafka).

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-29 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @HyukjinKwon I just pushed a change that makes the corrupt record handling consistent: if a corrupt record column is defined it will always get the json text for failed records. If `wholeFile

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-27 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 The tests failed for an unrelated reason, looks to be running out of heap space in SBT somewhere.

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-27 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @HyukjinKwon I agree that overloading the corrupt record column is undesirable and `F.input_file_name` is a better way to fetch the filename. It would be nice to extend this concept further

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-27 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r93970059 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -36,29 +31,31 @@ import

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-27 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16386#discussion_r93969732 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -0,0 +1,204 @@ +/* + * Licensed

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-23 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 @srowen It is functionally the same as what you're suggesting. The question is how (or if) it should be first class in the `DataFrameReader` api. If we agree that it should be exposed

[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-22 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16386 Hello recent JacksonGenerator.scala committers, please take a look. cc/ @rxin @hvanhovell @clockfly @hyukjinkwon @cloud-fan

[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2016-12-22 Thread NathanHowell
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/16386 [SPARK-18352][SQL] Support parsing multiline json files ## What changes were proposed in this pull request? If a new option `wholeFile` is set to `true` the JSON reader will parse

[GitHub] spark pull request #16375: [SPARK-18963] o.a.s.unsafe.types.UTF8StringSuite....

2016-12-21 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16375#discussion_r93460507 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -591,7 +591,11 @@ public void

[GitHub] spark issue #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in JSON is...

2016-12-07 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16199 Hello @HyukjinKwon, can you take a look at this one? I am unsure if we should be accepting lowercased values like `nan` (versus strictly testing for `NaN`) but I think this PR matches

[GitHub] spark pull request #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in ...

2016-12-07 Thread NathanHowell
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/16199 [SPARK-18772][SQL] NaN/Infinite float parsing in JSON is inconsistent ## What changes were proposed in this pull request? This relaxes the parsing of `Float` and `Double` columns

[GitHub] spark issue #16107: SPARK-18677: Fix parsing ['key'] in JSON path expression...

2016-12-02 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16107 I wrote the buggy version, doh... but this LGTM. Thanks for the fix.

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90566162 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -147,6 +147,17 @@ public void writeTo(ByteBuffer buffer

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90521381 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -245,24 +230,12 @@ private[csv] class

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90509459 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -194,4 +194,8 @@ private[sql] class

[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16089 @steveloughran Spark is handling the output committing somewhere further up the stack. The path being passed in to `OutputWriterFactory.newInstance` is to a temporary file, such as `/private

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90503024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala --- @@ -0,0 +1,74 @@ +/* + * Licensed

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90502343 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -147,6 +147,17 @@ public void writeTo(ByteBuffer buffer

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90501927 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -194,4 +194,8 @@ private[sql] class

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90488454 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -132,39 +128,17 @@ class

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90468858 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -132,39 +128,17 @@ class

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-12-01 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90468563 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala --- @@ -0,0 +1,73 @@ +/* + * Licensed

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90385252 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala --- @@ -0,0 +1,73 @@ +/* + * Licensed

[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16089 Doh, forgot to run the Hive tests. Should be fixed now.

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90380594 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -132,39 +128,17 @@ class

[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16089 Yep. It uses the Hadoop `FileSystem` class to open files, just like `TextOutputFormat` does.

[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16089 This touches a fair number of components. I also haven't done any performance testing to see what the impact of this is. Curious what your thoughts are? cc/ @marmbrus @rxin @JoshRosen

[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-11-30 Thread NathanHowell
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/16089 [SPARK-18658][SQL] Write text records directly to a FileOutputStream ## What changes were proposed in this pull request? This replaces uses of `TextOutputFormat` with an `OutputStream

[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16084 cc/ @HyukjinKwon

[GitHub] spark pull request #16084: [SPARK-18654][SQL] Remove unreachable patterns in...

2016-11-30 Thread NathanHowell
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/16084 [SPARK-18654][SQL] Remove unreachable patterns in makeRootConverter ## What changes were proposed in this pull request? `makeRootConverter` is only called with a `StructType` value

[GitHub] spark issue #15813: [SPARK-18362][SQL] Use TextFileFormat in JsonFileFormat ...

2016-11-22 Thread NathanHowell
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/15813 Any thoughts on modifying `JsonToStruct` to support arrays (and options), then parsing could be something like: ``` dataset.select( Column(Inline( JsonToValue

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-28 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/12750#discussion_r61526935 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala --- @@ -246,12 +263,39 @@ private[sql] object

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-28 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/12750#discussion_r61526900 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala --- @@ -246,12 +263,39 @@ private[sql] object

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-28 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/12750#discussion_r61526786 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala --- @@ -76,6 +78,15 @@ private[sql] object

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-28 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215607825 Alright, here's a few ideas that will at least reduce allocations by a bit. Your version with the merge sort is likely better than the insertion sort here but I

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-28 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215585269 Would Guava's `Iterables.mergeSorted[T]` help out here?

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

2016-03-18 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/10231#discussion_r56742884 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -956,7 +956,7 @@ private[ml] object RandomForest extends Logging

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

2016-01-06 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/10231#issuecomment-169393380 @sethah looks good to me. :+1:

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

2015-12-09 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/10231#issuecomment-163422959 Yeah I can take a look tonight or tomorrow On Dec 9, 2015 14:25, "Seth Hendrickson" <notificati...@github.com> wrote: > @

[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-07 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146348263 There were already tests for the returned split lengths, so I just removed the metadata checks.

[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-10-06 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-146009085 I'll have time tomorrow

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/7946#discussion_r40846432 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala --- @@ -307,3 +308,140 @@ case class GetJsonObject

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/7946#issuecomment-144529990 Alright, I think I've addressed all your comments @yhuai. I haven't run the tests though :-)

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/7946#discussion_r40810335 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala --- @@ -307,3 +308,140 @@ case class GetJsonObject

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/7946#issuecomment-15101 @yhuai I'll see what I can do, running some larger jobs today so I may have a long enough gap to fix this up.

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/7946#discussion_r40809995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala --- @@ -307,3 +308,140 @@ case class GetJsonObject

[GitHub] spark pull request: [SPARK-9617] [SQL] Implement json_tuple

2015-09-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/7946#discussion_r40810618 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonFunctions.scala --- @@ -307,3 +308,140 @@ case class GetJsonObject

[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-12 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139792166 I tend to rebase out of habit to prevent merge-build failures. I'll look at the test failure on Monday, they were all passing at one point.

[GitHub] spark pull request: [SPARK-10064] [ML] Parallelize decision tree b...

2015-09-10 Thread NathanHowell
Github user NathanHowell commented on the pull request: https://github.com/apache/spark/pull/8246#issuecomment-139334977 @manishamde yes, same parameters. this dataset is about 100m examples, not sure offhand on the exact number of features but probably about 5k categorical features
