GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/257
Minimize the scope of runAsUser to support Hadoop token passing
To support accessing a secure HDFS installation from Mesos-executed Spark
jobs, the resources attached to the `SparkContext
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/257#issuecomment-38820877
Yes, that is what I'm doing. It's the only way I've gotten jobs that access
our secure HDFS cluster to run under Mesos. YARN sets this environmen
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/257#issuecomment-38830103
No, this was the bare minimum I could do to get a demo hacked together
to run under Mesos. I haven't tested it extensively or tried standalone
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/257#issuecomment-53469729
Indeed, we are using Spark 1.0.x without this patch.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user NathanHowell closed the pull request at:
https://github.com/apache/spark/pull/257
---
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Rebased again to pick up the build break hotfix in
c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100219153
--- Diff:
core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
@@ -194,5 +195,8 @@ class PortableDataStream
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100344282
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,102 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100344879
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,102 @@ class JacksonParser
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16199
@HyukjinKwon Good idea, I'll take another stab and try to revive the
original pull request.
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100474809
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,102 @@ class JacksonParser
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Jenkins, retest this please.
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100610662
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,98 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100640620
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,98 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100646497
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,98 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100648524
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -394,36 +447,32 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100649635
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
---
@@ -31,10 +31,17 @@ import
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100649836
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -48,69 +47,110 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100650450
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -394,36 +447,32 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100651706
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
---
@@ -79,7 +80,7 @@ private[sql] object
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100651910
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100651990
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100652192
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100652259
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100652445
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
---
@@ -0,0 +1,213 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100653580
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100653757
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100653835
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100653879
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,123 @@ class JsonSuite extends
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@cloud-fan I just pushed a few more changes to address some of your
comments. I'll be back later next week to continue work.
---
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@cloud-fan When implementing tests for the other modes I uncovered an
existing bug in schema inference in `DROPMALFORMED` mode:
https://issues.apache.org/jira/browse/SPARK-19641. Sin
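For context, `DROPMALFORMED` mode simply discards records that fail to parse. A toy stdlib sketch of that behavior (the function name is illustrative, and this sketch does not reproduce the inference bug referenced in SPARK-19641):

```python
import json

def parse_dropmalformed(lines):
    """Parse line-delimited JSON, silently dropping malformed records."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except ValueError:
            continue  # DROPMALFORMED: skip the bad record entirely
    return rows

rows = parse_dropmalformed(['{"a": 1}', 'not json', '{"a": 2}'])
```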
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r101671453
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,117 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102662258
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102662330
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102662637
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102663016
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
---
@@ -43,23 +37,26 @@ class CSVFileFormat
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102665619
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102665872
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102667812
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16976#discussion_r102668816
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
---
@@ -0,0 +1,256 @@
+/*
+ * Licensed
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/16386
[SPARK-18352][SQL] Support parsing multiline json files
## What changes were proposed in this pull request?
If a new option `wholeFile` is set to `true` the JSON reader will parse
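The distinction between the default line-delimited mode and a `wholeFile`-style mode can be sketched in plain Python (function names here are illustrative, not Spark's API):

```python
import json

def parse_line_delimited(text):
    """Default mode: every line must be a complete JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def parse_whole_file(text):
    """wholeFile-style mode: the entire input is one JSON document,
    possibly pretty-printed across many lines."""
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]

pretty = '[\n  {"a": 1},\n  {"a": 2}\n]'
# No single line of `pretty` is valid JSON on its own, so the default
# mode fails; whole-file parsing recovers both records.
records = parse_whole_file(pretty)
```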
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Hello, recent JacksonGenerator.scala committers, please take a look.
cc/ @rxin @hvanhovell @clockfly @hyukjinkwon @cloud-fan
---
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@srowen It is functionally the same as what you're suggesting. The question
is how (or if) it should be first class in the `DataFrameReader` API. If we
agree that it should be ex
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r93969732
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
---
@@ -0,0 +1,204 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r93970059
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
---
@@ -36,29 +31,31 @@ import
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@HyukjinKwon I agree that overloading the corrupt record column is
undesirable and `F.input_file_name` is a better way to fetch the filename. It
would be nice to extend this concept further
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
The tests failed for an unrelated reason, looks to be running out of heap
space in SBT somewhere.
---
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
@HyukjinKwon I just pushed a change that makes the corrupt record handling
consistent: if a corrupt record column is defined it will always get the json
text for failed records. If `wholeFile
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Can someone kick off the tests again? The last failure was in another
module (Kafka).
---
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Jenkins, retest this please
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17255#discussion_r105563532
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
---
@@ -23,24 +23,25 @@ import
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/17255
Would there be any additional benefit to replacing more (or all?) of the
uses of `RDD` with the equivalent `Dataset` operations?
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/17255#discussion_r105942833
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
---
@@ -40,18 +40,11 @@ private[sql] object
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/15813
Any thoughts on modifying `JsonToStruct` to support arrays (and options),
then parsing could be something like:
```
dataset.select(
  Column(Inline(
    JsonToValue
```
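In plain terms, the suggestion is to parse a JSON string into an array of records, then inline (explode) the array into one row per element. A stdlib sketch of that data flow (not Catalyst expressions; the function name is made up):

```python
import json

def json_to_rows(json_text):
    """Parse a JSON string, then 'inline' it: one output row per
    element if it is an array, a single row for a bare object."""
    value = json.loads(json_text)
    if not isinstance(value, list):
        value = [value]  # tolerate a single object, like an option type
    return value

rows = json_to_rows('[{"x": 1}, {"x": 2}]')
```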
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
Any other comments?
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100097749
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
---
@@ -0,0 +1,213 @@
+/*
+ * Licensed
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100098008
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
---
@@ -1764,4 +1769,125 @@ class JsonSuite extends
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100099791
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
---
@@ -31,10 +31,17 @@ import
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100100641
--- Diff:
core/src/main/scala/org/apache/spark/input/PortableDataStream.scala ---
@@ -194,5 +195,8 @@ class PortableDataStream
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100101464
--- Diff:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -160,7 +164,17 @@ public void writeTo(OutputStream out
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100103739
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -227,66 +267,71 @@ class JacksonParser
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16386#discussion_r100104738
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
---
@@ -298,22 +312,22 @@ class JacksonParser
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16386
I rebased to master and hopefully addressed all of your comments
@cloud-fan, please have another look.
---
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/16084
[SPARK-18654][SQL] Remove unreachable patterns in makeRootConverter
## What changes were proposed in this pull request?
`makeRootConverter` is only called with a `StructType` value
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16084
cc/ @HyukjinKwon
---
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/16089
[SPARK-18658][SQL] Write text records directly to a FileOutputStream
## What changes were proposed in this pull request?
This replaces uses of `TextOutputFormat` with an `OutputStream
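The idea, reduced to stdlib Python: write each record straight to a (possibly compression-wrapped) byte stream rather than going through a record-oriented writer class. Names here are illustrative, not Spark's:

```python
import gzip
import io

def write_text_records(records, raw):
    """Write records directly to an output stream, one per line,
    wrapped in a compression codec (gzip stands in for Hadoop codecs)."""
    with gzip.GzipFile(fileobj=raw, mode="wb") as out:
        for rec in records:
            out.write(rec.encode("utf-8"))
            out.write(b"\n")

buf = io.BytesIO()
write_text_records(["a", "b"], buf)
```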
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16089
This touches a fair number of components. I also haven't done any
performance testing to see what the impact of this is. I'm curious what your
thoughts are.
cc/ @marmbrus @rxin @Josh
cc/ @marmbrus @rxin @Josh
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16089
Yep. It uses the Hadoop `FileSystem` class to open files, just like
`TextOutputFormat` does.
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90380594
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
---
@@ -132,39 +128,17 @@ class
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16089
Doh, forgot to run the Hive tests. Should be fixed now.
---
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90385252
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90468563
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90468858
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
---
@@ -132,39 +128,17 @@ class
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90488454
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
---
@@ -132,39 +128,17 @@ class
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90501927
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
---
@@ -194,4 +194,8 @@ private[sql] class
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90502343
--- Diff:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -147,6 +147,17 @@ public void writeTo(ByteBuffer buffer
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90503024
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
---
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16089
@steveloughran Spark is handling the output committing somewhere further up
the stack. The path being passed in to `OutputWriterFactory.newInstance` is to
a temporary file, such as
`/private
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90509459
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
---
@@ -194,4 +194,8 @@ private[sql] class
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90521381
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala
---
@@ -245,24 +230,12 @@ private[csv] class
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16089#discussion_r90566162
--- Diff:
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -147,6 +147,17 @@ public void writeTo(ByteBuffer buffer
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16107
I wrote the buggy version, doh... but this LGTM. Thanks for the fix.
---
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/16199
[SPARK-18772][SQL] NaN/Infinite float parsing in JSON is inconsistent
## What changes were proposed in this pull request?
This relaxes the parsing of `Float` and `Double` columns to
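A minimal sketch of relaxed numeric parsing of this kind (the exact spellings the PR accepts may differ; `parse_double` and its accepted tokens are assumptions for illustration):

```python
import math

def parse_double(token):
    """Accept JSON numbers plus the IEEE-754 special values that
    strict JSON cannot represent as literals."""
    specials = {
        "NaN": float("nan"),
        "Infinity": float("inf"),
        "+Infinity": float("inf"),
        "-Infinity": float("-inf"),
    }
    if token in specials:
        return specials[token]
    return float(token)  # ordinary numeric token

value = parse_double("-Infinity")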
Github user NathanHowell commented on the issue:
https://github.com/apache/spark/pull/16199
Hello @HyukjinKwon, can you take a look at this one? I am unsure if we
should be accepting lowercased values like `nan` (versus strictly testing for
`NaN`) but I think this PR matches the
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/16375#discussion_r93460507
--- Diff:
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
---
@@ -591,7 +591,11 @@ public void
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/2721
[SPARK-3858][SQL] Pass the generator alias into logical plan node
The alias parameter is being ignored, which makes it more difficult to
specify a qualifier for Generator expressions.
You can
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/2721#issuecomment-58452145
It works properly from Hive... `HiveQl.withLateralView` creates `Generate`
instances directly and doesn't go through the `SchemaRDD.generate` helper
functio
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/2721#issuecomment-58460036
Alright, I've added a test that fails on master and is fixed by this pull
request.
---
If your project is set up for it, you can reply to this email and have
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/3213#issuecomment-63151280
Another approach is to use a `JsonGenerator` instead of an `ObjectMapper`.
This is the implementation I've been using for a while:
https://gist.githu
GitHub user NathanHowell opened a pull request:
https://github.com/apache/spark/pull/5801
[SPARK-5938][SQL] Improve JsonRDD performance
This patch comprises of a few related pieces of work:
* Schema inference is performed directly on the JSON token stream
* `String
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-97699564
Looks like it may also resolve
[SPARK-5443](https://issues.apache.org/jira/browse/SPARK-5443).
---
If your project is set up for it, you can reply to this email
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-97959395
Benchmarked a small-ish real dataset... Runs are with 5 executors (for 5
input splits) with data in HDFS:
step | before | after
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-98081628
I think it's in a decent state now, if this qualifies for the 1.4.0 merge
window I'll make time to work through any remaining issues (if any).
---
If yo
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-98251841
@yhuai Fine with me, I'm reworking the patch set now.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-98258602
@yhuai The updated patches do not test the old code. Do you have an opinion
on the best way to address this? I can duplicate the entire JsonSuite or try to
do
Github user NathanHowell commented on the pull request:
https://github.com/apache/spark/pull/5801#issuecomment-98260541
@marmbrus sounds good, I'll leave it as is.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/5801#discussion_r29565276
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD2.scala
---
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/5801#discussion_r29566704
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD2.scala
---
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/5801#discussion_r29566701
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD2.scala
---
@@ -0,0 +1,409 @@
+/*
+ * Licensed to the Apache Software
Github user NathanHowell commented on a diff in the pull request:
https://github.com/apache/spark/pull/5801#discussion_r29566889
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala ---
@@ -101,32 +103,83 @@ private[sql] class DefaultSource
1 - 100 of 164 matches
Mail list logo