GitHub user laserson opened a pull request:
https://github.com/apache/incubator-spark/pull/576
Added parquetFileAsJSON to read Parquet data into JSON strings
This function makes it incredibly easy to read Parquet data, especially with
PySpark. Is there any interest in this? It
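For context, a minimal sketch of the idea behind reading Parquet into JSON strings (hypothetical illustration, not the PR's actual Scala code): each record is serialized to a JSON string on the JVM side, so Python workers can recover plain dicts with the standard library instead of needing Parquet/Avro bindings.

```python
import json

# Hypothetical illustration: records that exist as JVM-side objects
# (e.g. Avro GenericRecord) are serialized to JSON strings...
records = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
]
json_strings = [json.dumps(r, sort_keys=True) for r in records]

# ...and the Python side decodes them into plain dicts with no
# knowledge of the original file format.
decoded = [json.loads(s) for s in json_strings]
assert decoded == records
```

The trade-off, as noted later in the thread, is that every record is materialized and round-tripped through a text encoding, which adds cost and a format-specific dependency to Spark itself.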
GitHub user laserson commented on the pull request:
https://github.com/apache/incubator-spark/pull/576#issuecomment-34718389
No, this actually constructs Avro `GenericRecord` objects in memory. The
problem is that if you want access to the Parquet data through PySpark, there
is no
GitHub user laserson commented on the pull request:
https://github.com/apache/incubator-spark/pull/576#issuecomment-35035595
Yes, I have since thought about it more and agree that this would actually
be a bad idea. No need to add additional dependencies on other specific file
GitHub user laserson closed the pull request at:
https://github.com/apache/incubator-spark/pull/576
GitHub user laserson commented on the pull request:
https://github.com/apache/incubator-spark/pull/576#issuecomment-35040314
Yes, that's a much better suggestion. Thanks!
I'm happy to contribute these, but want to hear what the preferred method is first.
Uri
--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laser...@cloudera.com