[jira] [Created] (SPARK-9652) Make reading Avro to RDDs easier
Joseph Batchik created SPARK-9652:
----------------------------------

Summary: Make reading Avro to RDDs easier
Key: SPARK-9652
URL: https://issues.apache.org/jira/browse/SPARK-9652
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Joseph Batchik

Currently, reading in Avro files requires manually creating a Hadoop RDD and a decent amount of configuration. It would be nice to have a wrapper function, similar to `def binaryFiles`, that deals with all the configuration for you.
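For reference, a minimal sketch of what the manual path looks like today with the standard avro-mapred input format, i.e. the boilerplate a wrapper (hypothetically something like `sc.avroFiles(path)`) would hide; the helper name and shape here are illustrative, not an existing Spark API.

{code:java}
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readAvro(sc: SparkContext, path: String): RDD[GenericRecord] = {
  // newAPIHadoopFile needs the key, value, and input-format classes spelled out.
  sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path)
    // Hadoop input formats reuse record objects, so extract (or copy) the datum
    // before doing anything that caches the records.
    .map { case (key, _) => key.datum() }
}
{code}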
[jira] [Commented] (SPARK-8360) Streaming DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649729#comment-14649729 ]

Joseph Batchik commented on SPARK-8360:
---------------------------------------

Would streaming DataFrames replace streaming RDDs or coexist with them?

Streaming DataFrames
--------------------

Key: SPARK-8360
URL: https://issues.apache.org/jira/browse/SPARK-8360
Project: Spark
Issue Type: Umbrella
Components: SQL, Streaming
Reporter: Reynold Xin

Umbrella ticket to track what's needed to make streaming DataFrames a reality.
[jira] [Created] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark
Joseph Batchik created SPARK-9486:
----------------------------------

Summary: Add aliasing to data sources to allow external packages to register themselves with Spark
Key: SPARK-9486
URL: https://issues.apache.org/jira/browse/SPARK-9486
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Joseph Batchik
Priority: Minor

Currently Spark allows users to use external data sources like spark-avro, spark-csv, etc. by having them specify the full class name:

{code:java}
sqlContext.read.format("com.databricks.spark.avro").load(path)
{code}

Typing in a full class name is not ideal, so it would be nice to allow external packages to register themselves with Spark so that users can do something like:

{code:java}
sqlContext.read.format("avro").load(path)
{code}

This would make the external data source packages follow the same convention as the built-in data sources (parquet, json, jdbc, etc.). This could be accomplished by using a ServiceLoader.
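A rough sketch of the ServiceLoader idea follows, assuming a hypothetical registration trait and resolver; the names here are illustrative only, not Spark's actual API.

{code:java}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Each external package would ship an implementation of this trait plus a
// META-INF/services entry so that java.util.ServiceLoader can discover it.
trait DataSourceAlias {
  def alias: String            // e.g. "avro"
  def providerClass: String    // e.g. "com.databricks.spark.avro"
}

object DataSourceResolver {
  // Scan the classpath once for every registered alias.
  private lazy val aliases: Map[String, String] =
    ServiceLoader.load(classOf[DataSourceAlias]).asScala
      .map(p => p.alias -> p.providerClass)
      .toMap

  // Fall back to treating the name as a fully qualified class name, so the
  // existing long-form usage keeps working.
  def resolve(name: String): String = aliases.getOrElse(name, name)
}
{code}

With something like this in place, `sqlContext.read.format("avro")` would resolve the alias before loading the provider class, while `format("com.databricks.spark.avro")` would continue to work unchanged.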
[jira] [Comment Edited] (SPARK-746) Automatically Use Avro Serialization for Avro Objects
[ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963 ]

Joseph Batchik edited comment on SPARK-746 at 7/23/15 4:08 PM:
---------------------------------------------------------------

I did some benchmarks comparing the current implementation of serializing Avro records to this change; the results are located here: https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/pubhtml

was (Author: jd):
I did some benchmarks comparing the current implementation of serializing Avro records to this change; the results are located here: https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/edit?usp=sharing

Automatically Use Avro Serialization for Avro Objects
-----------------------------------------------------

Key: SPARK-746
URL: https://issues.apache.org/jira/browse/SPARK-746
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Cogan

All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher-up class as well). Since Avro records aren't Java-serializable by default, people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them.
[jira] [Comment Edited] (SPARK-746) Automatically Use Avro Serialization for Avro Objects
[ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963 ]

Joseph Batchik edited comment on SPARK-746 at 7/23/15 4:13 PM:
---------------------------------------------------------------

I did some benchmarks comparing the current implementation of serializing Avro records to this change; the results are located here: http://www.csh.rit.edu/~jd/spark/Avro%20Datapoints.xlsx

was (Author: jd):
I did some benchmarks comparing the current implementation of serializing Avro records to this change; the results are located here: https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/pubhtml

Automatically Use Avro Serialization for Avro Objects
-----------------------------------------------------

Key: SPARK-746
URL: https://issues.apache.org/jira/browse/SPARK-746
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Cogan

All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher-up class as well). Since Avro records aren't Java-serializable by default, people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them.
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639187#comment-14639187 ]

Joseph Batchik commented on SPARK-8007:
---------------------------------------

With SPARK-8668 you will be able to solve this issue by doing:

{code:java}
df.groupBy(expr("spark__partition__id()"))
{code}

This will make all of these virtual columns just function calls, so no changes to the analyzer will be needed.

Support resolving virtual columns in DataFrames
-----------------------------------------------

Key: SPARK-8007
URL: https://issues.apache.org/jira/browse/SPARK-8007
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Joseph Batchik

Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression. A cool use case is to understand physical data skew:

{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}
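A short usage sketch of the skew check described above, assuming an existing DataFrame df; the double-underscore function name follows this ticket and may differ in the final API.

{code:java}
import org.apache.spark.sql.functions.expr

// One output row per physical partition; badly skewed partitions stand out
// immediately in the counts.
val rowsPerPartition = df.groupBy(expr("spark__partition__id()")).count()
rowsPerPartition.show()
{code}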
[jira] [Commented] (SPARK-746) Automatically Use Avro Serialization for Avro Objects
[ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963 ]

Joseph Batchik commented on SPARK-746:
--------------------------------------

I did some benchmarks comparing the current implementation of serializing Avro records to this change; the results are located here: https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/edit?usp=sharing

Automatically Use Avro Serialization for Avro Objects
-----------------------------------------------------

Key: SPARK-746
URL: https://issues.apache.org/jira/browse/SPARK-746
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Cogan

All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher-up class as well). Since Avro records aren't Java-serializable by default, people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them.
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636180#comment-14636180 ]

Joseph Batchik commented on SPARK-8668:
---------------------------------------

Does this look like what you were thinking? https://github.com/JDrit/spark/commit/7fcf18a11427709d403418da8d444b434c63

expr function to convert SQL expression into a Column
------------------------------------------------------

Key: SPARK-8668
URL: https://issues.apache.org/jira/browse/SPARK-8668
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

selectExpr uses the expression parser to parse string expressions. It would be great to create an expr function in functions.scala/functions.py that converts a string into an expression (or a list of expressions separated by commas).
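A usage sketch of what the proposed expr() function would enable, assuming a DataFrame df with columns "name", "price", and "qty"; the point is that SQL expression strings become Columns usable anywhere a Column is accepted, not just inside selectExpr.

{code:java}
import org.apache.spark.sql.functions.expr

// SQL snippets parsed into Column objects, composable with other Column-based APIs.
df.select(expr("price * qty AS total"), expr("upper(name)"))
df.filter(expr("qty > 10"))

// Roughly equivalent to df.selectExpr("price * qty AS total", "upper(name)"),
// but usable in groupBy, filter, join conditions, and so on.
{code}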
[jira] [Comment Edited] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820 ]

Joseph Batchik edited comment on SPARK-8007 at 7/17/15 6:01 AM:
----------------------------------------------------------------

Reynold, I started adding virtual columns to DataFrames and SQL queries for SPARK-8003 and SPARK-8007. My initial code is here: https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40. The one issue I ran into, though, is that the catalyst package cannot access org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For prototyping purposes I copied SparkPartitionID into the catalyst package, but I am wondering what would be the best way to deal with that dependency. Can you let me know what you think about my changes and what else needs to be done?

was (Author: jd):
[~rxin] Reynold, I started adding virtual columns to DataFrames and SQL queries for SPARK-8003 and SPARK-8007. My initial code is here: https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40. The one issue I ran into, though, is that the catalyst package cannot access org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For prototyping purposes I copied SparkPartitionID into the catalyst package, but I am wondering what would be the best way to deal with that dependency. Can you let me know what you think about my changes and what else needs to be done?

Support resolving virtual columns in DataFrames
-----------------------------------------------

Key: SPARK-8007
URL: https://issues.apache.org/jira/browse/SPARK-8007
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression. A cool use case is to understand physical data skew:

{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631673#comment-14631673 ]

Joseph Batchik commented on SPARK-8007:
---------------------------------------

Reynold, thanks for pointing that out. I updated the commit to use what you suggested. This should also make it easy to add the other virtual columns described in the parent ticket: all that should be needed is updating the resolver in the logical plan and the new virtual column rule. https://github.com/JDrit/spark/commit/7b46e7de6f98df98480fa34c85248aa2d90bc635#diff-d74f782d414a74eee09a4b6b9994be87R34

Support resolving virtual columns in DataFrames
-----------------------------------------------

Key: SPARK-8007
URL: https://issues.apache.org/jira/browse/SPARK-8007
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression. A cool use case is to understand physical data skew:

{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}
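A very rough sketch of the rule-based approach discussed above, assuming the reserved name is resolved by an analyzer rule; this is illustrative only, not the code in the linked commit, and the placeholder below stands in for the SparkPartitionID expression at the heart of the package-dependency question raised earlier.

{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object ResolveVirtualColumns extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Swap the reserved column name for its backing expression. Other virtual
    // columns would simply add more cases (or a name-to-expression map) here.
    case u: UnresolvedAttribute if u.name.equalsIgnoreCase("SPARK__PARTITION__ID") =>
      sparkPartitionId()
  }

  // Placeholder: SparkPartitionID lives in org.apache.spark.sql.execution.expressions,
  // which catalyst cannot reference directly, as noted in the comments above.
  private def sparkPartitionId(): Expression = ???
}
{code}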
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820 ]

Joseph Batchik commented on SPARK-8007:
---------------------------------------

[~rxin] Reynold, I started adding virtual columns to DataFrames and SQL queries for SPARK-8003 and SPARK-8007. My initial code is here: https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40. The one issue I ran into, though, is that the catalyst package cannot access org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For prototyping purposes I copied SparkPartitionID into the catalyst package, but I am wondering what would be the best way to deal with that dependency. Can you let me know what you think about my changes and what else needs to be done?

Support resolving virtual columns in DataFrames
-----------------------------------------------

Key: SPARK-8007
URL: https://issues.apache.org/jira/browse/SPARK-8007
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin

Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression. A cool use case is to understand physical data skew:

{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}
[jira] [Commented] (SPARK-746) Automatically Use Avro Serialization for Avro Objects
[ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597973#comment-14597973 ]

Joseph Batchik commented on SPARK-746:
--------------------------------------

Spark can currently serialize the three types of Avro records if the user specifies Kryo. Specific and Reflect records serialize just fine, if the user registers them ahead of time, since Kryo can efficiently deal with serializing classes. The problem lies in generic records, since Kryo cannot serialize them without a large amount of overhead. This causes issues for users who want to use Avro records during a shuffle. To alleviate this, I implemented a custom Kryo serializer for generic records that tries to reduce the amount of network IO: https://github.com/JDrit/spark/commit/6f1106bc20eb670e963d45a191dfc4517d46543b

This works by sending a compressed form of the schema with each message, instead of having Kryo serialize the in-memory representation itself. Since the same schema is going to be sent numerous times, it caches previously seen values to reduce the computation needed. It also allows users to register their schemas ahead of time, so that only the schema's unique ID needs to be sent with each message instead of the entire schema. Could I get some feedback on this approach, and please let me know if I am missing anything important.

Automatically Use Avro Serialization for Avro Objects
-----------------------------------------------------

Key: SPARK-746
URL: https://issues.apache.org/jira/browse/SPARK-746
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Patrick Cogan

All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher-up class as well). Since Avro records aren't Java-serializable by default, people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them.
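A minimal, self-contained sketch of the scheme described above (not the linked patch): a Kryo Serializer for Avro GenericRecords that sends either a pre-registered schema fingerprint or the full schema text, and caches parsed schemas so each distinct schema is only parsed once. The class name is illustrative, and the schema compression from the real change is omitted for brevity.

{code:java}
import java.io.ByteArrayOutputStream

import scala.collection.mutable

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.{Schema, SchemaNormalization}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// `registered` maps a 64-bit schema fingerprint to a schema known on both ends,
// so pre-registered schemas only cost 8 bytes per record on the wire.
class GenericAvroSketchSerializer(registered: Map[Long, Schema])
    extends Serializer[GenericRecord] {

  // Schemas already seen on the read side, keyed by their JSON text, so each
  // distinct schema is parsed only once.
  private val seen = mutable.Map[String, Schema]()

  override def write(kryo: Kryo, output: Output, record: GenericRecord): Unit = {
    val schema = record.getSchema
    val fingerprint = SchemaNormalization.parsingFingerprint64(schema)
    if (registered.contains(fingerprint)) {
      // Pre-registered: send only the schema's ID instead of the schema itself.
      output.writeBoolean(true)
      output.writeLong(fingerprint)
    } else {
      // Unknown schema: send the full schema text (the real change compresses this).
      output.writeBoolean(false)
      output.writeString(schema.toString)
    }
    // Encode the record with Avro's binary encoding rather than letting Kryo
    // walk the in-memory GenericData.Record representation.
    val buffer = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(buffer, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    output.writeInt(buffer.size())
    output.writeBytes(buffer.toByteArray)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[GenericRecord]): GenericRecord = {
    val schema = if (input.readBoolean()) {
      registered(input.readLong())
    } else {
      val text = input.readString()
      seen.getOrElseUpdate(text, new Schema.Parser().parse(text))
    }
    val bytes = input.readBytes(input.readInt())
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}
{code}

To use something like this, one would register it with Kryo for GenericData.Record, while Specific and Reflect records keep going through Kryo's normal class-based path as described above.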