[jira] [Created] (SPARK-9652) Make reading Avro to RDDs easier

2015-08-05 Thread Joseph Batchik (JIRA)
Joseph Batchik created SPARK-9652:
-

 Summary: Make reading Avro to RDDs easier
 Key: SPARK-9652
 URL: https://issues.apache.org/jira/browse/SPARK-9652
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Joseph Batchik


Currently, reading in Avro files requires manually creating a Hadoop RDD and a 
decent amount of configuration. It would be nice to have a wrapper function, 
similar to `def binaryFiles`, that handles all of that configuration for you.
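
For illustration, here is a minimal sketch of what such a wrapper could look 
like (the `avroFile` name and signature are hypothetical, not an existing 
Spark API):

{code:scala}
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: hides the Hadoop input format plumbing users
// currently write by hand. Note that Hadoop input formats reuse record
// objects, so the datum should be copied if the RDD will be cached.
def avroFile(sc: SparkContext, path: String): RDD[GenericRecord] = {
  sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable]
  ).map { case (key, _) => key.datum() }
}
{code}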






[jira] [Commented] (SPARK-8360) Streaming DataFrames

2015-07-31 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649729#comment-14649729
 ] 

Joseph Batchik commented on SPARK-8360:
---

Would streaming DataFrames replace streaming RDDs or coexist with them?

 Streaming DataFrames
 

 Key: SPARK-8360
 URL: https://issues.apache.org/jira/browse/SPARK-8360
 Project: Spark
  Issue Type: Umbrella
  Components: SQL, Streaming
Reporter: Reynold Xin

 Umbrella ticket to track what's needed to make streaming DataFrame a reality.






[jira] [Created] (SPARK-9486) Add aliasing to data sources to allow external packages to register themselves with Spark

2015-07-30 Thread Joseph Batchik (JIRA)
Joseph Batchik created SPARK-9486:
-

 Summary: Add aliasing to data sources to allow external packages 
to register themselves with Spark
 Key: SPARK-9486
 URL: https://issues.apache.org/jira/browse/SPARK-9486
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Joseph Batchik
Priority: Minor


Currently Spark allows users to use external data sources like spark-avro, 
spark-csv, etc. by specifying their full class name:

{code:java}
sqlContext.read.format("com.databricks.spark.avro").load(path)
{code}

Typing in a full class name is not ideal, so it would be nice to allow 
external packages to register themselves with Spark, letting users do 
something like:

{code:java}
sqlContext.read.format("avro").load(path)
{code}

This would make external data source packages follow the same convention as 
the built-in data sources do (parquet, json, jdbc, etc.).

This could be accomplished by using a ServiceLoader.
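
As a rough sketch of the ServiceLoader idea (the trait and method names here 
are hypothetical):

{code:scala}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical registration trait; an external package would implement it
// and list the implementation in
// META-INF/services/<fully.qualified.TraitName> inside its jar.
trait DataSourceRegister {
  def shortName(): String
}

// Resolve a short alias like "avro" to whichever registered provider
// claims that name, falling back to treating it as a full class name.
def lookupDataSource(name: String): Class[_] = {
  val loader = ServiceLoader.load(classOf[DataSourceRegister])
  loader.asScala.find(_.shortName() == name) match {
    case Some(provider) => provider.getClass
    case None => Class.forName(name)
  }
}
{code}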






[jira] [Comment Edited] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

2015-07-23 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963
 ] 

Joseph Batchik edited comment on SPARK-746 at 7/23/15 4:08 PM:
---

I ran some benchmarks comparing the current implementation of serializing Avro 
records against this change; the results are located here:

https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/pubhtml


was (Author: jd):
I ran some benchmarks comparing the current implementation of serializing Avro 
records against this change; the results are located here:

https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/edit?usp=sharing

 Automatically Use Avro Serialization for Avro Objects
 -

 Key: SPARK-746
 URL: https://issues.apache.org/jira/browse/SPARK-746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Cogan

 All generated objects extend org.apache.avro.specific.SpecificRecordBase (or 
 there may be a higher-level class as well).
 Since Avro records aren't Java-serializable by default, people currently have 
 to wrap their records. It would be good if we could use an implicit 
 conversion to do this for them.






[jira] [Comment Edited] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

2015-07-23 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963
 ] 

Joseph Batchik edited comment on SPARK-746 at 7/23/15 4:13 PM:
---

I ran some benchmarks comparing the current implementation of serializing Avro 
records against this change; the results are located here:

http://www.csh.rit.edu/~jd/spark/Avro%20Datapoints.xlsx


was (Author: jd):
I ran some benchmarks comparing the current implementation of serializing Avro 
records against this change; the results are located here:

https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/pubhtml

 Automatically Use Avro Serialization for Avro Objects
 -

 Key: SPARK-746
 URL: https://issues.apache.org/jira/browse/SPARK-746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Cogan

 All generated objects extend org.apache.avro.specific.SpecificRecordBase (or 
 there may be a higher-level class as well).
 Since Avro records aren't Java-serializable by default, people currently have 
 to wrap their records. It would be good if we could use an implicit 
 conversion to do this for them.






[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-23 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639187#comment-14639187
 ] 

Joseph Batchik commented on SPARK-8007:
---

With SPARK-8668, you will be able to solve this issue by doing:

{code:java}
df.groupBy(expr("spark__partition__id()"))
{code}

This will make all of these virtual columns just function calls, so no 
changes to the analyzer will be needed.
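
For example, finding data skew would then just be (assuming the expr function 
from SPARK-8668):

{code:scala}
import org.apache.spark.sql.functions.expr

// Count rows per partition to surface skew; assumes spark__partition__id
// is registered as a SQL function.
df.groupBy(expr("spark__partition__id()")).count().show()
{code}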

 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Joseph Batchik

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}






[jira] [Commented] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

2015-07-22 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637963#comment-14637963
 ] 

Joseph Batchik commented on SPARK-746:
--

I ran some benchmarks comparing the current implementation of serializing Avro 
records against this change; the results are located here:

https://docs.google.com/a/cloudera.com/spreadsheets/d/16JXO80O1Fh9bTIhx_0PZcCd5gvGTobX6xoe17vBEGH0/edit?usp=sharing

 Automatically Use Avro Serialization for Avro Objects
 -

 Key: SPARK-746
 URL: https://issues.apache.org/jira/browse/SPARK-746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Cogan

 All generated objects extend org.apache.avro.specific.SpecificRecordBase (or 
 there may be a higher-level class as well).
 Since Avro records aren't Java-serializable by default, people currently have 
 to wrap their records. It would be good if we could use an implicit 
 conversion to do this for them.






[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-07-21 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636180#comment-14636180
 ] 

Joseph Batchik commented on SPARK-8668:
---

Does this look like what you were thinking?

https://github.com/JDrit/spark/commit/7fcf18a11427709d403418da8d444b434c63
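
For reference, usage would look something like this (a sketch based on the 
proposal, not the final API):

{code:scala}
import org.apache.spark.sql.functions.expr

// expr parses a SQL expression string into a Column, so it can be
// mixed freely with regular Column arguments.
df.select(expr("abs(value)"), expr("key + 1").as("next"))
{code}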

 expr function to convert SQL expression into a Column
 -

 Key: SPARK-8668
 URL: https://issues.apache.org/jira/browse/SPARK-8668
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 selectExpr uses the expression parser to parse string expressions. It would be 
 great to create an expr function in functions.scala/functions.py that 
 converts a string into an expression (or a list of expressions separated by 
 commas).






[jira] [Comment Edited] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820
 ] 

Joseph Batchik edited comment on SPARK-8007 at 7/17/15 6:01 AM:


Reynold, I started adding virtual columns to DataFrames and SQL queries for 
SPARK-8003 and SPARK-8007. My initial code is here: 
https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40

The one issue I ran into, though, was that the catalyst package cannot access 
org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For 
prototyping purposes I copied SparkPartitionID into the catalyst package, but I 
am wondering what would be the best way to deal with that dependency.

Can you let me know what you think about my changes and what else needs to be 
done?


was (Author: jd):
[~rxin] Reynold, I started adding virtual columns to DataFrames and SQL 
queries for SPARK-8003 and SPARK-8007. My initial code is here: 
https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40

The one issue I ran into, though, was that the catalyst package cannot access 
org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For 
prototyping purposes I copied SparkPartitionID into the catalyst package, but I 
am wondering what would be the best way to deal with that dependency.

Can you let me know what you think about my changes and what else needs to be 
done?

 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}






[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-17 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631673#comment-14631673
 ] 

Joseph Batchik commented on SPARK-8007:
---

Reynold, thanks for pointing that out. I updated the commit to use what you 
suggested. This should also make it easy to add other virtual columns, as 
described in the parent ticket. All that should be needed is updating the 
resolver in the logical plan and adding the new virtual column rule.

https://github.com/JDrit/spark/commit/7b46e7de6f98df98480fa34c85248aa2d90bc635#diff-d74f782d414a74eee09a4b6b9994be87R34
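
A simplified sketch of what such a rule could look like (names and packages 
are approximate, not the actual commit):

{code:scala}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Simplified sketch: rewrite references to known virtual columns into
// their backing expressions during analysis.
object ResolveVirtualColumns extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan transformAllExpressions {
      case UnresolvedAttribute(Seq("SPARK__PARTITION__ID")) =>
        SparkPartitionID() // the expression from sql.execution.expressions
    }
}
{code}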

 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}






[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-16 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630820#comment-14630820
 ] 

Joseph Batchik commented on SPARK-8007:
---

[~rxin] Reynold, I started adding virtual columns to DataFrames and SQL 
queries for SPARK-8003 and SPARK-8007. My initial code is here: 
https://github.com/JDrit/spark/commit/e34d3a7eabbc9c41c2dd85b128b2bb5713039e40

The one issue I ran into, though, was that the catalyst package cannot access 
org.apache.spark.sql.execution.expressions, where SparkPartitionID resides. For 
prototyping purposes I copied SparkPartitionID into the catalyst package, but I 
am wondering what would be the best way to deal with that dependency.

Can you let me know what you think about my changes and what else needs to be 
done?

 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}






[jira] [Commented] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

2015-06-23 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597973#comment-14597973
 ] 

Joseph Batchik commented on SPARK-746:
--

Spark can currently serialize the three types of Avro records if the user 
specifies Kryo. Specific and Reflect records serialize just fine if the user 
registers them ahead of time, since Kryo can efficiently serialize such 
classes. The problem lies in generic records, since Kryo cannot serialize them 
without a large amount of overhead. This causes issues for users who want to 
use Avro records during a shuffle. To alleviate this, I implemented a custom 
Kryo serializer for generic records that tries to reduce the amount of network 
IO.

https://github.com/JDrit/spark/commit/6f1106bc20eb670e963d45a191dfc4517d46543b

This works by sending a compressed form of the schema with each message and 
having Kryo serialize the in-memory representation itself. Since the same 
schema is going to be sent numerous times, it caches previously seen values to 
reduce the computation needed. It also allows users to register their schemas 
ahead of time, which lets it send just the schema's unique ID with each 
message instead of the entire schema.
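
To make the idea concrete, here is a stripped-down sketch of such a 
serializer (the linked commit adds the schema compression, fingerprinting, 
and caching; this version inlines the full schema for clarity):

{code:scala}
import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

class GenericAvroSerializer extends Serializer[GenericRecord] {

  override def write(kryo: Kryo, output: Output, record: GenericRecord): Unit = {
    // Stand-in for the compressed/cached schema or registered fingerprint.
    output.writeString(record.getSchema.toString)
    // Encode the record itself as Avro binary.
    val bytes = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
    new GenericDatumWriter[GenericRecord](record.getSchema).write(record, encoder)
    encoder.flush()
    output.writeInt(bytes.size())
    output.writeBytes(bytes.toByteArray)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[GenericRecord]): GenericRecord = {
    // Recover the schema, then decode the Avro binary payload with it.
    val schema = new Schema.Parser().parse(input.readString())
    val data = new Array[Byte](input.readInt())
    input.readBytes(data)
    val decoder = DecoderFactory.get().binaryDecoder(data, null)
    new GenericDatumReader[GenericRecord](schema).read(null, decoder)
  }
}
{code}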

Could I get some feedback on this approach? Please let me know if I am missing 
anything important.

 Automatically Use Avro Serialization for Avro Objects
 -

 Key: SPARK-746
 URL: https://issues.apache.org/jira/browse/SPARK-746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Cogan

 All generated objects extend org.apache.avro.specific.SpecificRecordBase (or 
 there may be a higher-level class as well).
 Since Avro records aren't Java-serializable by default, people currently have 
 to wrap their records. It would be good if we could use an implicit 
 conversion to do this for them.


