Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
Hey Sean, Right now we don't publish every 2.11 binary, to avoid a combinatorial explosion in the number of build artifacts we publish (there are other parameters, such as whether Hive is included, etc.). We can revisit this in future feature releases, but .1 releases like this one are reserved for bug fixes. - Patrick

On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: We're using Spark on Scala 2.11 w/ Hadoop 2.4. Would it be practical / make sense to build a bin version of Spark against Scala 2.11 for versions other than just hadoop1 at this time? Cheers, Sean

On Jan 27, 2015, at 12:04 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc1/

Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1061/

The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.1! The vote is open until Friday, January 30, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn. To learn more about Apache Spark, please see http://spark.apache.org/

- Patrick
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
Sounds good, that makes sense. Cheers, Sean

On Jan 27, 2015, at 11:35 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, Right now we don't publish every 2.11 binary, to avoid a combinatorial explosion in the number of build artifacts we publish (there are other parameters, such as whether Hive is included, etc.). We can revisit this in future feature releases, but .1 releases like this one are reserved for bug fixes. - Patrick
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
Okay - we've resolved all issues with the signatures and keys. However, I'll leave the current vote open for a bit to solicit additional feedback.

On Tue, Jan 27, 2015 at 10:43 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: Sounds good, that makes sense. Cheers, Sean
Re: renaming SchemaRDD -> DataFrame
Koert, As Mark said, I have already refactored the API so that nothing in catalyst is exposed (and users won't need it anyway). Data types and Row interfaces are both outside the catalyst package, in org.apache.spark.sql.

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote: hey matei, i think that stuff such as SchemaRDD, columnar storage and perhaps also query planning can be re-used by many systems that do analysis on structured data. i can imagine pandas-like systems, but also datalog or scalding-like ones (which we use at tresata, and which i might rebase on SchemaRDD at some point). SchemaRDD should become the interface for all these, and columnar storage abstractions should be re-used between all of them. currently the sql tie-in goes way beyond just the (perhaps unfortunate) naming convention. for example, a core part of the SchemaRDD abstraction is Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone that wants to build on top of SchemaRDD to dig into catalyst, a SQL parser (if i understand it correctly; i have not used catalyst, but it looks neat). i should not need to include a SQL parser just to use structured data in, say, a pandas-like framework. best, koert

On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library. Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote: "The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API." i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube video contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ

From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org Sent: Monday, January 26, 2015 4:01 PM Subject: Re: renaming SchemaRDD -> DataFrame One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com wrote: Hi, We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion. The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had the data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users. There are two motivations for the renaming: 1. DataFrame seems to be a more self-evident name than SchemaRDD. 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods. My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
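To make the 1:1 correspondence mentioned above concrete, here is a hedged sketch of how a round trip might look; it assumes the 1.3-style DataFrame.rdd accessor and SQLContext.createDataFrame method described in this thread, which were still being finalized at the time:

    import org.apache.spark.sql.{DataFrame, SQLContext}

    def roundTrip(sqlContext: SQLContext, df: DataFrame): DataFrame = {
      val rows     = df.rdd                   // DataFrame -> underlying RDD[Row]
      val filtered = rows.filter(_ != null)   // any plain RDD transformation
      sqlContext.createDataFrame(filtered, df.schema)  // RDD[Row] -> DataFrame
    }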
Re: renaming SchemaRDD -> DataFrame
Reynold, But with a type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame -- unless we have an implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com: Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame. In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.

On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code being added to DataFrame? With this, we don't impact existing code for the next few releases.

2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com: I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open source columnar database like Vertica. There is a column store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB. But there's a true need for such a system. I wonder why that is so, and it's high time to change it.

On Jan 26, 2015 5:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive. Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be more clear from a user-facing perspective to at least choose a package name for it that omits sql. I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point? -Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: (Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library. Matei
Re: renaming SchemaRDD -> DataFrame
Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame. In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.

On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code being added to DataFrame? With this, we don't impact existing code for the next few releases.
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
Yes - the key issue is just due to me creating new keys this time around. Anyways, let's take another stab at this. In the meantime, please don't hesitate to test the release itself. - Patrick

On Tue, Jan 27, 2015 at 10:00 AM, Sean Owen so...@cloudera.com wrote: Got it. Ignore the SHA512 issue, since these aren't somehow expected by a policy or Maven to be in a certain format; I just wondered if the difference was intended. The Maven way of generating the SHA1 hashes is to set this on the install plugin, AFAIK, although I'm not sure if the intent was to hash files that Maven didn't create:

    <configuration>
      <createChecksum>true</createChecksum>
    </configuration>

As for the key issue, I think it's just a matter of uploading the new key in both places. We should all of course test the release anyway.

On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean, The release script generates hashes in two places (take a look a bit further down in the script), one for the published artifacts and the other for the binaries. In the case of the binaries we use SHA512 because, AFAIK, the ASF does not require you to use SHA1, and SHA512 is better. In the case of the published Maven artifacts we use SHA1 because my understanding is that this is what Maven requires. However, it does appear that the format is now one that Maven cannot parse. Anyways, it seems fine to just change the format of the hash per your PR. - Patrick
Re: renaming SchemaRDD -> DataFrame
that's great. guess i was looking at a somewhat stale master branch...

On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin r...@databricks.com wrote: Koert, As Mark said, I have already refactored the API so that nothing in catalyst is exposed (and users won't need it anyway). Data types and Row interfaces are both outside the catalyst package, in org.apache.spark.sql.
Re: renaming SchemaRDD -> DataFrame
It has been pretty evident for some time that that's what it is, hasn't it? Yes, that's a better name IMO.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com wrote: Hi, We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
Re: talk on interface design
Thanks, Andrew. That's great material.

On Mon, Jan 26, 2015 at 10:23 PM, Andrew Ash and...@andrewash.com wrote: In addition to the references you have at the end of the presentation, there's a great set of practical examples based on the learnings from Qt posted here: http://www21.in.tum.de/~blanchet/api-design.pdf Chapter 4's way of showing a principle and then an example from Qt is particularly instructive.

On Tue, Jan 27, 2015 at 1:05 AM, Reynold Xin r...@databricks.com wrote: Hi all, In Spark, we have historically done reasonably well in interface and API design, especially compared with some other Big Data systems. However, we have also made mistakes along the way. I want to share a talk I gave about interface design at Databricks' internal retreat: https://speakerdeck.com/rxin/interface-design-for-spark-community Interface design is a vital part of Spark becoming a long-term sustainable, thriving framework. Good interfaces can be a project's biggest asset, while bad interfaces can be its worst technical debt. As the project scales, the community is expanding, and we are getting a wider range of contributors who have not had to think about this in their everyday development experience outside Spark. Interface design is part art, part science, and in some sense an acquired taste. However, I think there are common issues that can be spotted easily, and common principles that can address a lot of the low-hanging fruit. Through this presentation, I hope to bring the issue of interface design to everybody's attention and encourage everybody to think hard about it in their contributions.
Re: renaming SchemaRDD -> DataFrame
The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type. Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho dirceu.semigh...@gmail.com wrote: Reynold, But with a type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame -- unless we have an implicit conversion between DataFrame and SchemaRDD.
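To illustrate Matei's point, a minimal sketch of how a type alias preserves source compatibility. The stand-in classes below are hypothetical, not the real Spark definitions (the real alias ships in org.apache.spark.sql in 1.3):

    // Toy stand-ins: a type alias is the same type under a second name.
    object sql {
      class DataFrame { def count(): Long = 42L }
      type SchemaRDD = DataFrame   // alias, not a subclass or wrapper
    }

    object LegacyCode {
      import sql._
      // A signature written against the old name still accepts a DataFrame,
      // because the compiler sees one and the same type.
      def rowCount(data: SchemaRDD): Long = data.count()
    }

    object Demo extends App {
      println(LegacyCode.rowCount(new sql.DataFrame))  // prints 42
    }

This is also why no implicit conversion is needed: there is nothing to convert between, since the two names denote a single type.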
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
+1

1. Compiled OSX 10.10 (Yosemite) OK. Total time: 12:55 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, MLlib - running as well as comparing results with 1.1.x / 1.2.0
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK; Center And Scale OK; Fixed: org.apache.spark.SparkException in zip!
2.5. rdd operations OK; State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. recommendation OK

Cheers, k/

On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1!
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
+1 Tested on Mac OS X

On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar ksanka...@gmail.com wrote: +1
Re: Friendly reminder/request to help with reviews!
Hi Patrick: I would love to help with reviewing in any way I can. I'm fairly new here. Can you help with a pointer to get me started? Thanks

From: Patrick Wendell pwend...@gmail.com To: dev@spark.apache.org Sent: Tuesday, January 27, 2015 3:56 PM Subject: Friendly reminder/request to help with reviews! Hey All, Just a reminder: as always around release time, we see a very large volume of patches show up near the deadline.
Friendly reminder/request to help with reviews!
Hey All,

Just a reminder: as always around release time, we see a very large volume of patches show up near the deadline. One thing that can help us maximize the number of patches we get in is to have community involvement in performing code reviews. In particular, a thorough review and signing off on a patch with LGTM can substantially increase the odds we can merge a patch confidently. If you are newer to Spark, finding a single area of the codebase to focus on can still provide a lot of value to the project in the reviewing process.

Cheers, and good luck to everyone with work for this release.

- Patrick
Re: Use mvn to build Spark 1.2.0 failed
You certainly do not need to build Spark as root. It might clumsily overcome a permissions problem in your local env, but it probably causes other problems.

On Jan 27, 2015 11:18 AM, angel__ angel.alvarez.pas...@gmail.com wrote: I had that problem when I tried to build Spark 1.2. I don't exactly know what is causing it, but I guess it might have something to do with user permissions. I could finally fix it by building Spark as the root user (now I'm dealing with another problem, but ...that's another story...)
Re: Use mvn to build Spark 1.2.0 failed
I had that problem when I tried to build Spark 1.2. I don't exactly know what is causing it, but I guess it might have something to do with user permissions. I could finally fix it by building Spark as the root user (now I'm dealing with another problem, but ...that's another story...)
Re: renaming SchemaRDD -> DataFrame
I'm +1 on this, although a little worried about unknowingly introducing Spark SQL dependencies every time someone wants to use this. It would be great if the interface could be abstract and the implementation (in this case, the Spark SQL backend) could be swapped out. One alternative suggestion on the name - why not call it DataTable? DataFrame seems like a name carried over from pandas (and, by extension, R), and it's never been obvious to me what a Frame is.

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: (Actually when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
I think there are several signing / hash issues that should be fixed before this release.

Hashes: http://issues.apache.org/jira/browse/SPARK-5308 https://github.com/apache/spark/pull/4161 The hashes here are correct, but have two issues. As noted in the JIRA, the format of the hash file is nonstandard -- at least, it doesn't match what Maven outputs, and apparently what tools like Leiningen expect, which is just the hash with no file name or spaces. There are two ways to fix that: different command-line tools (see PR), or just asking Maven to generate these hashes (a different, easy PR). However, is the script I modified above used to generate these hashes? It's generating SHA1 sums, but the output in this release candidate has (correct) SHA512 sums. This may be more than a nuisance, since last time for some reason Maven Central did not register the project hashes. http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar does not show them, but they exist: http://www.us.apache.org/dist/spark/spark-1.2.0/ It may add up to a problem worth rooting out before this release.

Signing: As noted in https://issues.apache.org/jira/browse/SPARK-5299 there are two signing keys in https://people.apache.org/keys/committer/pwendell.asc (9E4FE3AF, 00799F7E) but only one is in http://www.apache.org/dist/spark/KEYS. However, these artifacts seem to be signed by FC8ED089, which isn't in either.

Details, details, but I'd say non-binding -1 at the moment.

On Tue, Jan 27, 2015 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.1!
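For reference, a small JVM-side sketch (an illustrative standalone utility under assumed usage, not part of Spark's release tooling) that prints a file's SHA-512 in the bare format described above -- just the hex digest, with no file name or spaces:

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    // Print args(0)'s SHA-512 as a bare lowercase hex string, the layout
    // that Maven-style tools parse (no trailing "  filename" column).
    object BareSha512 extends App {
      val bytes  = Files.readAllBytes(Paths.get(args(0)))
      val digest = MessageDigest.getInstance("SHA-512").digest(bytes)
      println(digest.map("%02x".format(_)).mkString)
    }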
Re: Maximum size of vector that reduce can handle
I am running into this issue as well, when storing large Arrays as the value in a kv pair and then doing a reduceByKey. Can one of the experts please comment on whether it would make sense to add an operation that adds values in place, like accumulators do? This would essentially merge the vectors for a given key in place, avoiding multiple allocations of temp arrays/vectors. It should be faster for datasets with frequently repeated keys.

On Tue, Jan 27, 2015 at 11:03 AM, Xiangrui Meng men...@gmail.com wrote: A 60m-element vector costs 480MB of memory. You have 12 of them to be reduced on the driver, so you need ~6GB of memory, not counting the temp vectors generated from '_+_'. You need to increase driver memory to make it work. That being said, ~10^7 hits the limit for the current impl of GLM. -Xiangrui

On Jan 23, 2015 2:19 PM, DB Tsai dbt...@dbtsai.com wrote: Hi Alexander, `reduce` is an action that will collect all the data from the mappers to the driver and perform the aggregation there. As a result, if the output from the mappers is very large and the number of partitions is large, it can cause a problem. For `treeReduce`, as the name indicates, the first layer aggregates the output of the mappers two by two, resulting in half the number of outputs, and then the aggregation continues layer by layer. The final aggregation is still done in the driver, but by that time the amount of data is small. By default, depth 2 is used, so if you have many partitions of large vectors, this may still cause an issue. You can increase the depth so that in the final reduce in the driver, the number of remaining partial results is very small. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, Jan 23, 2015 at 12:07 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi DB Tsai, Thank you for your suggestion. Actually, I started my experiments with treeReduce. Originally I had vv.treeReduce(_ + _, 2) in my script, exactly because the MLlib optimizers use it, as you pointed out with LBFGS. However, it leads to the same problems as reduce, though presumably less directly. As far as I understand, treeReduce limits the number of communications between workers and master by forcing workers to partially compute the reduce operation. Are you sure that the driver will first collect all results (or all partial results, in treeReduce) and ONLY then perform aggregation? If that is the problem, how can I force it to do the aggregation after receiving each portion of data from the workers? Best regards, Alexander

-Original Message- From: DB Tsai [mailto:dbt...@dbtsai.com] Sent: Friday, January 23, 2015 11:53 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Maximum size of vector that reduce can handle Hi Alexander, When you use `reduce` to aggregate the vectors, they will actually be pulled into the driver and merged over there. Obviously, that's not scalable given you are doing deep neural networks, which have so many coefficients. Please try treeReduce instead, which is what we do in linear regression and logistic regression. See https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala for example:

    val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
      seqOp = (c, v) => (c, v) match {
        case ((grad, loss), (label, features)) =>
          val l = localGradient.compute(features, label, bcW.value, grad)
          (grad, loss + l)
      },
      combOp = (c1, c2) => (c1, c2) match {
        case ((grad1, loss1), (grad2, loss2)) =>
          axpy(1.0, grad2, grad1)
          (grad1, loss1 + loss2)
      })

Sincerely, DB Tsai --- Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, Jan 23, 2015 at 10:00 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Dear Spark developers, I am trying to measure Spark reduce performance for big vectors. My motivation is related to machine learning gradients: a gradient is a vector that is computed on each worker, and then all the results need to be summed up and broadcast back to the workers. Present machine learning applications involve very long parameter vectors; for deep neural networks they can have up to 2 billion entries. So I want to measure the time needed for this operation depending on the size of the vector and the number of workers. I wrote a few lines of code that assume Spark will distribute partitions among all available workers. I have a 6-machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM), each machine running 2 workers.
import org.apache.spark.mllib.rdd.RDDFunctions._
import breeze.linalg._
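To make the depth suggestion concrete, here is a minimal, self-contained sketch (a toy local-mode demo with deliberately tiny vectors, not Alexander's actual benchmark) using the treeReduce that mllib's RDDFunctions provides in 1.2:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.rdd.RDDFunctions._
    import breeze.linalg.DenseVector

    object TreeReduceDepthDemo extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("treeReduce-depth").setMaster("local[4]"))
      // 12 partitions, one vector each; a real experiment would use ~10^7 entries.
      val vv = sc.parallelize(1 to 12, 12).map(_ => DenseVector.fill(1000)(1.0))
      // depth = 4: extra intermediate rounds, so far fewer partial sums reach
      // the driver than with vv.reduce(_ + _) or the default depth of 2.
      val sum = vv.treeReduce(_ + _, 4)
      println(sum(0))  // 12.0
      sc.stop()
    }

And for the in-place merge question that opened this thread, a hedged sketch of how the existing aggregateByKey API (available since Spark 1.1) can already sum the arrays for each key into a single mutable buffer, without a new operation; the names here are illustrative only:

    import org.apache.spark.SparkContext._   // pair-RDD functions in pre-1.3 Spark
    import org.apache.spark.rdd.RDD

    // Sum all Array[Double] values per key into one reused buffer. A fresh
    // zero buffer is deserialized per key, so mutating and returning it in
    // both functions is safe and avoids a temp allocation per merge.
    def sumArraysByKey(pairs: RDD[(Int, Array[Double])], n: Int): RDD[(Int, Array[Double])] =
      pairs.aggregateByKey(new Array[Double](n))(
        (buf, v) => { var i = 0; while (i < n) { buf(i) += v(i); i += 1 }; buf },
        (b1, b2) => { var i = 0; while (i < n) { b1(i) += b2(i); i += 1 }; b1 })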
Re: renaming SchemaRDD -> DataFrame
In master, Reynold has already taken care of moving Row into org.apache.spark.sql; so, even though the implementation of Row (and GenericRow et al.) is in Catalyst (which is more optimizer than parser), that needn't be of concern to users of the API in its most recent state. On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote: hey matei, i think that stuff such as SchemaRDD, columar storage and perhaps also query planning can be re-used by many systems that do analysis on structured data. i can imagine panda-like systems, but also datalog or scalding-like (which we use at tresata and i might rebase on SchemaRDD at some point). SchemaRDD should become the interface for all these. and columnar storage abstractions should be re-used between all these. currently the sql tie in is way beyond just the (perhaps unfortunate) naming convention. for example a core part of the SchemaRD abstraction is Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone that want to build on top of SchemaRDD to dig into catalyst, a SQL Parser (if i understand it correctly, i have not used catalyst, but it looks neat). i should not need to include a SQL parser just to use structured data in say a panda-like framework. best, koert On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL though is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library. Matei On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote: The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. i agree. this to me also implies it belongs in spark core, not sql On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 26, 2015 4:01 PM Subject: Re: renaming SchemaRDD - DataFrame One thing potentially not clear from this e-mail, there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame. On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com wrote: Hi, We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted to get the community's opinion. The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, always has the data-frame like API. In 1.3, we are redesigning the API to make the API usable for end users. There are two motivations for the renaming: 1. 
1. DataFrame seems to be a more self-evident name than SchemaRDD.
2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it will still offer some RDD-like functions such as map and flatMap), and calling it Schema*RDD* when it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.

My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias so that the name SchemaRDD still refers to DataFrame. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
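Two of the points above lend themselves to small sketches. First, with Row relocated to org.apache.spark.sql, user code can construct rows and move between an RDD and a DataFrame without touching anything in catalyst. A minimal sketch against the 1.3-era API (the exact signatures were still settling at the time, so treat this as illustrative, not as the committed interface):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val sc = new SparkContext(new SparkConf().setAppName("rowDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Row now lives in org.apache.spark.sql -- no catalyst import required
    val data: RDD[Row] = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))

    // the 1:1 correspondence Patrick mentions: RDD -> DataFrame and back again
    val df: DataFrame = sqlContext.createDataFrame(data, schema)
    val back: RDD[Row] = df.rdd

Second, one hypothetical way the backward-compatibility alias could look; the packaging here is an assumption on my part, not the committed design:

    package org.apache.spark

    package object sql {
      // old sources that name SchemaRDD keep compiling, but now point at DataFrame
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame
    }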
Re: renaming SchemaRDD -> DataFrame
I personally have no preference between DataFrame and DataTable, but I do wish to lay out the history and etymology, simply because I'm into that sort of thing. "Frame" comes from Marvin Minsky's 1970s AI construct: slots and the data that goes in them. The S programming language (the precursor to R) adopted the terminology in 1991, and R of course became popular with the rise of data science around 2012. http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22cmpt=qtz= DataFrame carries the implication that the data comes along with its own metadata, whereas DataTable might suggest that the metadata lives in a central repository. DataFrame is thus technically the more correct term for SchemaRDD, but it is less familiar (and so less accessible) to those not immersed in data science or AI, and may have narrower appeal.

- Original Message - From: Evan R. Sparks evan.spa...@gmail.com To: Matei Zaharia matei.zaha...@gmail.com Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com; Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com; dev@spark.apache.org Sent: Tuesday, January 27, 2015 9:55 AM Subject: Re: renaming SchemaRDD -> DataFrame

I'm +1 on this, although a little worried about unknowingly introducing Spark SQL dependencies every time someone wants to use this. It would be great if the interface could be abstract and the implementation (in this case, the Spark SQL backend) could be swapped out. One alternative suggestion on the name: why not call it DataTable? DataFrame seems like a name carried over from pandas (and, by extension, R), and it's never been obvious to me what a Frame is.

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

(Actually, when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.) Matei
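Evan's point about keeping the interface abstract can be made concrete with a hypothetical sketch; none of these names are actual Spark APIs -- they only illustrate how a Spark SQL-backed implementation could sit behind a neutral interface:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.StructType

    // a backend-agnostic view of structured data (illustrative only)
    trait DataTable {
      def schema: StructType
      def select(columns: String*): DataTable
      def rows: RDD[Row]
    }

    // one possible implementation, backed by Spark SQL's DataFrame
    class SqlBackedTable(df: DataFrame) extends DataTable {
      def schema: StructType = df.schema
      def select(columns: String*): DataTable =
        new SqlBackedTable(df.select(columns.head, columns.tail: _*))
      def rows: RDD[Row] = df.rdd
    }

A pandas-like or scalding-like layer could then program against DataTable without compiling in a direct dependency on catalyst.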
Re: [VOTE] Release Apache Spark 1.2.1 (RC1)
Got it. Ignore the SHA512 issue, then, since these hashes aren't expected by policy or by Maven to be in a certain format; I just wondered whether the difference was intended. The Maven way of generating the SHA1 hashes is to set the following on the install plugin, AFAIK, although I'm not sure whether the intent was to hash files that Maven didn't create:

    <configuration>
      <createChecksum>true</createChecksum>
    </configuration>

As for the key issue, I think it's just a matter of uploading the new key in both places. We should all, of course, test the release anyway.

On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Sean, The release script generates hashes in two places (take a look a bit further down in the script): one for the published artifacts and the other for the binaries. In the case of the binaries we use SHA512 because, AFAIK, the ASF does not require you to use SHA1, and SHA512 is better. In the case of the published Maven artifacts we use SHA1, because my understanding is that this is what Maven requires. However, it does appear that the format is now one that Maven cannot parse. Anyways, it seems fine to just change the format of the hash per your PR. - Patrick