Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Hey Sean,

Right now we don't publish every 2.11 binary to avoid combinatorial
explosion of the number of build artifacts we publish (there are other
parameters such as whether hive is included, etc). We can revisit this
in future feature releases, but .1 releases like this are reserved for
bug fixes.

- Patrick

On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
sean.mcnam...@webtrends.com wrote:
 We're using spark on scala 2.11 w/ hadoop 2.4.  Would it be practical / make 
 sense to build a bin version of spark against scala 2.11 for versions other 
 than just hadoop1 at this time?

 Cheers,

 Sean


 On Jan 27, 2015, at 12:04 AM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!

 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean McNamara
Sounds good, that makes sense.

Cheers,

Sean

 On Jan 27, 2015, at 11:35 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey Sean,
 
 Right now we don't publish every 2.11 binary to avoid combinatorial
 explosion of the number of build artifacts we publish (there are other
 parameters such as whether hive is included, etc). We can revisit this
 in future feature releases, but .1 releases like this are reserved for
 bug fixes.
 
 - Patrick
 
 On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:
 We're using spark on scala 2.11 w/ hadoop 2.4.  Would it be practical / make 
 sense to build a bin version of spark against scala 2.11 for versions other 
 than just hadoop1 at this time?
 
 Cheers,
 
 Sean
 
 
 On Jan 27, 2015, at 12:04 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!
 
 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
 
 Please vote on releasing this package as Apache Spark 1.2.1!
 
 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...
 
 For a list of fixes in this release, see http://s.apache.org/Mpn.
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 - Patrick
 



Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Okay - we've resolved all issues with the signatures and keys.
However, I'll leave the current vote open for a bit to solicit
additional feedback.

On Tue, Jan 27, 2015 at 10:43 AM, Sean McNamara
sean.mcnam...@webtrends.com wrote:
 Sounds good, that makes sense.

 Cheers,

 Sean

 On Jan 27, 2015, at 11:35 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Sean,

 Right now we don't publish every 2.11 binary to avoid combinatorial
 explosion of the number of build artifacts we publish (there are other
 parameters such as whether hive is included, etc). We can revisit this
 in future feature releases, but .1 releases like this are reserved for
 bug fixes.

 - Patrick

 On Tue, Jan 27, 2015 at 10:31 AM, Sean McNamara
 sean.mcnam...@webtrends.com wrote:
 We're using spark on scala 2.11 w/ hadoop 2.4.  Would it be practical / make 
 sense to build a bin version of spark against scala 2.11 for versions other 
 than just hadoop1 at this time?

 Cheers,

 Sean


 On Jan 27, 2015, at 12:04 AM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!

 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick




Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Reynold Xin
Koert,

As Mark said, I have already refactored the API so that nothing in catalyst
is exposed (and users won't need it anyway). Data types and Row interfaces
are both outside the catalyst package, in org.apache.spark.sql.
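
As a rough illustration of what that refactoring means for user code (an
editorial sketch, not code from the thread; it assumes the Spark 1.3 layout
where Row lives in org.apache.spark.sql):

    // Only the public SQL package is needed; nothing from catalyst is imported.
    import org.apache.spark.sql.Row

    object RowExample {
      def main(args: Array[String]): Unit = {
        val r = Row(1, "spark", 3.14)        // build a generic row
        val id: Int = r.getInt(0)            // typed field access
        val name: String = r.getString(1)
        println(s"$id $name ${r.getDouble(2)}")
      }
    }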

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote:

 hey matei,
  i think that stuff such as SchemaRDD, columnar storage and perhaps also
 query planning can be re-used by many systems that do analysis on
 structured data. i can imagine panda-like systems, but also datalog or
 scalding-like (which we use at tresata and i might rebase on SchemaRDD at
 some point). SchemaRDD should become the interface for all these. and
 columnar storage abstractions should be re-used between all these.

  currently the sql tie-in goes way beyond just the (perhaps unfortunate)
  naming convention. for example a core part of the SchemaRDD abstraction is
  Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
  who wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
 (if i understand it correctly, i have not used catalyst, but it looks
 neat). i should not need to include a SQL parser just to use structured
 data in say a panda-like framework.

 best, koert


 On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 While it might be possible to move this concept to Spark Core long-term,
 supporting structured data efficiently does require quite a bit of the
 infrastructure in Spark SQL, such as query planning and columnar storage.
 The intent of Spark SQL though is to be more than a SQL server -- it's
 meant to be a library for manipulating structured data. Since this is
 possible to build over the core API, it's pretty natural to organize it
 that way, same as Spark Streaming is a library.

 Matei

  On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
 
  The context is that SchemaRDD is becoming a common data format used for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API.
 
  i agree. this to me also implies it belongs in spark core, not sql
 
  On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
  michaelma...@yahoo.com.invalid wrote:
 
  And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
 Area
  Spark Meetup YouTube contained a wealth of background information on
 this
  idea (mostly from Patrick and Reynold :-).
 
  https://www.youtube.com/watch?v=YWppYPWznSQ
 
  
  From: Patrick Wendell pwend...@gmail.com
  To: Reynold Xin r...@databricks.com
  Cc: dev@spark.apache.org dev@spark.apache.org
  Sent: Monday, January 26, 2015 4:01 PM
  Subject: Re: renaming SchemaRDD - DataFrame
 
 
  One thing potentially not clear from this e-mail, there will be a 1:1
  correspondence where you can get an RDD to/from a DataFrame.
 
 
  On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
 wrote:
  Hi,
 
  We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted
 to
  get the community's opinion.
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
  and
  more users to be programming directly against SchemaRDD API rather
 than
  the
  core RDD API. SchemaRDD, through its less commonly used DSL originally
  designed for writing test cases, always has the data-frame like API.
 In
  1.3, we are redesigning the API to make the API usable for end users.
 
 
  There are two motivations for the renaming:
 
  1. DataFrame seems to be a more self-evident name than SchemaRDD.
 
  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
  though it would contain some RDD functions like map, flatMap, etc),
 and
  calling it Schema*RDD* while it is not an RDD is highly confusing.
  Instead,
  DataFrame.rdd will return the underlying RDD for all RDD methods.
 
 
  My understanding is that very few users program directly against the
  SchemaRDD API at the moment, because they are not well documented.
  However,
  to maintain backward compatibility, we can create a type alias
 DataFrame
  that is still named SchemaRDD. This will maintain source compatibility
  for
  Scala. That said, we will have to update all existing materials to use
  DataFrame rather than SchemaRDD.
 
 
 





Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Dirceu Semighini Filho
Reynold,
But with type alias we will have the same problem, right?
If the methods don't receive SchemaRDD anymore, we will have to change
our code to migrate from SchemaRDD to DataFrame, unless we have an implicit
conversion between DataFrame and SchemaRDD.



2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:

 Dirceu,

 That is not possible because one cannot overload return types.

 SQLContext.parquetFile (and many other methods) needs to return some type,
 and that type cannot be both SchemaRDD and DataFrame.

 In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
 break source compatibility for Scala.


 On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:

 Can't the SchemaRDD remain the same, but deprecated, and be removed in
 release 1.5 (+/- 1), for example, with the new code being added to DataFrame?
 With this, we don't impact existing code for the next few releases.



 2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com:

  I want to address the issue that Matei raised about the heavy lifting
  required for a full SQL support. It is amazing that even after 30 years
 of
  research there is not a single good open source columnar database like
  Vertica. There is a column store option in MySQL, but it is not nearly
 as
  sophisticated as Vertica or MonetDB. But there's a true need for such a
  system. I wonder why that is, and it's high time to change that.
  On Jan 26, 2015 5:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
   Both SchemaRDD and DataFrame sound fine to me, though I like the
 former
   slightly better because it's more descriptive.
  
   Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
 would
   be more clear from a user-facing perspective to at least choose a
 package
   name for it that omits sql.
  
   I would also be in favor of adding a separate Spark Schema module for
  Spark
   SQL to rely on, but I imagine that might be too large a change at this
   point?
  
   -Sandy
  
   On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
 matei.zaha...@gmail.com
   wrote:
  
(Actually when we designed Spark SQL we thought of giving it another
   name,
like Spark Schema, but we decided to stick with SQL since that was
 the
   most
obvious use case to many users.)
   
Matei
   
 On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
 matei.zaha...@gmail.com
wrote:

 While it might be possible to move this concept to Spark Core
   long-term,
supporting structured data efficiently does require quite a bit of
 the
infrastructure in Spark SQL, such as query planning and columnar
  storage.
The intent of Spark SQL though is to be more than a SQL server --
 it's
meant to be a library for manipulating structured data. Since this
 is
possible to build over the core API, it's pretty natural to
 organize it
that way, same as Spark Streaming is a library.

 Matei

 On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com
  wrote:

 The context is that SchemaRDD is becoming a common data format
 used
   for
 bringing data into Spark from external systems, and used for
 various
 components of Spark, e.g. MLlib's new pipeline API.

 i agree. this to me also implies it belongs in spark core, not
 sql

 On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
 michaelma...@yahoo.com.invalid wrote:

 And in the off chance that anyone hasn't seen it yet, the Jan.
 13
  Bay
Area
 Spark Meetup YouTube contained a wealth of background
 information
  on
this
 idea (mostly from Patrick and Reynold :-).

 https://www.youtube.com/watch?v=YWppYPWznSQ

 
 From: Patrick Wendell pwend...@gmail.com
 To: Reynold Xin r...@databricks.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Sent: Monday, January 26, 2015 4:01 PM
 Subject: Re: renaming SchemaRDD - DataFrame


 One thing potentially not clear from this e-mail, there will be
 a
  1:1
 correspondence where you can get an RDD to/from a DataFrame.


 On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
 r...@databricks.com
wrote:
 Hi,

 We are considering renaming SchemaRDD - DataFrame in 1.3, and
   wanted
to
 get the community's opinion.

 The context is that SchemaRDD is becoming a common data format
  used
for
 bringing data into Spark from external systems, and used for
  various
 components of Spark, e.g. MLlib's new pipeline API. We also
 expect
more
 and
 more users to be programming directly against SchemaRDD API
 rather
than
 the
 core RDD API. SchemaRDD, through its less commonly used DSL
   originally
 designed for writing test cases, always has the data-frame like
  API.
In
 1.3, we are redesigning the API to make the API usable for end
   users.


 There are two 

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Reynold Xin
Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type,
and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
break source compatibility for Scala.


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
dirceu.semigh...@gmail.com wrote:

 Can't the SchemaRDD remain the same, but deprecated, and be removed in
 release 1.5 (+/- 1), for example, with the new code being added to DataFrame?
 With this, we don't impact existing code for the next few releases.



 2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com:

  I want to address the issue that Matei raised about the heavy lifting
  required for a full SQL support. It is amazing that even after 30 years
 of
  research there is not a single good open source columnar database like
  Vertica. There is a column store option in MySQL, but it is not nearly as
  sophisticated as Vertica or MonetDB. But there's a true need for such a
  system. I wonder why that is, and it's high time to change that.
  On Jan 26, 2015 5:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
   Both SchemaRDD and DataFrame sound fine to me, though I like the former
   slightly better because it's more descriptive.
  
   Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
 would
   be more clear from a user-facing perspective to at least choose a
 package
   name for it that omits sql.
  
   I would also be in favor of adding a separate Spark Schema module for
  Spark
   SQL to rely on, but I imagine that might be too large a change at this
   point?
  
   -Sandy
  
   On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
 matei.zaha...@gmail.com
   wrote:
  
(Actually when we designed Spark SQL we thought of giving it another
   name,
like Spark Schema, but we decided to stick with SQL since that was
 the
   most
obvious use case to many users.)
   
Matei
   
 On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
 matei.zaha...@gmail.com
wrote:

 While it might be possible to move this concept to Spark Core
   long-term,
supporting structured data efficiently does require quite a bit of
 the
infrastructure in Spark SQL, such as query planning and columnar
  storage.
The intent of Spark SQL though is to be more than a SQL server --
 it's
meant to be a library for manipulating structured data. Since this is
possible to build over the core API, it's pretty natural to organize
 it
that way, same as Spark Streaming is a library.

 Matei

 On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com
  wrote:

 The context is that SchemaRDD is becoming a common data format
 used
   for
 bringing data into Spark from external systems, and used for
 various
 components of Spark, e.g. MLlib's new pipeline API.

 i agree. this to me also implies it belongs in spark core, not sql

 On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
 michaelma...@yahoo.com.invalid wrote:

 And in the off chance that anyone hasn't seen it yet, the Jan. 13
  Bay
Area
 Spark Meetup YouTube contained a wealth of background information
  on
this
 idea (mostly from Patrick and Reynold :-).

 https://www.youtube.com/watch?v=YWppYPWznSQ

 
 From: Patrick Wendell pwend...@gmail.com
 To: Reynold Xin r...@databricks.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Sent: Monday, January 26, 2015 4:01 PM
 Subject: Re: renaming SchemaRDD - DataFrame


 One thing potentially not clear from this e-mail, there will be a
  1:1
 correspondence where you can get an RDD to/from a DataFrame.


 On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
 r...@databricks.com
wrote:
 Hi,

 We are considering renaming SchemaRDD - DataFrame in 1.3, and
   wanted
to
 get the community's opinion.

 The context is that SchemaRDD is becoming a common data format
  used
for
 bringing data into Spark from external systems, and used for
  various
 components of Spark, e.g. MLlib's new pipeline API. We also
 expect
more
 and
 more users to be programming directly against SchemaRDD API
 rather
than
 the
 core RDD API. SchemaRDD, through its less commonly used DSL
   originally
 designed for writing test cases, always has the data-frame like
  API.
In
 1.3, we are redesigning the API to make the API usable for end
   users.


 There are two motivations for the renaming:

 1. DataFrame seems to be a more self-evident name than
 SchemaRDD.

 2. SchemaRDD/DataFrame is actually not going to be an RDD
 anymore
(even
 though it would contain some RDD functions like map, flatMap,
  etc),
and
 calling it Schema*RDD* while it is not an RDD is highly
 

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Patrick Wendell
Yes - the key issue is just due to me creating new keys this time
around. Anyways let's take another stab at this. In the mean time,
please don't hesitate to test the release itself.

- Patrick

On Tue, Jan 27, 2015 at 10:00 AM, Sean Owen so...@cloudera.com wrote:
 Got it. Ignore the SHA512 issue since these aren't somehow expected by
 a policy or Maven to be in a certain format. Just wondered if the
 difference was intended.

 The Maven way of generating the SHA1 hashes is to set this on the
 install plugin, AFAIK, although I'm not sure if the intent was to hash
 files that Maven didn't create:

 <configuration>
   <createChecksum>true</createChecksum>
 </configuration>

 As for the key issue, I think it's just a matter of uploading the new
 key in both places.

 We should all of course test the release anyway.

 On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Sean,

 The release script generates hashes in two places (take a look a bit
 further down in the script), one for the published artifacts and the
 other for the binaries. In the case of the binaries we use SHA512
 because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
 better. In the case of the published Maven artifacts we use SHA1
 because my understanding is this is what Maven requires. However, it
 does appear that the format is now one that maven cannot parse.

 Anyways, it seems fine to just change the format of the hash per your PR.

 - Patrick





Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Koert Kuipers
that's great. guess i was looking at a somewhat stale master branch...

On Tue, Jan 27, 2015 at 2:19 PM, Reynold Xin r...@databricks.com wrote:

 Koert,

 As Mark said, I have already refactored the API so that nothing in
 catalyst is exposed (and users won't need it anyway). Data types and Row
 interfaces are both outside the catalyst package, in org.apache.spark.sql.

 On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote:

 hey matei,
 i think that stuff such as SchemaRDD, columnar storage and perhaps also
 query planning can be re-used by many systems that do analysis on
 structured data. i can imagine panda-like systems, but also datalog or
 scalding-like (which we use at tresata and i might rebase on SchemaRDD at
 some point). SchemaRDD should become the interface for all these. and
 columnar storage abstractions should be re-used between all these.

 currently the sql tie-in goes way beyond just the (perhaps unfortunate)
 naming convention. for example a core part of the SchemaRDD abstraction is
 Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
 who wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
 (if i understand it correctly, i have not used catalyst, but it looks
 neat). i should not need to include a SQL parser just to use structured
 data in say a panda-like framework.

 best, koert


 On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 While it might be possible to move this concept to Spark Core long-term,
 supporting structured data efficiently does require quite a bit of the
 infrastructure in Spark SQL, such as query planning and columnar storage.
 The intent of Spark SQL though is to be more than a SQL server -- it's
 meant to be a library for manipulating structured data. Since this is
 possible to build over the core API, it's pretty natural to organize it
 that way, same as Spark Streaming is a library.

 Matei

  On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API.
 
  i agree. this to me also implies it belongs in spark core, not sql
 
  On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
  michaelma...@yahoo.com.invalid wrote:
 
  And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
 Area
  Spark Meetup YouTube contained a wealth of background information on
 this
  idea (mostly from Patrick and Reynold :-).
 
  https://www.youtube.com/watch?v=YWppYPWznSQ
 
  
  From: Patrick Wendell pwend...@gmail.com
  To: Reynold Xin r...@databricks.com
  Cc: dev@spark.apache.org dev@spark.apache.org
  Sent: Monday, January 26, 2015 4:01 PM
  Subject: Re: renaming SchemaRDD - DataFrame
 
 
  One thing potentially not clear from this e-mail, there will be a 1:1
  correspondence where you can get an RDD to/from a DataFrame.
 
 
  On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
 wrote:
  Hi,
 
  We are considering renaming SchemaRDD - DataFrame in 1.3, and
 wanted to
  get the community's opinion.
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
  and
  more users to be programming directly against SchemaRDD API rather
 than
  the
  core RDD API. SchemaRDD, through its less commonly used DSL
 originally
  designed for writing test cases, always has the data-frame like API.
 In
  1.3, we are redesigning the API to make the API usable for end users.
 
 
  There are two motivations for the renaming:
 
  1. DataFrame seems to be a more self-evident name than SchemaRDD.
 
  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
  though it would contain some RDD functions like map, flatMap, etc),
 and
  calling it Schema*RDD* while it is not an RDD is highly confusing.
  Instead,
  DataFrame.rdd will return the underlying RDD for all RDD methods.
 
 
  My understanding is that very few users program directly against the
  SchemaRDD API at the moment, because they are not well documented.
  However,
  to maintain backward compatibility, we can create a type alias
 DataFrame
  that is still named SchemaRDD. This will maintain source
 compatibility
  for
  Scala. That said, we will have to update all existing materials to
 use
  DataFrame rather than SchemaRDD.
 
 
 






Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Dmitriy Lyubimov
It has been pretty evident for some time that's what it is, hasn't it?

Yes that's a better name IMO.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com wrote:

 Hi,

 We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted to
 get the community's opinion.

 The context is that SchemaRDD is becoming a common data format used for
 bringing data into Spark from external systems, and used for various
 components of Spark, e.g. MLlib's new pipeline API. We also expect more and
 more users to be programming directly against SchemaRDD API rather than the
 core RDD API. SchemaRDD, through its less commonly used DSL originally
 designed for writing test cases, always has the data-frame like API. In
 1.3, we are redesigning the API to make the API usable for end users.


 There are two motivations for the renaming:

 1. DataFrame seems to be a more self-evident name than SchemaRDD.

 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
 though it would contain some RDD functions like map, flatMap, etc), and
 calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
 DataFrame.rdd will return the underlying RDD for all RDD methods.


 My understanding is that very few users program directly against the
 SchemaRDD API at the moment, because they are not well documented. However,
 to maintain backward compatibility, we can create a type alias DataFrame
 that is still named SchemaRDD. This will maintain source compatibility for
 Scala. That said, we will have to update all existing materials to use
 DataFrame rather than SchemaRDD.
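
To make the 1:1 RDD/DataFrame correspondence mentioned in the proposal above
concrete, here is a minimal editorial sketch, assuming the Spark 1.3 API
(toDF() via sqlContext.implicits, and DataFrame.rdd returning an RDD[Row]);
it is an illustration, not code from the proposal:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameRoundTrip {
      case class Person(name: String, age: Int)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("df-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._                     // enables rdd.toDF() in 1.3

        val rdd = sc.parallelize(Seq(Person("ada", 36), Person("linus", 45)))
        val df = rdd.toDF()                               // RDD -> DataFrame
        val rows = df.rdd                                 // DataFrame -> RDD[Row]
        df.select("name").show()                          // data-frame-like API
        println(rows.count())
        sc.stop()
      }
    }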



Re: talk on interface design

2015-01-27 Thread Reynold Xin
Thanks, Andrew. That's great material.


On Mon, Jan 26, 2015 at 10:23 PM, Andrew Ash and...@andrewash.com wrote:

 In addition to the references you have at the end of the presentation,
 there's a great set of practical examples based on the learnings from Qt
 posted here: http://www21.in.tum.de/~blanchet/api-design.pdf

 Chapter 4's way of showing a principle and then an example from Qt is
 particularly instructional.

 On Tue, Jan 27, 2015 at 1:05 AM, Reynold Xin r...@databricks.com wrote:

 Hi all,

 In Spark, we have done reasonably well historically in interface and API
 design, especially compared with some other Big Data systems. However, we
 have also made mistakes along the way. I want to share a talk I gave about
 interface design at Databricks' internal retreat.

 https://speakerdeck.com/rxin/interface-design-for-spark-community

 Interface design is a vital part of Spark becoming a long-term
 sustainable,
 thriving framework. Good interfaces can be the project's biggest asset,
 while bad interfaces can be the worst technical debt. As the project
 scales
 bigger and bigger, the community is expanding and we are getting a wider
 range of contributors that have not thought about this as their everyday
 development experience outside Spark.

 It is part art, part science, and in some sense an acquired taste. However, I
 think there are common issues that can be spotted easily, and common
 principles that can address a lot of the low hanging fruits. Through this
 presentation, I hope to bring to everybody's attention the issue of
 interface design and encourage everybody to think hard about interface
 design in their contributions.





Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Matei Zaharia
The type alias means your methods can specify either type and they will work. 
It's just another name for the same type. But Scaladocs and such will show 
DataFrame as the type.

Matei
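
A tiny editorial sketch of the mechanism Matei describes (stand-in class
names, not the real org.apache.spark.sql contents): a type alias is just
another name for the same type, so methods declared against either name
accept both.

    object compat {
      class DataFrame { override def toString = "DataFrame(...)" }  // stand-in, illustrative only
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame                                     // the planned alias
    }

    object AliasDemo {
      import compat._

      // Declared against the old name, but a DataFrame is accepted unchanged.
      def describe(data: SchemaRDD): String = data.toString

      def main(args: Array[String]): Unit = {
        val df: DataFrame = new DataFrame
        println(describe(df))   // compiles and runs: SchemaRDD and DataFrame are the same type
      }
    }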

 On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:
 
 Reynold,
 But with type alias we will have the same problem, right?
  If the methods don't receive SchemaRDD anymore, we will have to change
  our code to migrate from SchemaRDD to DataFrame, unless we have an implicit
  conversion between DataFrame and SchemaRDD.
 
 
 
 2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:
 
 Dirceu,
 
 That is not possible because one cannot overload return types.
 
 SQLContext.parquetFile (and many other methods) needs to return some type,
 and that type cannot be both SchemaRDD and DataFrame.
 
 In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
 break source compatibility for Scala.
 
 
 On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:
 
  Can't the SchemaRDD remain the same, but deprecated, and be removed in
  release 1.5 (+/- 1), for example, with the new code being added to DataFrame?
  With this, we don't impact existing code for the next few releases.
 
 
 
 2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com:
 
 I want to address the issue that Matei raised about the heavy lifting
 required for a full SQL support. It is amazing that even after 30 years
 of
 research there is not a single good open source columnar database like
 Vertica. There is a column store option in MySQL, but it is not nearly
 as
 sophisticated as Vertica or MonetDB. But there's a true need for such a
  system. I wonder why that is, and it's high time to change that.
 On Jan 26, 2015 5:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 Both SchemaRDD and DataFrame sound fine to me, though I like the
 former
 slightly better because it's more descriptive.
 
 Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
 would
 be more clear from a user-facing perspective to at least choose a
 package
 name for it that omits sql.
 
 I would also be in favor of adding a separate Spark Schema module for
 Spark
 SQL to rely on, but I imagine that might be too large a change at this
 point?
 
 -Sandy
 
 On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
 matei.zaha...@gmail.com
 wrote:
 
 (Actually when we designed Spark SQL we thought of giving it another
 name,
 like Spark Schema, but we decided to stick with SQL since that was
 the
 most
 obvious use case to many users.)
 
 Matei
 
 On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
 matei.zaha...@gmail.com
 wrote:
 
 While it might be possible to move this concept to Spark Core
 long-term,
 supporting structured data efficiently does require quite a bit of
 the
 infrastructure in Spark SQL, such as query planning and columnar
 storage.
 The intent of Spark SQL though is to be more than a SQL server --
 it's
 meant to be a library for manipulating structured data. Since this
 is
 possible to build over the core API, it's pretty natural to
 organize it
 that way, same as Spark Streaming is a library.
 
 Matei
 
 On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com
 wrote:
 
 The context is that SchemaRDD is becoming a common data format
 used
 for
 bringing data into Spark from external systems, and used for
 various
 components of Spark, e.g. MLlib's new pipeline API.
 
 i agree. this to me also implies it belongs in spark core, not
 sql
 
 On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
 michaelma...@yahoo.com.invalid wrote:
 
 And in the off chance that anyone hasn't seen it yet, the Jan.
 13
 Bay
 Area
 Spark Meetup YouTube contained a wealth of background
 information
 on
 this
 idea (mostly from Patrick and Reynold :-).
 
 https://www.youtube.com/watch?v=YWppYPWznSQ
 
 
 From: Patrick Wendell pwend...@gmail.com
 To: Reynold Xin r...@databricks.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Sent: Monday, January 26, 2015 4:01 PM
 Subject: Re: renaming SchemaRDD - DataFrame
 
 
 One thing potentially not clear from this e-mail, there will be
 a
 1:1
 correspondence where you can get an RDD to/from a DataFrame.
 
 
 On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
 r...@databricks.com
 wrote:
 Hi,
 
 We are considering renaming SchemaRDD - DataFrame in 1.3, and
 wanted
 to
 get the community's opinion.
 
 The context is that SchemaRDD is becoming a common data format
 used
 for
 bringing data into Spark from external systems, and used for
 various
 components of Spark, e.g. MLlib's new pipeline API. We also
 expect
 more
 and
 more users to be programming directly against SchemaRDD API
 rather
 than
 the
 core RDD API. SchemaRDD, through its less commonly used DSL
 originally
 designed for writing test cases, always has the data-frame like
 API.
 In
 1.3, we are redesigning the API to make the API usable for end
 users.
 
 
 There are two motivations for the 

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Krishna Sankar
+1
1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, MLlib - running, as well as comparing results with 1.1.x and
1.2.0
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed : org.apache.spark.SparkException in zip !
2.5. rdd operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. recommendation OK

Cheers
k/

On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.2.1!

 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick





Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Reynold Xin
+1

Tested on Mac OS X

On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar ksanka...@gmail.com
wrote:

 +1
 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
 -Dhadoop.version=2.6.0 -Phive -DskipTests
 2. Tested pyspark, MLlib - running, as well as comparing results with 1.1.x and
 1.2.0
 2.1. statistics OK
 2.2. Linear/Ridge/Lasso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
Fixed : org.apache.spark.SparkException in zip !
 2.5. rdd operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
 2.6. recommendation OK

 Cheers
 k/

 On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Please vote on releasing the following candidate as Apache Spark version
  1.2.1!
 
  The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1061/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.2.1!
 
  The vote is open until Friday, January 30, at 07:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.2.1
  [ ] -1 Do not release this package because ...
 
  For a list of fixes in this release, see http://s.apache.org/Mpn.
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  - Patrick
 
 
 



Re: Friendly reminder/request to help with reviews!

2015-01-27 Thread Gurumurthy Yeleswarapu
Hi Patrick:
I would love to help with reviewing in any way I can. I'm fairly new here. Can you
help with a pointer to get me started?
Thanks
  From: Patrick Wendell pwend...@gmail.com
 To: dev@spark.apache.org dev@spark.apache.org 
 Sent: Tuesday, January 27, 2015 3:56 PM
 Subject: Friendly reminder/request to help with reviews!
   
Hey All,

Just a reminder, as always around release time we have a very large
volume of patches show up near the deadline.

One thing that can help us maximize the number of patches we get in is
to have community involvement in performing code reviews. And in
particular, doing a thorough review and signing off on a patch with
LGTM can substantially increase the odds we can merge a patch
confidently.

If you are newer to Spark, finding a single area of the codebase to
focus on can still provide a lot of value to the project in the
reviewing process.

Cheers and good luck with everyone on work for this release.

- Patrick




   

Friendly reminder/request to help with reviews!

2015-01-27 Thread Patrick Wendell
Hey All,

Just a reminder, as always around release time we have a very large
volume of patches show up near the deadline.

One thing that can help us maximize the number of patches we get in is
to have community involvement in performing code reviews. And in
particular, doing a thorough review and signing off on a patch with
LGTM can substantially increase the odds we can merge a patch
confidently.

If you are newer to Spark, finding a single area of the codebase to
focus on can still provide a lot of value to the project in the
reviewing process.

Cheers and good luck with everyone on work for this release.

- Patrick




Re: Use mvn to build Spark 1.2.0 failed

2015-01-27 Thread Sean Owen
You certainly do not need to build Spark as root. It might clumsily
overcome a permissions problem in your local env but probably causes other
problems.
On Jan 27, 2015 11:18 AM, angel__ angel.alvarez.pas...@gmail.com wrote:

 I had that problem when I tried to build Spark 1.2. I don't exactly know
 what
 is causing it, but I guess it might have something to do with user
 permissions.

 I could finally fix this by building Spark as root user (now I'm dealing
 with another problem, but ...that's another story...)



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.





Re: Use mvn to build Spark 1.2.0 failed

2015-01-27 Thread angel__
I had that problem when I tried to build Spark 1.2. I don't exactly know what
is causing it, but I guess it might have something to do with user
permissions.

I could finally fix this by building Spark as root user (now I'm dealing
with another problem, but ...that's another story...)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Evan R. Sparks
I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface could be abstract and the implementation (in this
case, the SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a Frame is.



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 (Actually when we designed Spark SQL we thought of giving it another name,
 like Spark Schema, but we decided to stick with SQL since that was the most
 obvious use case to many users.)

 Matei

  On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  While it might be possible to move this concept to Spark Core long-term,
 supporting structured data efficiently does require quite a bit of the
 infrastructure in Spark SQL, such as query planning and columnar storage.
 The intent of Spark SQL though is to be more than a SQL server -- it's
 meant to be a library for manipulating structured data. Since this is
 possible to build over the core API, it's pretty natural to organize it
 that way, same as Spark Streaming is a library.
 
  Matei
 
  On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
 
  The context is that SchemaRDD is becoming a common data format used for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API.
 
  i agree. this to me also implies it belongs in spark core, not sql
 
  On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
  michaelma...@yahoo.com.invalid wrote:
 
  And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
 Area
  Spark Meetup YouTube contained a wealth of background information on
 this
  idea (mostly from Patrick and Reynold :-).
 
  https://www.youtube.com/watch?v=YWppYPWznSQ
 
  
  From: Patrick Wendell pwend...@gmail.com
  To: Reynold Xin r...@databricks.com
  Cc: dev@spark.apache.org dev@spark.apache.org
  Sent: Monday, January 26, 2015 4:01 PM
  Subject: Re: renaming SchemaRDD - DataFrame
 
 
  One thing potentially not clear from this e-mail, there will be a 1:1
  correspondence where you can get an RDD to/from a DataFrame.
 
 
  On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
 wrote:
  Hi,
 
  We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted
 to
  get the community's opinion.
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
  and
  more users to be programming directly against SchemaRDD API rather
 than
  the
  core RDD API. SchemaRDD, through its less commonly used DSL originally
  designed for writing test cases, always has the data-frame like API.
 In
  1.3, we are redesigning the API to make the API usable for end users.
 
 
  There are two motivations for the renaming:
 
  1. DataFrame seems to be a more self-evident name than SchemaRDD.
 
  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
  though it would contain some RDD functions like map, flatMap, etc),
 and
  calling it Schema*RDD* while it is not an RDD is highly confusing.
  Instead,
  DataFrame.rdd will return the underlying RDD for all RDD methods.
 
 
  My understanding is that very few users program directly against the
  SchemaRDD API at the moment, because they are not well documented.
  However,
  to maintain backward compatibility, we can create a type alias
 DataFrame
  that is still named SchemaRDD. This will maintain source compatibility
  for
  Scala. That said, we will have to update all existing materials to use
  DataFrame rather than SchemaRDD.
 




Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean Owen
I think there are several signing / hash issues that should be fixed
before this release.

Hashes:

http://issues.apache.org/jira/browse/SPARK-5308
https://github.com/apache/spark/pull/4161

The hashes here are correct, but have two issues:

As noted in the JIRA, the format of the hash file is nonstandard -- at
least, it doesn't match what Maven outputs and what tools like Leiningen
apparently expect, which is just the hash with no file name or spaces.
There are two ways to fix that: use different command-line tools (see the
PR), or just ask Maven to generate these hashes (a different, easy PR).
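
For what it's worth, here is an editorial Scala sketch of producing a
checksum file in the bare format described above (just the hex digest, no
file name). The artifact path is illustrative, and the real release scripts
use command-line tools rather than anything like this:

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object BareHash {
      def main(args: Array[String]): Unit = {
        val artifact = Paths.get("spark-core_2.10-1.2.1.jar")    // hypothetical artifact
        val bytes = Files.readAllBytes(artifact)
        val digest = MessageDigest.getInstance("SHA-512").digest(bytes)
        val hex = digest.map("%02x".format(_)).mkString          // bare hash, no file name or spaces
        Files.write(Paths.get(artifact.toString + ".sha512"), hex.getBytes("UTF-8"))
        println(hex)
      }
    }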

However, is the script I modified above used to generate these hashes?
It's generating SHA1 sums, but the output in this release candidate
has (correct) SHA512 sums.

This may be more than a nuisance, since last time for some reason
Maven Central did not register the project hashes.

http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.2.0%7Cjar
does not show them but they exist:
http://www.us.apache.org/dist/spark/spark-1.2.0/

It may add up to a problem worth rooting out before this release.


Signing:

As noted in https://issues.apache.org/jira/browse/SPARK-5299 there are
two signing keys in
https://people.apache.org/keys/committer/pwendell.asc (9E4FE3AF,
00799F7E) but only one is in http://www.apache.org/dist/spark/KEYS

However, these artifacts seem to be signed by FC8ED089 which isn't in either.

Details details, but I'd say non-binding -1 at the moment.


On Tue, Jan 27, 2015 at 7:02 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!

 The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1061/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, January 30, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick




Re: Maximum size of vector that reduce can handle

2015-01-27 Thread Boromir Widas
I am running into this issue as well, when storing large Arrays as the
value in a kv pair
and then doing a reduceByKey.
Can one of the experts please comment if it would make sense to add an
operation to
add values in place like accumulators do - this would essentially merge the
vectors for
a given key in place, avoiding multiple allocations of temp array/vectors.
This should
be faster for datasets with frequently repeated keys.
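
A minimal editorial sketch of one way to get that effect with the existing
API (assuming fixed-length Array[Double] values): aggregateByKey already
permits its seqOp/combOp to mutate and return their first argument, so the
per-key sums can reuse one buffer instead of allocating temporaries.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.2

    object InPlaceVectorSum {
      // Adds v into acc and returns acc, so no temporary array is allocated.
      def addInPlace(acc: Array[Double], v: Array[Double]): Array[Double] = {
        var i = 0
        while (i < acc.length) { acc(i) += v(i); i += 1 }
        acc
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("inplace-sum").setMaster("local[*]"))
        val dim = 1000
        val data = sc.parallelize(1 to 1000).map(i => ("key" + (i % 10), Array.fill(dim)(1.0)))
        val summed = data.aggregateByKey(Array.fill(dim)(0.0))(addInPlace, addInPlace)
        summed.mapValues(_.sum).collect().foreach(println)
        sc.stop()
      }
    }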

On Tue, Jan 27, 2015 at 11:03 AM, Xiangrui Meng men...@gmail.com wrote:

 A 60M-element vector costs 480MB of memory (60M doubles x 8 bytes). You have
 12 of them to be reduced to the driver, so you need ~6GB of memory, not
 counting the temp vectors generated from '_+_'. You need to increase driver
 memory to make it work. That being said, ~10^7 hits the limit for the current
 impl of glm. -Xiangrui
 On Jan 23, 2015 2:19 PM, DB Tsai dbt...@dbtsai.com wrote:

  Hi Alexander,
 
  For `reduce`, it's an action that will collect all the data from the
  mappers to the driver and perform the aggregation in the driver. As a
  result, if the output from each mapper is very large and the number of
  partitions is large, it might cause a problem.
 
  For `treeReduce`, as the name indicates, the way it works is that in the
  first layer it aggregates the output of the mappers two by two, resulting
  in half the number of outputs. Then we continue the aggregation layer by
  layer. The final aggregation is still done in the driver, but by that time
  the amount of data is small.
 
  By default, depth 2 is used, so if you have many partitions of large
  vectors, this may still cause an issue. You can increase the depth to
  higher numbers so that in the final reduce in the driver, the number of
  partitions is very small.
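
An editorial sketch of that suggestion, following the vv.treeReduce(_ + _, 2)
usage from earlier in the thread (breeze vectors and mllib's RDDFunctions, as
in the original script; the depth value and sizes are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.rdd.RDDFunctions._   // brings treeReduce into scope on 1.2
    import breeze.linalg.DenseVector

    object TreeReduceDepth {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tree-reduce").setMaster("local[*]"))
        val dim = 100000                                 // stand-in for a long gradient vector
        val vv = sc.parallelize(1 to 64, 64).map(_ => DenseVector.fill(dim)(1.0))
        // depth 2 is the default; a larger depth adds aggregation layers on the executors,
        // so the final reduce in the driver sees only a handful of partial sums.
        val sum = vv.treeReduce(_ + _, 4)
        println(sum(0))                                  // expect 64.0
        sc.stop()
      }
    }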
 
  Sincerely,
 
  DB Tsai
  ---
  Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
 
  On Fri, Jan 23, 2015 at 12:07 PM, Ulanov, Alexander
  alexander.ula...@hp.com wrote:
   Hi DB Tsai,
  
   Thank you for your suggestion. Actually, I've started my experiments
  with treeReduce. Originally, I had vv.treeReduce(_ + _, 2) in my
 script
  exactly because MLlib optimizers are using it, as you pointed out with
  LBFGS. However, it leads to the same problems as reduce, but presumably
  not so directly. As far as I understand, treeReduce limits the number of
  communications between workers and master forcing workers to partially
  compute the reduce operation.
  
   Are you sure that driver will first collect all results (or all partial
  results in treeReduce) and ONLY then perform aggregation? If that is the
  problem, then how to force it to do aggregation after receiving each
  portion of data from Workers?
  
   Best regards, Alexander
  
   -Original Message-
   From: DB Tsai [mailto:dbt...@dbtsai.com]
   Sent: Friday, January 23, 2015 11:53 AM
   To: Ulanov, Alexander
   Cc: dev@spark.apache.org
   Subject: Re: Maximum size of vector that reduce can handle
  
   Hi Alexander,
  
   When you use `reduce` to aggregate the vectors, those will actually be
  pulled into driver, and merged over there. Obviously, it's not scaleable
  given you are doing deep neural networks which have so many coefficients.
  
   Please try treeReduce instead which is what we do in linear regression
  and logistic regression.
  
   See
 
 https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala
   for example.
  
    val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))(
      seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
        val l = localGradient.compute(features, label, bcW.value, grad)
        (grad, loss + l)
      },
      combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
        axpy(1.0, grad2, grad1)
        (grad1, loss1 + loss2)
      })
  
   Sincerely,
  
   DB Tsai
   ---
   Blog: https://www.dbtsai.com
   LinkedIn: https://www.linkedin.com/in/dbtsai
  
  
  
   On Fri, Jan 23, 2015 at 10:00 AM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
   Dear Spark developers,
  
   I am trying to measure the Spark reduce performance for big vectors.
 My
  motivation is related to machine learning gradient. Gradient is a vector
  that is computed on each worker and then all results need to be summed up
  and broadcasted back to workers. For example, present machine learning
  applications involve very long parameter vectors, for deep neural
 networks
  it can be up to 2Billions. So, I want to measure the time that is needed
  for this operation depending on the size of vector and number of
 workers. I
  wrote few lines of code that assume that Spark will distribute partitions
  among all available workers. I have 6-machine cluster (Xeon 3.3GHz 4
 cores,
  16GB RAM), each runs 2 Workers.
  
   import org.apache.spark.mllib.rdd.RDDFunctions._
   import breeze.linalg._
   import 

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Mark Hamstra
In master, Reynold has already taken care of moving Row
into org.apache.spark.sql; so, even though the implementation of Row (and
GenericRow et al.) is in Catalyst (which is more optimizer than parser),
that needn't be of concern to users of the API in its most recent state.

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers ko...@tresata.com wrote:

 hey matei,
 i think that stuff such as SchemaRDD, columar storage and perhaps also
 query planning can be re-used by many systems that do analysis on
 structured data. i can imagine panda-like systems, but also datalog or
 scalding-like (which we use at tresata and i might rebase on SchemaRDD at
 some point). SchemaRDD should become the interface for all these. and
 columnar storage abstractions should be re-used between all these.

  currently the sql tie-in is way beyond just the (perhaps unfortunate)
  naming convention. for example, a core part of the SchemaRDD abstraction is
  Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
  that wants to build on top of SchemaRDD to dig into catalyst, a SQL parser
  (if i understand it correctly, i have not used catalyst, but it looks
  neat). i should not need to include a SQL parser just to use structured
  data in, say, a panda-like framework.

 best, koert


 On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

  While it might be possible to move this concept to Spark Core long-term,
  supporting structured data efficiently does require quite a bit of the
  infrastructure in Spark SQL, such as query planning and columnar storage.
  The intent of Spark SQL though is to be more than a SQL server -- it's
  meant to be a library for manipulating structured data. Since this is
  possible to build over the core API, it's pretty natural to organize it
  that way, same as Spark Streaming is a library.
 
  Matei
 
   On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
  
   The context is that SchemaRDD is becoming a common data format used
 for
   bringing data into Spark from external systems, and used for various
   components of Spark, e.g. MLlib's new pipeline API.
  
   i agree. this to me also implies it belongs in spark core, not sql
  
   On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
   michaelma...@yahoo.com.invalid wrote:
  
   And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
  Area
   Spark Meetup YouTube contained a wealth of background information on
  this
   idea (mostly from Patrick and Reynold :-).
  
   https://www.youtube.com/watch?v=YWppYPWznSQ
  
   
   From: Patrick Wendell pwend...@gmail.com
   To: Reynold Xin r...@databricks.com
   Cc: dev@spark.apache.org dev@spark.apache.org
   Sent: Monday, January 26, 2015 4:01 PM
   Subject: Re: renaming SchemaRDD - DataFrame
  
  
   One thing potentially not clear from this e-mail: there will be a 1:1
   correspondence where you can get an RDD to/from a DataFrame.
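
   (A hedged sketch of that round trip: DataFrame.rdd is what the thread
   describes, while the createDataFrame constructor and the types package
   names are assumptions about the not-yet-released 1.3 API:)

     import org.apache.spark.sql.{Row, SQLContext}
     import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

     val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext, e.g. from spark-shell

     val schema = StructType(Seq(
       StructField("id", IntegerType, nullable = false),
       StructField("name", StringType, nullable = true)))

     // RDD[Row] -> DataFrame (constructor name assumed)
     val rowRdd = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
     val df = sqlContext.createDataFrame(rowRdd, schema)

     // DataFrame -> RDD[Row]: the full RDD API stays one call away
     val ids = df.rdd.map(_.getInt(0)).collect()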
  
  
   On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
  wrote:
   Hi,
  
   We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted
  to
   get the community's opinion.
  
   The context is that SchemaRDD is becoming a common data format used
 for
   bringing data into Spark from external systems, and used for various
   components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
   and
   more users to be programming directly against SchemaRDD API rather
 than
   the
    core RDD API. SchemaRDD, through its less commonly used DSL (originally
    designed for writing test cases), has always had a data-frame-like API. In
    1.3, we are redesigning that API to make it usable for end users.
  
  
   There are two motivations for the renaming:
  
   1. DataFrame seems to be a more self-evident name than SchemaRDD.
  
   2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
   though it would contain some RDD functions like map, flatMap, etc),
 and
   calling it Schema*RDD* while it is not an RDD is highly confusing.
    Instead,
   DataFrame.rdd will return the underlying RDD for all RDD methods.
  
  
    My understanding is that very few users program directly against the
    SchemaRDD API at the moment, because it is not well documented. However,
    to maintain backward compatibility, we can create a type alias for
    DataFrame that is still named SchemaRDD. This will maintain source
    compatibility for Scala. That said, we will have to update all existing
    materials to use DataFrame rather than SchemaRDD.
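
    (A rough sketch of what such a source-compatibility alias could look like;
    the package-object location and the deprecation message are assumptions:)

      package org.apache.spark

      package object sql {
        // Source-compatibility shim: code written against SchemaRDD keeps
        // compiling, but the underlying type is now DataFrame.
        @deprecated("Use DataFrame instead.", "1.3.0")
        type SchemaRDD = DataFrame
      }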
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 
 



Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Michael Malak
I personally have no preference between DataFrame and DataTable, but only wish to lay 
out the history and etymology, simply because I'm into that sort of thing.

Frame comes from Marvin Minsky's 1970s AI construct: slots and the data 
that go in them. The S programming language (a precursor to R) adopted this 
terminology in 1991. R, of course, became popular with the rise of Data Science 
around 2012.
http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22cmpt=qtz=

DataFrame would carry the implication that it comes along with its own 
metadata, whereas DataTable might carry the implication that metadata is 
stored in a central metadata repository.

DataFrame is thus technically more correct for SchemaRDD, but it is a less 
familiar (and less accessible) term for those not immersed in data science 
or AI, and so may have narrower appeal.


- Original Message -
From: Evan R. Sparks evan.spa...@gmail.com
To: Matei Zaharia matei.zaha...@gmail.com
Cc: Koert Kuipers ko...@tresata.com; Michael Malak michaelma...@yahoo.com; 
Patrick Wendell pwend...@gmail.com; Reynold Xin r...@databricks.com; 
dev@spark.apache.org dev@spark.apache.org
Sent: Tuesday, January 27, 2015 9:55 AM
Subject: Re: renaming SchemaRDD - DataFrame

I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface could be abstract and the implementation (in this
case, the SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a Frame is.



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 (Actually when we designed Spark SQL we thought of giving it another name,
 like Spark Schema, but we decided to stick with SQL since that was the most
 obvious use case to many users.)

 Matei

  On Jan 26, 2015, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
  While it might be possible to move this concept to Spark Core long-term,
 supporting structured data efficiently does require quite a bit of the
 infrastructure in Spark SQL, such as query planning and columnar storage.
 The intent of Spark SQL though is to be more than a SQL server -- it's
 meant to be a library for manipulating structured data. Since this is
 possible to build over the core API, it's pretty natural to organize it
 that way, same as Spark Streaming is a library.
 
  Matei
 
  On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com wrote:
 
  The context is that SchemaRDD is becoming a common data format used for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API.
 
  i agree. this to me also implies it belongs in spark core, not sql
 
  On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
  michaelma...@yahoo.com.invalid wrote:
 
  And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
 Area
  Spark Meetup YouTube contained a wealth of background information on
 this
  idea (mostly from Patrick and Reynold :-).
 
  https://www.youtube.com/watch?v=YWppYPWznSQ
 
  
  From: Patrick Wendell pwend...@gmail.com
  To: Reynold Xin r...@databricks.com
  Cc: dev@spark.apache.org dev@spark.apache.org
  Sent: Monday, January 26, 2015 4:01 PM
  Subject: Re: renaming SchemaRDD - DataFrame
 
 
  One thing potentially not clear from this e-mail: there will be a 1:1
  correspondence where you can get an RDD to/from a DataFrame.
 
 
  On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin r...@databricks.com
 wrote:
  Hi,
 
  We are considering renaming SchemaRDD - DataFrame in 1.3, and wanted
 to
  get the community's opinion.
 
  The context is that SchemaRDD is becoming a common data format used
 for
  bringing data into Spark from external systems, and used for various
  components of Spark, e.g. MLlib's new pipeline API. We also expect
 more
  and
  more users to be programming directly against SchemaRDD API rather
 than
  the
   core RDD API. SchemaRDD, through its less commonly used DSL (originally
   designed for writing test cases), has always had a data-frame-like API. In
   1.3, we are redesigning that API to make it usable for end users.
 
 
  There are two motivations for the renaming:
 
  1. DataFrame seems to be a more self-evident name than SchemaRDD.
 
  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
 (even
  though it would contain some RDD functions like map, flatMap, etc),
 and
  calling it Schema*RDD* while it is not an RDD is highly confusing.
   Instead,
  DataFrame.rdd will return the underlying RDD for all RDD methods.
 
 
  My understanding is that very few users program directly against the
   SchemaRDD API at the moment, because it is not well documented. However,
   to maintain backward compatibility, we can 

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Sean Owen
Got it. Ignore the SHA512 issue since these aren't somehow expected by
a policy or Maven to be in a certain format. Just wondered if the
difference was intended.

The Maven way of generating the SHA1 hashes is to set this on the
install plugin, AFAIK, although I'm not sure whether the intent was to hash
files that Maven didn't create:

<configuration>
  <createChecksum>true</createChecksum>
</configuration>

As for the key issue, I think it's just a matter of uploading the new
key in both places.

We should all of course test the release anyway.

On Tue, Jan 27, 2015 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Sean,

 The release script generates hashes in two places (take a look a bit
 further down in the script), one for the published artifacts and the
 other for the binaries. In the case of the binaries we use SHA512
 because, AFAIK, the ASF does not require you to use SHA1 and SHA512 is
 better. In the case of the published Maven artifacts we use SHA1
  because my understanding is that this is what Maven requires. However, it
  does appear that the hashes are now in a format that Maven cannot parse.

 Anyways, it seems fine to just change the format of the hash per your PR.
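
  (For reference, a hedged Scala sketch of producing the bare lowercase-hex
  digest that Maven-style .sha1 files conventionally contain; this is
  illustrative only and may not match what the release script or the PR
  actually does:)

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object Sha1File {
      def main(args: Array[String]): Unit = {
        val path   = Paths.get(args(0))
        val digest = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(path))
        val hex    = digest.map(b => f"${b & 0xff}%02x").mkString
        // Write just the hex string into <artifact>.sha1 next to the artifact.
        Files.write(Paths.get(args(0) + ".sha1"), hex.getBytes("UTF-8"))
        println(s"$hex  ${path.getFileName}")
      }
    }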

 - Patrick


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org