Re: Tungsten gives unexpected results when selecting null elements in array

2015-12-21 Thread PierreB
For info, this is the generated code:

GeneratedExpressionCode(
  cursor8 = 16;
  convertedStruct6.pointTo(buffer7, Platform.BYTE_ARRAY_OFFSET, 1, cursor8);

  /* input[0, ArrayType(StringType,true)][0] */
  /* input[0, ArrayType(StringType,true)] */

  boolean isNull2 = i.isNullAt(0);
  ArrayData primitive3 = isNull2 ? null : (i.getArray(0));

  boolean isNull0 = isNull2;
  UTF8String primitive1 = null;
  if (!isNull0) {
    /* 0 */
    if (!false) {
      final int index = (int) 0;
      if (index >= primitive3.numElements() || index < 0) {
        isNull0 = true;
      } else {
        primitive1 = primitive3.getUTF8String(index);
      }
    } else {
      isNull0 = true;
    }
  }

  int numBytes10 = cursor8 + (isNull0 ? 0 :
    org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(primitive1));
  if (buffer7.length < numBytes10) {
    // This will not happen frequently, because the buffer is re-used.
    byte[] tmpBuffer9 = new byte[numBytes10 * 2];
    Platform.copyMemory(buffer7, Platform.BYTE_ARRAY_OFFSET,
      tmpBuffer9, Platform.BYTE_ARRAY_OFFSET, buffer7.length);
    buffer7 = tmpBuffer9;
  }
  convertedStruct6.pointTo(buffer7, Platform.BYTE_ARRAY_OFFSET, 1, numBytes10);

  if (isNull0) {
    convertedStruct6.setNullAt(0);
  } else {
    cursor8 +=
      org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.write(convertedStruct6,
        0, cursor8, primitive1);
  }

  ,false,convertedStruct6)


The culprit line is the following:

  int numBytes10 = cursor8 + (isNull0 ? 0 :
    org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(primitive1));
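
For context, a minimal sketch of the kind of query that exercises this path (an assumed reproducer, not the exact code from the original report):

```scala
// Assumed reproducer sketch: an array column containing a null element, from
// which we select the null slot. App/class names here are made up for illustration.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NullArrayElementRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // arr(1) is null; selecting arr[1] should yield null rather than a garbage value.
    val df = sc.parallelize(Seq(Tuple1(Seq[String]("a", null, "c")))).toDF("arr")
    df.selectExpr("arr[1]").show()

    sc.stop()
  }
}
```

Note that getSize(primitive1) is reached whenever the array itself is non-null, even when the selected element is null — which is what the follow-up message below proposes to guard against.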








Re: Tungsten gives unexpected results when selecting null elements in array

2015-12-21 Thread PierreB
I believe the problem is that the generated code does not check if the
selected item in the array is null. Naïvely, I think changing this line
would solve it:

https://github.com/apache/spark/blob/4af647c77ded6a0d3087ceafb2e30e01d97e7a06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L219

Replace:

if (index >= $eval1.numElements() || index < 0) {

by:

if (index >= $eval1.numElements() || index < 0 || ctx.getValue(eval1, dataType, "index") == null) {

Does that make sense?
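
For comparison, a sketch of how the generated guard could also cover element-level nulls via ArrayData.isNullAt — one possible formulation, not necessarily the patch that was ultimately merged; the parameter names below are placeholders, not Spark's real codegen identifiers:

```scala
// Sketch of a guarded extractor template. ArrayData.isNullAt(ordinal) is a real
// runtime method; how this snippet would be spliced into codegen is left out,
// and the variable names are purely illustrative.
object GetArrayItemGuardSketch {
  def guardedSnippet(arrayVar: String, ordinalExpr: String,
                     isNullVar: String, valueVar: String, getValueCode: String): String =
    s"""
       |final int index = (int) $ordinalExpr;
       |if (index >= $arrayVar.numElements() || index < 0 || $arrayVar.isNullAt(index)) {
       |  $isNullVar = true;
       |} else {
       |  $valueVar = $getValueCode;
       |}
     """.stripMargin
}
```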




Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Zhan Zhang
This looks to me like a very unusual use case: you stop the SparkContext and
start another one. I don’t think that is well supported. As the SparkContext is
stopped, all the resources are supposed to be released.

Is there any mandatory reason you have to stop and restart the SparkContext?

Thanks.

Zhan Zhang

Note that when sc is stopped, all resources are released (for example in yarn 
On Dec 20, 2015, at 2:59 PM, Jerry Lam  wrote:

> Hi Spark developers,
> 
> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave 
> correctly when a different spark context is provided.
> 
> ```
> val sc = new SparkContext
> val sqlContext =SQLContext.getOrCreate(sc)
> sc.stop
> ...
> 
> val sc2 = new SparkContext
> val sqlContext2 = SQLContext.getOrCreate(sc2)
> sc2.stop
> ```
> 
> The sqlContext2 will reference sc instead of sc2 and therefore, the program 
> will not work because sc has been stopped. 
> 
> Best Regards,
> 
> Jerry 
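
As a minimal workaround sketch (assuming only the public 1.x constructors; not something proposed in this thread), the second SQLContext can be constructed directly so that it is bound to the new SparkContext:

```scala
// Workaround sketch using only public 1.x constructors; app/master settings are
// placeholders. Bypassing getOrCreate for the second context binds it to sc2
// rather than to the cached instance tied to the stopped sc.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object GetOrCreateWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("first").setMaster("local[1]"))
    val sqlContext = SQLContext.getOrCreate(sc)
    sc.stop()

    val sc2 = new SparkContext(new SparkConf().setAppName("second").setMaster("local[1]"))
    val sqlContext2 = new SQLContext(sc2)  // explicitly tied to sc2
    sqlContext2.range(5).show()
    sc2.stop()
  }
}
```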





Re: Expression/LogicalPlan dichotomy in Spark SQL Catalyst

2015-12-21 Thread Michael Armbrust
>
> Why was the choice made in Catalyst to make LogicalPlan/QueryPlan and
> Expression separate subclasses of TreeNode, instead of e.g. also make
> QueryPlan inherit from Expression?
>
I think this is a pretty common way to model things (glancing at postgres, it
looks similar).  Expressions and plans are pretty different concepts.  An
expression can be evaluated on a single input row and returns a single
value.  In contrast, a query plan operates on a relation and has a schema
with many different atomic values.
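
As a rough illustration of that distinction (a toy sketch, not Catalyst's actual classes, which carry much more state):

```scala
// Toy hierarchy only: an expression maps one input row to one value, while a
// logical plan describes a relation with a multi-column schema and child plans.
trait TreeNodeSketch

trait ExpressionSketch extends TreeNodeSketch {
  def nullable: Boolean
  def eval(inputRow: Seq[Any]): Any         // one row in, one atomic value out
}

trait LogicalPlanSketch extends TreeNodeSketch {
  def output: Seq[(String, String)]         // the relation's schema: (column name, type)
  def children: Seq[LogicalPlanSketch]      // operates on whole child relations
}
```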


> The code also contains duplicate functionality, like
> LeafNode/LeafExpression, UnaryNode/UnaryExpression and
> BinaryNode/BinaryExpression.


These traits actually have different semantics for expressions vs. plans
(i.e. a UnaryExpression's nullability is based on its child's nullability,
whereas this would not make sense for a UnaryNode, which has no concept of
nullability).


> this makes whole-tree transformations really cumbersome since we've got to
> deal with 'pivot points' for these 2 types of TreeNodes, where a recursive
> transformation can only be done on 1 specific type of children, and then
> has to be dealt with again within the same PartialFunction for the other
> type in which the matching case(s) can be nested.


It is not clear to me that you actually want these transformations to
happen seamlessly.  For example, the resolution rules for subqueries are
different from those for normal plans because you have to reason about
correlation.  That said, it seems like you should be able to do some magic in
RuleExecutor to make sure that things like the optimizer descend seamlessly
into nested query plans.


pyspark streaming 1.6 mapWithState?

2015-12-21 Thread Renyi Xiong
Hi TD,

I noticed mapWithState was available in spark 1.6. Is there any plan to
enable it in pyspark as well?

thanks,
Renyi.
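
For reference, the 1.6 Scala API looks roughly like the sketch below (a minimal word-count-style example; the socket source, port and checkpoint path are placeholders). A pyspark binding would presumably wrap the same StateSpec/mapWithState machinery.

```scala
// Minimal mapWithState sketch against the 1.6 Scala streaming API.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapWithState-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/mapWithState-checkpoint")  // stateful ops require checkpointing

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    // Keep a running count per word in State[Int] across batches.
    val spec = StateSpec.function((word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    })
    pairs.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```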


Expression/LogicalPlan dichotomy in Spark SQL Catalyst

2015-12-21 Thread Roland Reumerman
[Note: this question has been moved from the Conversation in

[SPARK-4226][SQL]Add subquery (not) in/exists support #9055

to the dev mailing list.]


We've added our own In/Exists support - plus subqueries in Select - to a partial
fork of Spark SQL Catalyst (which we use in transformations from our own query
language to SQL for relational databases). But since In, Exists and Select
projections are Expressions that then contain LogicalPlans (Subquery/Select with
nested LogicalPlans and potentially nested Expressions), whole-tree
transformations become really cumbersome: we have to deal with 'pivot points'
for these two types of TreeNodes, where a recursive transformation can only be
done on one specific type of children and then has to be dealt with again,
within the same PartialFunction, for the other type in which the matching
case(s) can be nested. Why was the choice made in Catalyst to make
LogicalPlan/QueryPlan and Expression separate subclasses of TreeNode, instead
of, e.g., also making QueryPlan inherit from Expression? The code also contains
duplicate functionality, like LeafNode/LeafExpression, UnaryNode/UnaryExpression
and BinaryNode/BinaryExpression.


Much obliged,

Roland Reumerman

mendix.com
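
To make the 'pivot points' above concrete, a toy, self-contained sketch (none of these are Catalyst classes): a whole-tree rewrite has to hop between a plan-level and an expression-level recursion every time an expression embeds a plan, as an In/Exists subquery does.

```scala
// Toy model: expressions can embed plans (a subquery), so one rewrite needs two
// mutually recursive traversals that hand off to each other at each pivot point.
sealed trait Expr
case class Attr(name: String) extends Expr
case class InSubquery(value: Expr, subplan: Plan) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan

object PivotPointRewrite {
  // Rename every Attr in the whole tree, descending through subquery pivot points.
  def renamePlan(plan: Plan, rename: String => String): Plan = plan match {
    case Relation(n)   => Relation(n)
    case Filter(c, ch) => Filter(renameExpr(c, rename), renamePlan(ch, rename))
  }

  def renameExpr(e: Expr, rename: String => String): Expr = e match {
    case Attr(n)            => Attr(rename(n))
    case InSubquery(v, sub) => InSubquery(renameExpr(v, rename), renamePlan(sub, rename))
  }

  // e.g. renamePlan(Filter(InSubquery(Attr("x"), Relation("t")), Relation("s")), _.toUpperCase)
}
```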



Re: Tungsten gives unexpected results when selecting null elements in array

2015-12-21 Thread Reynold Xin
Thanks for the email. Do you mind creating a JIRA ticket and replying with a
link to the ticket?

On Mon, Dec 21, 2015 at 1:12 PM, PierreB <
pierre.borckm...@realimpactanalytics.com> wrote:

> I believe the problem is that the generated code does not check if the
> selected item in the array is null. Naïvely, I think changing this line
> would solve this:
> https://github.com/apache/spark/blob/4af647c77ded6a0d3087ceafb2e30e01d97e7a06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L219
> Replace:
>
> if (index >= $eval1.numElements() || index < 0) {
>
> By:
>
> if (index >= $eval1.numElements() || index < 0 || ctx.getValue(eval1, 
> dataType, "index") == null ) {
>
> Does that make sense?
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-21 Thread Michael Armbrust
It's come to my attention that there have been several bug fixes merged
since RC3:

  - SPARK-12404 - Fix serialization error for Datasets with
Timestamps/Arrays/Decimal
  - SPARK-12218 - Fix incorrect pushdown of filters to parquet
  - SPARK-12395 - Fix join columns of outer join for DataFrame using
  - SPARK-12413 - Fix mesos HA

Normally, these would probably not be sufficient to hold the release;
however, with the holidays going on in the US this week, we don't have the
resources to finalize 1.6 until next Monday.  Given this delay anyway, I
propose that we cut one final RC with the above fixes and plan for the
actual release first thing next week.

I'll post RC4 shortly and cancel this vote if there are no objections.
Since this vote nearly passed with no major issues, I don't anticipate any
problems with RC4.

Michael

On Sat, Dec 19, 2015 at 11:44 PM, Jeff Zhang  wrote:

> +1 (non-binding)
>
> All the test passed, and run it on HDP 2.3.2 sandbox successfully.
>
> On Sun, Dec 20, 2015 at 10:43 AM, Luciano Resende 
> wrote:
>
>> +1 (non-binding)
>>
>> Tested Standalone mode, SparkR and couple Stream Apps, all seem ok.
>>
>> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v1.6.0-rc3
>>> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1174/
>>>
>>> The test repository (versioned as v1.6.0-rc3) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1173/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>>>
>>> ========================================
>>> == How can I help test this release? ==
>>> ========================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> =================================================
>>> == What justifies a -1 vote for this release? ==
>>> =================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ================================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ================================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ===================================================
>>> == Major changes to help you focus your testing ==
>>> ===================================================
>>>
>>> Notable changes since 1.6 RC2
>>> - SPARK_VERSION has been set correctly
>>> - SPARK-12199 ML Docs are publishing correctly
>>> - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>>    - SPARK-2629: trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>    - SPARK-12165 / SPARK-12189: Fix bugs in eviction of storage memory
>>>      by execution.
>>>    - SPARK-12258: correct passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>>
>>> Spark SQL
>>>
>>>    - SPARK-11787: Parquet Performance - Improve Parquet scan performance
>>>      when using flat schemas.
>>>    - SPARK-10810

Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Jerry Lam
Hi Zhan,

I'm illustrating the issue via a simple example. However, it is not
difficult to imagine use cases that need this behaviour. For example, you
might want to release all of Spark's resources when it has not been used for
longer than an hour in a job server or web service. Unless you can prevent
people from stopping the Spark context, it is reasonable to assume that
people can stop it and start it again at a later time.

Best Regards,

Jerry


On Mon, Dec 21, 2015 at 7:20 PM, Zhan Zhang  wrote:

> This looks to me is a very unusual use case. You stop the SparkContext,
> and start another one. I don’t think it is well supported. As the
> SparkContext is stopped, all the resources are supposed to be released.
>
> Is there any mandatory reason you have stop and restart another
> SparkContext.
>
> Thanks.
>
> Zhan Zhang
>
> Note that when sc is stopped, all resources are released (for example in
> yarn
> On Dec 20, 2015, at 2:59 PM, Jerry Lam  wrote:
>
> > Hi Spark developers,
> >
> > I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
> >
> > ```
> > val sc = new SparkContext
> > val sqlContext =SQLContext.getOrCreate(sc)
> > sc.stop
> > ...
> >
> > val sc2 = new SparkContext
> > val sqlContext2 = SQLContext.getOrCreate(sc2)
> > sc2.stop
> > ```
> >
> > The sqlContext2 will reference sc instead of sc2 and therefore, the
> program will not work because sc has been stopped.
> >
> > Best Regards,
> >
> > Jerry
>
>


Re: A proposal for Spark 2.0

2015-12-21 Thread Reynold Xin
FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.

On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin  wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature requests in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has reached end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>


Re: A proposal for Spark 2.0

2015-12-21 Thread Reynold Xin
I'm not sure if we need special API support for GPUs. You can already use
GPUs on individual executor nodes to build your own applications. If we
want to leverage GPUs out of the box, I don't think the solution is to
provide GPU-specific APIs. Rather, we should just switch the underlying
execution to GPUs when it is more optimal.

Anyway, I don't want to distract from this topic. If you want to discuss
GPUs further, please start a new thread.


On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang  wrote:

> plus dev
>
>
>
>
>
>
> On 2015-12-22 15:15:59, "Allen Zhang" wrote:
>
> Hi Reynold,
>
> Any new API support for GPU computing in our 2.0 new version ?
>
> -Allen
>
>
>
>
> On 2015-12-22 14:12:50, "Reynold Xin" wrote:
>
> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
>


Re: A proposal for Spark 2.0

2015-12-21 Thread Allen Zhang


Thanks for your quick response. OK, I will start a new thread with my thoughts.


Thanks,
Allen





At 2015-12-22 15:19:49, "Reynold Xin"  wrote:

I'm not sure if we need special API support for GPUs. You can already use GPUs 
on individual executor nodes to build your own applications. If we want to 
leverage GPUs out of the box, I don't think the solution is to provide GPU 
specific APIs. Rather, we should just switch the underlying execution to GPUs 
when it is more optimal.


Anyway, I don't want to distract this topic, If you want to discuss more about 
GPUs, please start a new thread.




On Mon, Dec 21, 2015 at 11:18 PM, Allen Zhang  wrote:

plus dev







On 2015-12-22 15:15:59, "Allen Zhang" wrote:

Hi Reynold,


Any new API support for GPU computing in our 2.0 new version ?


-Allen





On 2015-12-22 14:12:50, "Reynold Xin" wrote:

FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 




Re: A proposal for Spark 2.0

2015-12-21 Thread Allen Zhang
plus dev






On 2015-12-22 15:15:59, "Allen Zhang" wrote:

Hi Reynold,


Any new API support for GPU computing in our 2.0 new version ?


-Allen





On 2015-12-22 14:12:50, "Reynold Xin" wrote:

FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT. 



Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Ted Yu
In Jerry's example, the first SparkContext, sc, has been stopped.

So there would be only one SparkContext running at any given moment.

Cheers

On Mon, Dec 21, 2015 at 8:23 AM, Chester @work 
wrote:

> Jerry
> I thought you should not create more than one SparkContext within one
> Jvm, ...
> Chester
>
> Sent from my iPhone
>
> > On Dec 20, 2015, at 2:59 PM, Jerry Lam  wrote:
> >
> > Hi Spark developers,
> >
> > I found that SQLContext.getOrCreate(sc: SparkContext) does not behave
> correctly when a different spark context is provided.
> >
> > ```
> > val sc = new SparkContext
> > val sqlContext =SQLContext.getOrCreate(sc)
> > sc.stop
> > ...
> >
> > val sc2 = new SparkContext
> > val sqlContext2 = SQLContext.getOrCreate(sc2)
> > sc2.stop
> > ```
> >
> > The sqlContext2 will reference sc instead of sc2 and therefore, the
> program will not work because sc has been stopped.
> >
> > Best Regards,
> >
> > Jerry
>
>
>


Re: [Spark SQL] SQLContext getOrCreate incorrect behaviour

2015-12-21 Thread Chester @work
Jerry
I thought you should not create more than one SparkContext within one JVM,
...
Chester

Sent from my iPhone

> On Dec 20, 2015, at 2:59 PM, Jerry Lam  wrote:
> 
> Hi Spark developers,
> 
> I found that SQLContext.getOrCreate(sc: SparkContext) does not behave 
> correctly when a different spark context is provided.
> 
> ```
> val sc = new SparkContext
> val sqlContext =SQLContext.getOrCreate(sc)
> sc.stop
> ...
> 
> val sc2 = new SparkContext
> val sqlContext2 = SQLContext.getOrCreate(sc2)
> sc2.stop
> ```
> 
> The sqlContext2 will reference sc instead of sc2 and therefore, the program 
> will not work because sc has been stopped. 
> 
> Best Regards,
> 
> Jerry 
