[jira] [Closed] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8687. Resolution: Fixed Assignee: SaintBacchus Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8783) CTAS with WITH clause does not work
Keuntae Park created SPARK-8783: --- Summary: CTAS with WITH clause does not work Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8769) toLocalIterator should mention it results in many jobs
[ https://issues.apache.org/jira/browse/SPARK-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8769. Resolution: Fixed Assignee: holdenk Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 toLocalIterator should mention it results in many jobs -- Key: SPARK-8769 URL: https://issues.apache.org/jira/browse/SPARK-8769 Project: Spark Issue Type: Documentation Components: Documentation Reporter: holdenk Assignee: holdenk Priority: Trivial Fix For: 1.5.0, 1.4.2 toLocalIterator on RDDs should mention that it results in multiple jobs, and that if the input was the result of a wide transformation, it should be cached to avoid re-computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
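For illustration, a minimal Scala sketch of the documented advice (the object name and data are mine, not from the ticket): cache the result of a wide transformation before calling toLocalIterator, because toLocalIterator launches one job per partition and an uncached input would be recomputed for each of them.
{code:title=scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ToLocalIteratorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toLocalIterator-demo").setMaster("local[2]"))
    // Result of a wide transformation (reduceByKey) that we want to iterate over locally.
    val counts = sc.parallelize(1 to 1000).map(x => (x % 10, 1)).reduceByKey(_ + _)
    // Cache first: toLocalIterator runs one job per partition, so an uncached
    // shuffle result would otherwise be recomputed for every partition.
    counts.persist(StorageLevel.MEMORY_ONLY)
    counts.toLocalIterator.foreach(println)
    sc.stop()
  }
}
{code}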
[jira] [Commented] (SPARK-8706) Implement Pylint / Prospector checks for PySpark
[ https://issues.apache.org/jira/browse/SPARK-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611611#comment-14611611 ] Manoj Kumar commented on SPARK-8706: Sorry for sounding dumb, but the present code downloads pep8 as a script. However it seems that pylint is a repo, which again has two dependencies. What is the preferred way to do this in Spark? Implement Pylint / Prospector checks for PySpark Key: SPARK-8706 URL: https://issues.apache.org/jira/browse/SPARK-8706 Project: Spark Issue Type: New Feature Components: Project Infra, PySpark Reporter: Josh Rosen It would be nice to implement Pylint / Prospector (https://github.com/landscapeio/prospector) checks for PySpark. As with the style checker rules, I'll imagine that we'll want to roll out new rules gradually in order to avoid a mass refactoring commit. For starters, we should create a pull request that introduces the harness for running the linters, add a configuration file which enables only the lint checks that currently pass, and install the required dependencies on Jenkins. Once we've done this, we can open a series of smaller followup PRs to gradually enable more linting checks and to fix existing violations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8729) Spark app unable to instantiate the classes using the reflection
[ https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8729. -- Resolution: Not A Problem Spark app unable to instantiate the classes using the reflection Key: SPARK-8729 URL: https://issues.apache.org/jira/browse/SPARK-8729 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Murthy Chelankuri Priority: Critical Spark 1.3.0 is unable to instantiate classes using reflection (using Class.forName). It says the class is not found even though that class is available in the listed jars. The following is the exception I am getting on the executors java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at kafka.utils.Utils$.createObject(Utils.scala:438) at kafka.producer.Producer.init(Producer.scala:61) The application works fine without any issues with version 1.2.0. I am planning to upgrade to 1.3.0 and found that it is not working. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
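For readers hitting the same symptom, a hedged sketch of the usual workaround (not necessarily the resolution recorded on this ticket): jars shipped with --jars are visible to the executor's context classloader, so load the class through it rather than through Class.forName's default lookup. Only the class name below comes from the report; everything else is illustrative.
{code:title=scala}
import org.apache.spark.{SparkConf, SparkContext}

object ReflectionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reflection-demo"))
    sc.parallelize(1 to 1).foreach { _ =>
      // The context classloader can see jars distributed to the executor,
      // unlike the system classloader that Class.forName uses by default.
      val loader = Thread.currentThread().getContextClassLoader
      val cls = Class.forName("com.abc.mq.msg.ObjectEncoder", true, loader)
      println(cls.getName)
    }
    sc.stop()
  }
}
{code}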
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:10 AM: - Steps were added to encode and decode the data, so the performance will not be as fast as before. At the same time, the code also has a security issue, for example saving the plain text in a configuration file and finally using it as part of the key. If you use a better cipher solution, the performance downgrade will be minimized; I think AES is a bit heavy. Also, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software. was (Author: hujiayin): Steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key In the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop Though the API is public stable, however, you cannot ensure if the API will not be changed since it is not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3071) Increase default driver memory
[ https://issues.apache.org/jira/browse/SPARK-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3071. Resolution: Fixed Fix Version/s: 1.5.0 Increase default driver memory -- Key: SPARK-3071 URL: https://issues.apache.org/jira/browse/SPARK-3071 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.2 Reporter: Xiangrui Meng Assignee: Ilya Ganelin Fix For: 1.5.0 The current default is 512M, which is usually too small because the user also uses the driver to do some computation. In local mode, the executor memory setting is ignored and only driver memory is used, which provides more incentive to increase the default driver memory. I suggest: 1. 2GB in local mode, and warn users if executor memory is set to a bigger value; 2. the same as worker memory on an EC2 standalone server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
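A hedged sketch of how the value is typically raised today: {{spark.driver.memory}} must be known before the driver JVM starts, so in local mode it normally comes from spark-submit's --driver-memory flag or spark-defaults.conf rather than from application code; the snippet below only illustrates the proposed 2GB local-mode figure.
{code:title=scala}
import org.apache.spark.SparkConf

// Only effective if picked up before the driver JVM starts (e.g. via
// spark-submit --driver-memory 2g or spark-defaults.conf); shown here
// purely to illustrate the suggested 2GB local-mode default.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("driver-memory-demo")
  .set("spark.driver.memory", "2g")
{code}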
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782: --- Assignee: Apache Spark (was: Josh Rosen) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8782: --- Assignee: Josh Rosen (was: Apache Spark) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[ https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611543#comment-14611543 ] Apache Spark commented on SPARK-8782: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7179 GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes) Key: SPARK-8782 URL: https://issues.apache.org/jira/browse/SPARK-8782 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Queries containing ORDER BY NULL currently result in a code generation exception: {code} public SpecificOrdering generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { return new SpecificOrdering(expr); } class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { private org.apache.spark.sql.catalyst.expressions.Expression[] expressions = null; public SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { expressions = expr; } @Override public int compare(InternalRow a, InternalRow b) { InternalRow i = null; // Holds current row being evaluated. i = a; final Object primitive1 = null; i = b; final Object primitive3 = null; if (true true) { // Nothing } else if (true) { return -1; } else if (true) { return 1; } else { int comp = primitive1.compare(primitive3); if (comp != 0) { return comp; } } return 0; } } org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method named compare is not declared in any enclosing class nor any supertype, nor through a static import at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
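A minimal Scala reproduction sketch, assuming an existing SparkContext {{sc}} and a registered temporary table named {{t}} (both hypothetical); the failure occurs when Catalyst generates an ordering for the NullType sort key.
{code:title=scala}
import org.apache.spark.sql.SQLContext

// Assumes an existing SparkContext `sc` and a registered temp table "t".
val sqlContext = new SQLContext(sc)
sqlContext.sql("SELECT * FROM t ORDER BY NULL").collect()
{code}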
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611555#comment-14611555 ] Antony Mayi commented on SPARK-8708: bq. Antony Mayi In your real case, how many partitions did ALS.predictAll return? 512 partitions, of which 511 are empty and a single one holds all 13M ratings. MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades the performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, i.e. if running .predictAll() in a loop for thousands of products; it should rather be possible to do it somehow on the model, automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
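A Scala sketch of the interim workaround (the report above uses PySpark; the Scala MLlib API is analogous), assuming an existing SparkContext {{sc}}: explicitly repartition the predictions before further processing.
{code:title=scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Assumes an existing SparkContext `sc`.
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 1.0), Rating(1, 2, 2.0), Rating(2, 1, 2.0),
  Rating(2, 2, 2.0), Rating(3, 1, 1.0)), 5)
val model = ALS.trainImplicit(ratings, 1, 5)
val users = ratings.map(_.user).distinct()
val predictions = model.predict(users.map(u => (u, 2)))
// The predictions may land in a single partition, so spread them out
// explicitly before any further processing.
val balanced = predictions.repartition(ratings.partitions.length)
balanced.glom().map(_.length).collect()
{code}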
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611571#comment-14611571 ] Sean Owen commented on SPARK-8781: -- Does this affect release artifacts or just the snapshot? That commit doesn't look related since it doesn't touch the lines you reference here. Are you sure? If it's 'fixed' by changing it, maybe something else is at work? Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611584#comment-14611584 ] Stefano Parmesan commented on SPARK-8726: - I've created a pull request for this issue: https://github.com/mesos/spark-ec2/pull/128 Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
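Until the deploy script is fixed, a hedged sketch of the obvious workaround is to set the value explicitly rather than rely on the generated default; the 6g figure below is only an example for an m1.large worker, not a recommendation from this ticket.
{code:title=scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ec2-memory-demo")
  // Override the min(master RAM, worker RAM) default computed by the
  // launcher so executors can use most of the worker's 7.5 GB.
  .set("spark.executor.memory", "6g")
{code}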
[jira] [Commented] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611650#comment-14611650 ] Apache Spark commented on SPARK-8787: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/7183 Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial The parameter order of the @deprecated annotation in package object sql is wrong: deprecated("1.3.0", "use DataFrame"). This has to be changed to deprecated("use DataFrame", "1.3.0"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:03 AM: - Steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key In the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop Though the API is public stable, however, you cannot ensure if the API will not be changed since it is not the comercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES. CTR is one of the modes. We use two codec JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8688) Hadoop Configuration has to disable client cache when writing or reading delegation tokens.
[ https://issues.apache.org/jira/browse/SPARK-8688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8688. Resolution: Fixed Assignee: SaintBacchus Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Hadoop Configuration has to disable client cache when writing or reading delegation tokens. --- Key: SPARK-8688 URL: https://issues.apache.org/jira/browse/SPARK-8688 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0 In the classes *AMDelegationTokenRenewer* and *ExecutorDelegationTokenUpdater*, Spark will write and read the credentials. But if we don't disable the client cache via *fs.hdfs.impl.disable.cache*, Spark will use a cached FileSystem (which will use the old token) to upload or download the file. Then, when the old token expires, it can't gain the authorization to get/put on HDFS. (I only tested for a very short time with the configuration: dfs.namenode.delegation.token.renew-interval=3min dfs.namenode.delegation.token.max-lifetime=10min. I'm not sure whether it matters.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
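An illustrative Scala sketch of the behaviour described (not the patch itself; the HDFS path is hypothetical): with the client cache disabled, each lookup returns a fresh FileSystem that carries the current delegation token rather than a stale cached one.
{code:title=scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Ask for a fresh FileSystem instead of the process-wide cached one, so the
// current (renewed) delegation token is used for the credentials file.
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
val credentialsPath = new Path("hdfs:///user/spark/credentials") // hypothetical path
val fs = credentialsPath.getFileSystem(hadoopConf)
{code}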
[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin edited comment on SPARK-5682 at 7/2/15 6:02 AM: - steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said rely on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. was (Author: hujiayin): steps were added to encode and decode the data, the performance will not be fast than before, in the same time, codes also have security issue, for example save the plain text in configuration file and finally used as the part of the key in the same time, the feature based on hadoop 2.6, it is the limitation, that is why i said reply on hadoop though it is public stable, however, you cannot ensure if the api will not be changed since it was not the comercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES. CTR is one of the modes. We use two codec JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms jdk provides while OpensslAesCtrCryptoCodec uses encrypted algorithms openssl provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8784) Add python API for hex/unhex
Davies Liu created SPARK-8784: - Summary: Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611600#comment-14611600 ] Konstantin Shaposhnikov commented on SPARK-8781: I believe this will affect both released and SNAPSHOT artefacts. Basically, as part of SPARK-3812 the build was changed to deploy effective POMs into the maven repository. E.g. in https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/1.4.0/spark-core_2.11-1.4.0.pom you won't find {{$\{scala.binary.version}}; it was resolved to 2.11 by maven during the build. This is required for the Scala 2.11 build to make sure that jars built with Scala 2.11 reference Scala 2.11 jars (e.g. spark-core_2.11 should depend on spark-launcher_2.11, not on spark-launcher_2.10). By default {{$\{scala.binary.version}} will be resolved to 2.10 because the scala-2.10 maven profile is active by default. Publishing of effective POMs is implemented using maven-shade-plugin. To be honest I am not sure how exactly it works. However, when I removed the following line from the parent POM {{<createDependencyReducedPom>false</createDependencyReducedPom>}} the build started to deploy effective POMs again. I hope my explanation helps. Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611602#comment-14611602 ] Davies Liu commented on SPARK-8632: --- [~justin.uang] Sounds interesting, could you sending out the PR? Poor Python UDF performance because of RDD caching -- Key: SPARK-8632 URL: https://issues.apache.org/jira/browse/SPARK-8632 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Justin Uang {quote} We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data. One to give to the PythonRDD, then one to join the python lambda results with the original row (which may have java objects that should be passed through). In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the cases I was working with, I had a 500 column table, and i wanted to use a python UDF for one column, and it ended up caching all 500 columns. {quote} http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611605#comment-14611605 ] Apache Spark commented on SPARK-8785: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7182 Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611606#comment-14611606 ] Konstantin Shaposhnikov commented on SPARK-8781: The original commit that added effective POM publishing: https://github.com/apache/spark/commit/6e09c98b5d7ad92cf01a3b415008f48782f2f1a3 Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8785: --- Assignee: Apache Spark Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8785: --- Assignee: (was: Apache Spark) Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611547#comment-14611547 ] liyunzhang_intel commented on SPARK-5682: - [~hujiayin]: thanks for your comment. This feature is not based on hadoop 2.6; it was based on hadoop 2.6 in the original design. The latest design doc (20150506) shows that there are now two ways to implement encrypted shuffle in spark. Currently we only implement it on the spark-on-yarn framework. One is based on [Chimera (a project which strips the code related to CryptoInputStream/CryptoOutputStream from Hadoop to facilitate AES-NI based data encryption in other projects)|https://github.com/intel-hadoop/chimera] (see https://github.com/apache/spark/pull/5307). In the other, we implement all the crypto classes like CryptoInputStream/CryptoOutputStream in scala under the core/src/main/scala/org/apache/spark/crypto/ package (see https://github.com/apache/spark/pull/4491). As for the problem of importing the hadoop API in spark: if the interface of a hadoop class is public and stable, it can be used in spark. https://hadoop.apache.org/docs/current/api/org/apache/hadoop/classification/InterfaceStability.html says: {quote} Incompatible changes must not be made to classes marked as stable. {quote} which means that when a class is marked stable, later releases will not change it. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
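For readers unfamiliar with the stream wrapping involved, a small illustrative Scala sketch (not the proposed patch) of AES/CTR encryption with the JDK's JCE classes, the kind of codec the JceAesCtrCryptoCodec-based design describes; the all-zero key and IV are placeholders, not how keys would actually be managed.
{code:title=scala}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Placeholder key and IV (all zeros) purely for demonstration.
val key = new SecretKeySpec(new Array[Byte](16), "AES")
val iv  = new IvParameterSpec(new Array[Byte](16))
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, iv)

// Wrap an ordinary output stream so bytes are encrypted as they are written,
// which is the general shape of an encrypted shuffle output stream.
val sink = new ByteArrayOutputStream()
val encrypted = new CipherOutputStream(sink, cipher)
encrypted.write("shuffle bytes".getBytes("UTF-8"))
encrypted.close()
{code}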
[jira] [Created] (SPARK-8785) Improve Parquet schema merging
Liang-Chi Hsieh created SPARK-8785: -- Summary: Improve Parquet schema merging Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently, the parquet schema merging (ParquetRelation2.readSchema) may spend much time to merge duplicate schema. We can select only non duplicate schema and merge them later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
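An illustrative Scala sketch of the proposed idea; {{mergeTwo}} below is a hypothetical stand-in for the existing pairwise merge in ParquetRelation2.readSchema, and the point is only that deduplicating first shrinks the number of merges.
{code:title=scala}
import org.apache.spark.sql.types.StructType

// `mergeTwo` stands in for the existing pairwise schema merge; deduplicate
// the per-file schemas first so only distinct schemas are merged.
def mergeAll(schemas: Seq[StructType],
             mergeTwo: (StructType, StructType) => StructType): Option[StructType] =
  schemas.distinct.reduceOption(mergeTwo)
{code}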
[jira] [Created] (SPARK-8786) Create a wrapper for BinaryType
Davies Liu created SPARK-8786: - Summary: Create a wrapper for BinaryType Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8786) Create a wrapper for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8786: -- Description: The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. (was: The hashCode and equals() of Array[Byte] does check the bytes, we should create a wrapper to do that.) Create a wrapper for BinaryType --- Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
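A minimal Scala sketch of what such a wrapper could look like (illustrative only, not the actual implementation): equality and hashing are delegated to the byte contents via java.util.Arrays, unlike Array[Byte] which uses reference identity.
{code:title=scala}
import java.util.Arrays

// Wraps a byte array so that equals/hashCode look at the contents.
final class BinaryWrapper(val bytes: Array[Byte]) {
  override def hashCode(): Int = Arrays.hashCode(bytes)
  override def equals(other: Any): Boolean = other match {
    case that: BinaryWrapper => Arrays.equals(this.bytes, that.bytes)
    case _ => false
  }
}
{code}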
[jira] [Commented] (SPARK-8781) Published POMs are no longer effective POMs
[ https://issues.apache.org/jira/browse/SPARK-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611618#comment-14611618 ] Sean Owen commented on SPARK-8781: -- Right, I get all that. Yes, that makes it clear what the connection is to https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 -- it's the createDependencyReducedPom issue, maybe. [~andrewor14] do you have more color on why that bit was needed? Published POMs are no longer effective POMs Key: SPARK-8781 URL: https://issues.apache.org/jira/browse/SPARK-8781 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.2, 1.4.1, 1.5.0 Reporter: Konstantin Shaposhnikov POMs published to the maven repository are no longer effective POMs. E.g. in https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/1.4.2-SNAPSHOT/spark-core_2.11-1.4.2-20150702.043114-52.pom: {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_${scala.binary.version}</artifactId> <version>${project.version}</version> </dependency> ... {noformat} while it should be {noformat} ... <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-launcher_2.11</artifactId> <version>${project.version}</version> </dependency> ... {noformat} The following commits are most likely the cause of it: - for branch-1.3: https://github.com/apache/spark/commit/ce137b8ed3b240b7516046699ac96daa55ddc129 - for branch-1.4: https://github.com/apache/spark/commit/84da653192a2d9edb82d0dbe50f577c4dc6a0c78 - for master: https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724 On branch-1.4 reverting the commit fixed the issue. See SPARK-3812 for additional details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8787) Change the parameter order of @deprecated in package object sql
Vinod KC created SPARK-8787: --- Summary: Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial The parameter order of the @deprecated annotation in package object sql is wrong: deprecated("1.3.0", "use DataFrame"). This has to be changed to deprecated("use DataFrame", "1.3.0"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
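For reference, a sketch of the corrected annotation as it might look on the deprecated SchemaRDD alias; the surrounding package object here is illustrative, not Spark's actual one. In Scala, @deprecated takes (message, since), so the version string goes second.
{code:title=scala}
package object example {
  import org.apache.spark.sql.DataFrame

  // Correct argument order: the message comes first, the version second.
  @deprecated("use DataFrame", "1.3.0")
  type SchemaRDD = DataFrame
}
{code}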
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611527#comment-14611527 ] hujiayin commented on SPARK-5682: - Steps were added to encode and decode the data, so the performance will not be as fast as before. At the same time, the code also has a security issue, for example saving the plain text in a configuration file and finally using it as part of the key. Also, the feature is based on hadoop 2.6, which is a limitation; that is why I said it relies on hadoop. Though the API is public and stable, you cannot ensure that it will not be changed, since it is not commercial software. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in hadoop 2.6, which makes the process of shuffling data safer. This feature is necessary in spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable spark encrypted shuffle; they are also used in hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because ugi credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8754) YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
[ https://issues.apache.org/jira/browse/SPARK-8754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8754. Resolution: Fixed Fix Version/s: 1.4.2 1.5.0 Target Version/s: 1.5.0, 1.4.2 YarnClientSchedulerBackend doesn't stop gracefully in failure conditions Key: SPARK-8754 URL: https://issues.apache.org/jira/browse/SPARK-8754 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Reporter: Devaraj K Priority: Minor Fix For: 1.5.0, 1.4.2 {code:xml} java.lang.NullPointerException at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:151) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:421) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1447) at org.apache.spark.SparkContext.stop(SparkContext.scala:1651) at org.apache.spark.SparkContext.init(SparkContext.scala:572) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:621) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} If the application has FINISHED/FAILED/KILLED or failed to launch the application master, monitorThread is not initialized, but monitorThread.interrupt() is invoked as part of stop() without any check. This causes the NPE and also prevents the client from stopping. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
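A hedged sketch of the guard the description calls for; the names are illustrative, not the actual YarnClientSchedulerBackend fields.
{code:title=scala}
// Names are illustrative, not the actual YarnClientSchedulerBackend members.
object StopGuardSketch {
  @volatile private var monitorThread: Thread = null

  def stop(): Unit = {
    // Guard against the case where the application failed before the
    // monitor thread was ever started, which would otherwise cause an NPE.
    if (monitorThread != null) {
      monitorThread.interrupt()
    }
  }
}
{code}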
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687: - Fix Version/s: 1.4.2 Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0, 1.4.2 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8687) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor.
[ https://issues.apache.org/jira/browse/SPARK-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8687: - Target Version/s: 1.5.0, 1.4.2 (was: 1.5.0) Spark on yarn-client mode can't send `spark.yarn.credentials.file` to executor. --- Key: SPARK-8687 URL: https://issues.apache.org/jira/browse/SPARK-8687 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.5.0 Reporter: SaintBacchus Assignee: SaintBacchus Fix For: 1.5.0, 1.4.2 Yarn will set +spark.yarn.credentials.file+ after *DriverEndpoint* is initialized, so the executor will fetch the old configuration, which causes the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8771. Resolution: Fixed Assignee: holdenk Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Actor system deprecation tag uses deprecated deprecation tag Key: SPARK-8771 URL: https://issues.apache.org/jira/browse/SPARK-8771 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: holdenk Assignee: holdenk Priority: Trivial Fix For: 1.5.0 The deprecation of the actor system adds a spurious build warning: {quote} @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(Actor system is no longer supported as of 1.4) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
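An illustrative sketch of the fix the warning asks for, supplying both the message and the since-version arguments; the enclosing object and the method body are placeholders, not the actual SparkEnv code.
{code:title=scala}
object DeprecationSketch {
  // Supplying both the message and the "since" version avoids the
  // "@deprecated now takes two arguments" build warning.
  @deprecated("Actor system is no longer supported as of 1.4", "1.4")
  def actorSystem: AnyRef = throw new UnsupportedOperationException("no longer supported")
}
{code}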
[jira] [Updated] (SPARK-8771) Actor system deprecation tag uses deprecated deprecation tag
[ https://issues.apache.org/jira/browse/SPARK-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-8771: - Affects Version/s: 1.4.0 Actor system deprecation tag uses deprecated deprecation tag Key: SPARK-8771 URL: https://issues.apache.org/jira/browse/SPARK-8771 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: holdenk Priority: Trivial The deprecation of the actor system adds a spurious build warning: {quote} @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(Actor system is no longer supported as of 1.4) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783: --- Assignee: Apache Spark CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Assignee: Apache Spark Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8783: --- Assignee: (was: Apache Spark) CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8783) CTAS with WITH clause does not work
[ https://issues.apache.org/jira/browse/SPARK-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611556#comment-14611556 ] Apache Spark commented on SPARK-8783: - User 'sirpkt' has created a pull request for this issue: https://github.com/apache/spark/pull/7180 CTAS with WITH clause does not work --- Key: SPARK-8783 URL: https://issues.apache.org/jira/browse/SPARK-8783 Project: Spark Issue Type: Bug Components: SQL Reporter: Keuntae Park Priority: Minor Following CTAS with WITH clause query {code} CREATE TABLE with_table1 AS WITH T AS ( SELECT * FROM table1 ) SELECT * FROM T {code} induces following error {code} no such table T; line 7 pos 5 org.apache.spark.sql.AnalysisException: no such table T; line 7 pos 5 ... {code} I think that WITH clause within CTAS is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8784: --- Assignee: Apache Spark (was: Davies Liu) Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8784: --- Assignee: Davies Liu (was: Apache Spark) Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8784) Add python API for hex/unhex
[ https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611575#comment-14611575 ] Apache Spark commented on SPARK-8784: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7181 Add python API for hex/unhex Key: SPARK-8784 URL: https://issues.apache.org/jira/browse/SPARK-8784 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
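For reference, the Python wrappers would mirror the behaviour of the SQL expressions; a rough sketch of that behaviour, assuming the functions follow their Hive counterparts (they were still being wired up when this sub-task was filed) and that {{sqlContext}} is available as in the shell:
{code}
// hex converts numbers or strings to their hexadecimal representation,
// unhex converts a hex string back to the underlying bytes.
sqlContext.sql("SELECT hex(17), hex('Spark'), unhex('537061726B')").show()
// hex(17) -> "11"; hex('Spark') -> "537061726B"; unhex reverses it.
{code}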
[jira] [Resolved] (SPARK-8691) Enable GZip for Web UI
[ https://issues.apache.org/jira/browse/SPARK-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8691. -- Resolution: Duplicate Enable GZip for Web UI -- Key: SPARK-8691 URL: https://issues.apache.org/jira/browse/SPARK-8691 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Shixiong Zhu When there are massive tasks in the stage page (such as, running {{sc.parallelize(1 to 10, 1).count()}}), the size of the stage page is large. Enabling GZip can reduce the size significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
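To get a feel for the potential saving, highly repetitive markup such as a huge task table compresses extremely well; an illustration using the plain JDK gzip stream (this is not the actual Jetty wiring, just a sizing sketch):
{code}
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// A stand-in for a stage page with ~100k near-identical table rows.
val page = ("<tr><td>task</td><td>SUCCESS</td><td>10 s</td></tr>" * 100000).getBytes("UTF-8")
val bos = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(bos)
gz.write(page)
gz.close()
println(s"raw = ${page.length} bytes, gzipped = ${bos.size()} bytes")
{code}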
[jira] [Commented] (SPARK-6573) Convert inbound NaN values as null
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611565#comment-14611565 ] Josh Rosen commented on SPARK-6573: --- NaN can lead to confusing exceptions during sorting if it appears in a column. I just ran into an issue where Sort threw a Comparison method violates its general contract! error for data containing NaN columns. See my comments at https://github.com/apache/spark/pull/7179#discussion_r33749911 Convert inbound NaN values as null -- Key: SPARK-6573 URL: https://issues.apache.org/jira/browse/SPARK-6573 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Fabian Boehnlein In pandas it is common to use numpy.nan as the null value, for missing data or whatever. http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna createDataFrame however only works with None as null values, parsing them as None in the RDD. I suggest to add support for np.nan values in pandas DataFrames. current stracktrace when calling a DataFrame with object type columns with np.nan values (which are floats) {code} TypeError Traceback (most recent call last) ipython-input-38-34f0263f0bf4 in module() 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema) /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio) 339 schema = self._inferSchema(data.map(lambda r: row_cls(*r)), samplingRatio) 340 -- 341 return self.applySchema(data, schema) 342 343 def registerDataFrameAsTable(self, rdd, tableName): /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in applySchema(self, rdd, schema) 246 247 for row in rows: -- 248 _verify_type(row, schema) 249 250 # convert python objects to sql data /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1064 length of fields (%d) % (len(obj), len(dataType.fields))) 1065 for v, f in zip(obj, dataType.fields): - 1066 _verify_type(v, f.dataType) 1067 1068 _cached_cls = weakref.WeakValueDictionary() /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in _verify_type(obj, dataType) 1048 if type(obj) not in _acceptable_types[_type]: 1049 raise TypeError(%s can not accept object in type %s - 1050 % (dataType, type(obj))) 1051 1052 if isinstance(dataType, ArrayType): TypeError: StringType can not accept object in type type 'float'{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
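The contract violation is easy to reproduce in isolation, because every ordered comparison against NaN returns false, so a comparator written with < and > treats NaN as equal to everything while other pairs still compare unequal; a minimal illustration (not Spark's actual comparator):
{code}
val cmp = new java.util.Comparator[Double] {
  // Naive comparator: looks fine until NaN shows up.
  override def compare(a: Double, b: Double): Int =
    if (a < b) -1 else if (a > b) 1 else 0
}
println(cmp.compare(1.0, Double.NaN)) // 0, i.e. "equal"
println(cmp.compare(Double.NaN, 2.0)) // 0, i.e. "equal"
println(cmp.compare(1.0, 2.0))        // -1, contradicting the two results above
{code}
TimSort detects exactly this kind of inconsistency and throws the general-contract error.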
[jira] [Commented] (SPARK-8773) Throw type mismatch in check analysis for expressions with expected input types defined
[ https://issues.apache.org/jira/browse/SPARK-8773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611623#comment-14611623 ] Akhil Thatipamula commented on SPARK-8773: -- [~rxin] aren't we checking that already, |case e: Expression if e.checkInputDataTypes().isFailure| am I missing something? Throw type mismatch in check analysis for expressions with expected input types defined --- Key: SPARK-8773 URL: https://issues.apache.org/jira/browse/SPARK-8773 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8740) Support GitHub OAuth tokens in dev/merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-8740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-8740. Resolution: Fixed Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Support GitHub OAuth tokens in dev/merge_spark_pr.py Key: SPARK-8740 URL: https://issues.apache.org/jira/browse/SPARK-8740 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor Fix For: 1.5.0 We should allow dev/merge_spark_pr.py to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
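For reference, GitHub's API accepts a personal access token in the Authorization header, which is what lets authenticated requests escape the per-IP limit; the script itself is Python, so this is only an illustration of the header, with a placeholder token:
{code}
val url = new java.net.URL("https://api.github.com/rate_limit")
val conn = url.openConnection().asInstanceOf[java.net.HttpURLConnection]
conn.setRequestProperty("Authorization", "token <personal-access-token>")
println(conn.getResponseCode) // authenticated requests get a much higher quota
{code}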
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611553#comment-14611553 ] hujiayin commented on SPARK-5682: - Since the encrypted shuffle in Spark focuses on the common module, it may not be good to use the Hadoop API. On the other hand, the AES solution is a bit heavy for encoding/decoding live streaming data. Add encrypted shuffle in spark -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx, Design Document of Encrypted Spark Shuffle_20150318.docx, Design Document of Encrypted Spark Shuffle_20150402.docx, Design Document of Encrypted Spark Shuffle_20150506.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data process safer. This feature is necessary in Spark. AES is a specification for the encryption of electronic data. There are 5 common modes in AES; CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses encryption algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses encryption algorithms OpenSSL provides. Because UGI credential info is used in the process of encrypted shuffle, we first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
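For context, the JDK already ships AES-CTR through JCE, which is the kind of primitive a JCE-based codec builds on; a minimal sketch, with key and IV handling that is purely illustrative and not how credentials would actually be managed:
{code}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// 128-bit key and counter IV, hard-coded only for illustration.
val key = new SecretKeySpec(Array.fill[Byte](16)(1.toByte), "AES")
val iv  = new IvParameterSpec(new Array[Byte](16))
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, iv)

val out = new ByteArrayOutputStream()
val enc = new CipherOutputStream(out, cipher)
enc.write("shuffle block bytes".getBytes("UTF-8"))
enc.close()
println(s"ciphertext length = ${out.size()}")
{code}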
[jira] [Assigned] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8787: --- Assignee: Apache Spark Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Assignee: Apache Spark Priority: Trivial Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8787: --- Assignee: (was: Apache Spark) Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
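For reference, Scala's {{@deprecated}} annotation takes the message first and the version second, so the corrected form looks like the following (the member shown is a placeholder, not the actual definition in the package object):
{code}
// Signature is @deprecated(message: String, since: String).
@deprecated("use DataFrame", "1.3.0")
def oldApi(): Unit = ()
{code}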
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611747#comment-14611747 ] Vincent Warmerdam commented on SPARK-8596: -- Cool, I would love to hear your end of the story. It seems the only remaining hurdle is getting the script to work. On a slightly different subject: I'm not just a frequent R user, I do a lot of Python as well. Is there a similar ticket for the IPython (Jupyter) notebook? It seems like the most appropriate GUI for the Python language. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611666#comment-14611666 ] Apache Spark commented on SPARK-8788: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7184 Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8788: --- Assignee: (was: Apache Spark) Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8788) Java unit test for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8788: --- Assignee: Apache Spark Java unit test for PCA transformer -- Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Apache Spark Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
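The test would exercise the same builder-style API the Scala suite uses; a rough sketch of those calls (shown in Scala for brevity, assuming the spark.ml PCA transformer as it was being added for 1.5 and a {{sqlContext}} as in the shell):
{code}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

// Two toy rows with 5 features, projected down to 2 principal components.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
val df = sqlContext.createDataFrame(data).toDF("features")

val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2)
val model = pca.fit(df)
model.transform(df).select("pcaFeatures").show()
{code}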
[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam commented on SPARK-8684: -- Mhm... I've tried multiple approaches. My colleague even had a look at it and it left him without a clue. I made a Stack Overflow question for advice: http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the Amazon AMI forces you to use the Amazon repos if the package you need is also available in the Amazon package system... which only has old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching an EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam edited comment on SPARK-8684 at 7/2/15 9:10 AM: -- Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Does anybody know of a place where we could ask amazon to just add it? was (Author: cantdutchthis): Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8244: --- Assignee: Cheng Hao (was: Apache Spark) string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611698#comment-14611698 ] Apache Spark commented on SPARK-8244: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/7186 string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
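Based on the Hive semantics quoted in the description, the expected behaviour once the expression lands would be the following (SQL only, per the ticket; shown through {{sqlContext.sql}} purely for illustration):
{code}
sqlContext.sql("SELECT find_in_set('ab', 'abc,b,ab,c,def')").show()   // 3
sqlContext.sql("SELECT find_in_set('ab,c', 'abc,b,ab,c,def')").show() // 0, the needle contains a comma
sqlContext.sql("SELECT find_in_set(null, 'abc,b,ab,c,def')").show()   // null
{code}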
[jira] [Commented] (SPARK-8389) Expose KafkaRDDs offsetRange in Python
[ https://issues.apache.org/jira/browse/SPARK-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611669#comment-14611669 ] Apache Spark commented on SPARK-8389: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/7185 Expose KafkaRDDs offsetRange in Python -- Key: SPARK-8389 URL: https://issues.apache.org/jira/browse/SPARK-8389 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.0 Reporter: Tathagata Das Priority: Critical Probably requires creating a JavaKafkaPairRDD and also use that in the python APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
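For reference, the Scala side already exposes offset ranges on the RDDs produced by the direct Kafka API, which is what the Python API would need to mirror; a sketch, assuming a direct stream created with KafkaUtils.createDirectStream:
{code}
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // Each KafkaRDD carries the offset ranges it was built from.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
}
{code}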
[jira] [Comment Edited] (SPARK-8684) Update R version in Spark EC2 AMI
[ https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611680#comment-14611680 ] Vincent Warmerdam edited comment on SPARK-8684 at 7/2/15 9:09 AM: -- Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Made a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. was (Author: cantdutchthis): Mhm... I've tried multiple approaches. My collegue even had a look at it and left him without a clue. Make a stackoverflow question for advice. http://stackoverflow.com/questions/31180061/r-3-2-on-aws-ami I get the impression that the amazon AMI forces you to use the amazon repos if the package you need is also available in the amazon package system... which only have the old versions. Update R version in Spark EC2 AMI - Key: SPARK-8684 URL: https://issues.apache.org/jira/browse/SPARK-8684 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the R version in the AMI is 3.1 -- However a number of R libraries need R version 3.2 and it will be good to update the R version on the AMI while launching a EC2 cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8463) No suitable driver found for write.jdbc
[ https://issues.apache.org/jira/browse/SPARK-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611670#comment-14611670 ] Reynold Xin commented on SPARK-8463: [~mlety2] can you test this the patch created by [~viirya]? No suitable driver found for write.jdbc --- Key: SPARK-8463 URL: https://issues.apache.org/jira/browse/SPARK-8463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Environment: Mesos, Ubuntu Reporter: Matthew Jones I am getting a java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test when using df.write.jdbc. I do not get this error when reading from the same database. This simple script can repeat the problem. First one must create a database called test with a table called table1 and insert some rows in it. The user test:secret must have read/write permissions. *testJDBC.scala:* import java.util.Properties import org.apache.spark.sql.Row import java.sql.Struct import org.apache.spark.sql.types.\{StructField, StructType, IntegerType, StringType} import org.apache.spark.\{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext val properties = new Properties() properties.setProperty(user, test) properties.setProperty(password, secret) val readTable = sqlContext.read.jdbc(jdbc:mysql://dbhost/test, table1, properties) print(readTable.show()) val rows = sc.parallelize(List(Row(1, write), Row(2, me))) val writeTable = sqlContext.createDataFrame(rows, StructType(List(StructField(id, IntegerType), StructField(name, StringType writeTable.write.jdbc(jdbc:mysql://dbhost/test, table2, properties)}} This is run using: {{spark-shell --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35-bin.jar --driver-class-path /path/to/mysql-connector-java-5.1.35-bin.jar --jars /path/to/mysql-connector-java-5.1.35-bin.jar -i:testJDBC.scala}} The read works fine and will print the rows in the table. The write fails with {{java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test}}. The new table is successfully created but it is empty. I have tested this on a Mesos cluster with Spark 1.4.0 and the current master branch as of June 18th. In the executor logs I do see before the error: INFO Utils: Fetching http://146.203.54.236:50624/jars/mysql-connector-java-5.1.35-bin.jar INFO Executor: Adding file:/tmp/mesos/slaves/.../mysql-connector-java-5.1.35-bin.jar to class loader A workaround is to add the mysql-connector-java-5.1.35-bin.jar to the same location on each executor node as defined in spark.executor.extraClassPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Parmesan updated SPARK-8726: Description: _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). was:By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Parmesan updated SPARK-8726: Description: _(this is a mirror of [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). was: _(this is a mirror of [SPARK-8726|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan _(this is a mirror of [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
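Until the template is fixed, a workaround is to set the executor memory explicitly in the application instead of relying on the EC2 default; the value below is illustrative only:
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "6g") // override the min(master, slave) default
{code}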
[jira] [Updated] (SPARK-8463) No suitable driver found for write.jdbc
[ https://issues.apache.org/jira/browse/SPARK-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8463: --- Shepherd: Reynold Xin Assignee: Liang-Chi Hsieh Target Version/s: 1.5.0, 1.4.2 No suitable driver found for write.jdbc --- Key: SPARK-8463 URL: https://issues.apache.org/jira/browse/SPARK-8463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Environment: Mesos, Ubuntu Reporter: Matthew Jones Assignee: Liang-Chi Hsieh I am getting a java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test when using df.write.jdbc. I do not get this error when reading from the same database. This simple script can repeat the problem. First one must create a database called test with a table called table1 and insert some rows in it. The user test:secret must have read/write permissions. *testJDBC.scala:* import java.util.Properties import org.apache.spark.sql.Row import java.sql.Struct import org.apache.spark.sql.types.\{StructField, StructType, IntegerType, StringType} import org.apache.spark.\{SparkConf, SparkContext} import org.apache.spark.sql.SQLContext val properties = new Properties() properties.setProperty(user, test) properties.setProperty(password, secret) val readTable = sqlContext.read.jdbc(jdbc:mysql://dbhost/test, table1, properties) print(readTable.show()) val rows = sc.parallelize(List(Row(1, write), Row(2, me))) val writeTable = sqlContext.createDataFrame(rows, StructType(List(StructField(id, IntegerType), StructField(name, StringType writeTable.write.jdbc(jdbc:mysql://dbhost/test, table2, properties)}} This is run using: {{spark-shell --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.35-bin.jar --driver-class-path /path/to/mysql-connector-java-5.1.35-bin.jar --jars /path/to/mysql-connector-java-5.1.35-bin.jar -i:testJDBC.scala}} The read works fine and will print the rows in the table. The write fails with {{java.sql.SQLException: No suitable driver found for jdbc:mysql://dbhost/test}}. The new table is successfully created but it is empty. I have tested this on a Mesos cluster with Spark 1.4.0 and the current master branch as of June 18th. In the executor logs I do see before the error: INFO Utils: Fetching http://146.203.54.236:50624/jars/mysql-connector-java-5.1.35-bin.jar INFO Executor: Adding file:/tmp/mesos/slaves/.../mysql-connector-java-5.1.35-bin.jar to class loader A workaround is to add the mysql-connector-java-5.1.35-bin.jar to the same location on each executor node as defined in spark.executor.extraClassPath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8244) string function: find_in_set
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8244: --- Assignee: Apache Spark (was: Cheng Hao) string function: find_in_set Key: SPARK-8244 URL: https://issues.apache.org/jira/browse/SPARK-8244 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Minor find_in_set(string str, string strList): int Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8788) Java unit test for PCA transformer
Yanbo Liang created SPARK-8788: -- Summary: Java unit test for PCA transformer Key: SPARK-8788 URL: https://issues.apache.org/jira/browse/SPARK-8788 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Java unit test for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7401) Dot product and squared_distances should be vectorized in Vectors
[ https://issues.apache.org/jira/browse/SPARK-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7401: --- Priority: Major (was: Minor) Dot product and squared_distances should be vectorized in Vectors - Key: SPARK-7401 URL: https://issues.apache.org/jira/browse/SPARK-7401 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611811#comment-14611811 ] David Sabater commented on SPARK-8616: -- I would assume the error here is the lack of support for columns containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. SQLContext doesn't handle tricky column names when loading from JDBC Key: SPARK-8616 URL: https://issues.apache.org/jira/browse/SPARK-8616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 Reporter: Gergely Svigruha Reproduce: - create a table in a relational database (in my case sqlite) with a column name containing a space: CREATE TABLE my_table (id INTEGER, tricky column TEXT); - try to create a DataFrame using that table: sqlContext.read.format(jdbc).options(Map( url - jdbs:sqlite:..., dbtable - my_table)).load() java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such column: tricky) According to the SQL spec this should be valid: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8616) SQLContext doesn't handle tricky column names when loading from JDBC
[ https://issues.apache.org/jira/browse/SPARK-8616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611811#comment-14611811 ] David Sabater edited comment on SPARK-8616 at 7/2/15 11:39 AM: --- I would assume the error here is the lack of support for column names containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. was (Author: dsdinter): I would assume the error here is the lack of support for columns containing characters like ,;{}() = (This includes whitespaces which was my initial issue) If we are ok restricting this we just need to improve the error message when the exception is raised. I would suggest to revisit this in the Maillist to see what are the opinions out there. SQLContext doesn't handle tricky column names when loading from JDBC Key: SPARK-8616 URL: https://issues.apache.org/jira/browse/SPARK-8616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Sqlite 3.8.7, Spark 1.4.0 Reporter: Gergely Svigruha Reproduce: - create a table in a relational database (in my case sqlite) with a column name containing a space: CREATE TABLE my_table (id INTEGER, tricky column TEXT); - try to create a DataFrame using that table: sqlContext.read.format(jdbc).options(Map( url - jdbs:sqlite:..., dbtable - my_table)).load() java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such column: tricky) According to the SQL spec this should be valid: http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
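The SQL-99 reference boils down to delimited (double-quoted) identifiers: at the plain JDBC level the query works once the column is quoted, which is what the generated projection would need to emit. Connection string and driver availability are assumed for illustration:
{code}
// Requires the SQLite JDBC driver on the classpath; path is illustrative.
val conn = java.sql.DriverManager.getConnection("jdbc:sqlite:/tmp/test.db")
val rs = conn.createStatement()
  .executeQuery("""SELECT "tricky column" FROM my_table""")
while (rs.next()) println(rs.getString("tricky column"))
{code}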
[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException
[ https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611960#comment-14611960 ] Daniel Darabos commented on SPARK-5945: --- At the moment we have a ton of these infinite retries. A stage is retried a few dozen times, then its parent goes missing and Spark starts retrying the parent until it also goes missing... We are still debugging the cause of our fetch failures, but I just wanted to mention that if there were a {{spark.stage.maxFailures}} option, we would be setting it to 1 at this point. Thanks for all the work on this bug. Even if it's not fixed yet, it's very informative. Spark should not retry a stage infinitely on a FetchFailedException --- Key: SPARK-5945 URL: https://issues.apache.org/jira/browse/SPARK-5945 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid Assignee: Ilya Ganelin While investigating SPARK-5928, I noticed some very strange behavior in the way spark retries stages after a FetchFailedException. It seems that on a FetchFailedException, instead of simply killing the task and retrying, Spark aborts the stage and retries. If it just retried the task, the task might fail 4 times and then trigger the usual job killing mechanism. But by killing the stage instead, the max retry logic is skipped (it looks to me like there is no limit for retries on a stage). After a bit of discussion with Kay Ousterhout, it seems the idea is that if a fetch fails, we assume that the block manager we are fetching from has failed, and that it will succeed if we retry the stage w/out that block manager. In that case, it wouldn't make any sense to retry the task, since its doomed to fail every time, so we might as well kill the whole stage. But this raises two questions: 1) Is it really safe to assume that a FetchFailedException means that the BlockManager has failed, and ti will work if we just try another one? SPARK-5928 shows that there are at least some cases where that assumption is wrong. Even if we fix that case, this logic seems brittle to the next case we find. I guess the idea is that this behavior is what gives us the R in RDD ... but it seems like its not really that robust and maybe should be reconsidered. 2) Should stages only be retried a limited number of times? It would be pretty easy to put in a limited number of retries per stage. Though again, we encounter issues with keeping things resilient. Theoretically one stage could have many retries, but due to failures in different stages further downstream, so we might need to track the cause of each retry as well to still have the desired behavior. In general it just seems there is some flakiness in the retry logic. This is the only reproducible example I have at the moment, but I vaguely recall hitting other cases of strange behavior w/ retries when trying to run long pipelines. Eg., if one executor is stuck in a GC during a fetch, the fetch fails, but the executor eventually comes back and the stage gets retried again, but the same GC issues happen the second time around, etc. Copied from SPARK-5928, here's the example program that can regularly produce a loop of stage failures. 
Note that it will only fail from a remote fetch, so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611849#comment-14611849 ] KaiXinXIaoLei commented on SPARK-2319: -- Using the latest version (1.4), I also met the same problem. Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8787: - Assignee: Vinod KC Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Assignee: Vinod KC Priority: Trivial Fix For: 1.5.0, 1.4.2 Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8787) Change the parameter order of @deprecated in package object sql
[ https://issues.apache.org/jira/browse/SPARK-8787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8787. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7183 [https://github.com/apache/spark/pull/7183] Change the parameter order of @deprecated in package object sql Key: SPARK-8787 URL: https://issues.apache.org/jira/browse/SPARK-8787 Project: Spark Issue Type: Improvement Components: SQL Reporter: Vinod KC Priority: Trivial Fix For: 1.4.2, 1.5.0 Parameter order of @deprecated annotation in package object sql is wrong deprecated(1.3.0, use DataFrame) . This has to be changed to deprecated(use DataFrame, 1.3.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8791) Make a better hashcode for InternalRow
Cheng Hao created SPARK-8791: Summary: Make a better hashcode for InternalRow Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
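The usual fix is the standard combine-field-hashes pattern, with array-typed values hashed by content instead of by reference; a generic sketch, not the actual InternalRow code:
{code}
// Fold a hash across all fields, delegating arrays to java.util.Arrays so
// that complex values with equal contents produce equal hash codes.
def hashOf(values: Seq[Any]): Int =
  values.foldLeft(37) { (h, v) =>
    val fieldHash = v match {
      case null             => 0
      case a: Array[Byte]   => java.util.Arrays.hashCode(a)
      case a: Array[AnyRef] => java.util.Arrays.deepHashCode(a)
      case other            => other.hashCode()
    }
    31 * h + fieldHash
  }
{code}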
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2319) Number of tasks on executors become negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611856#comment-14611856 ] KaiXinXIaoLei edited comment on SPARK-2319 at 7/2/15 12:08 PM: --- Using 1.4, num active tasks become negative, and Complete Tasks is more bigger than Total Tasks was (Author: kaixinxiaolei): Using 1.4, num active tasks become negative, and Complete Tasks is more bigger then Total Tasks Number of tasks on executors become negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: active tasks.png, num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[ https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6833: - Assignee: Sun Rui Extend `addPackage` so that any given R file can be sourced in the worker before functions are run. --- Key: SPARK-6833 URL: https://issues.apache.org/jira/browse/SPARK-6833 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Assignee: Sun Rui Priority: Minor Fix For: 1.5.0 Similar to how extra python files or packages can be specified (in zip / egg formats), it will be good to support the ability to add extra R files to the executors working directory. One thing that needs to be investigated is if this will just work out of the box using the spark-submit flag --files ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: executor.log driver.log BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: webui-executor.png BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Attachment: webui-slow-task.png BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m until OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8790) BlockManager.reregister cause OOM
[ https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Liu updated SPARK-8790: --- Description: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. was: We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor connot contact master. Then the executor lost connection with Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. BlockManager.reregister cause OOM - Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu Attachments: driver.log, executor.log, webui-executor.png, webui-slow-task.png We run SparkSQL 1.2.1 on Yarn. A SQL consists of 100 tasks, most them finish in 10s, but only 1 lasts for 16m. The webUI shows that the executor has running GC for 15m brfore OOM. The log shows that the executor first try to connect to master to report broadcast value, however the network is not available, so the executor lost heartbeat to Master. Then the master require the executor to reregister. When executor are reporAllBlocks to master, the network is still not so stable, so sometimes time-out. Finally, the executor OOM. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8789: --- Assignee: Apache Spark improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Assignee: Apache Spark Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8789: --- Assignee: (was: Apache Spark) improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611947#comment-14611947 ] Apache Spark commented on SPARK-8791: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7189 Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
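For context, the difficulty is that JVM arrays hash by identity, so a row holding an array, map, or nested structure needs its field contents hashed recursively. The following Scala snippet is only a sketch of that idea and is not Spark's InternalRow code; it combines element hashes with scala.util.hashing.MurmurHash3, and the sample "rows" are plain Seqs standing in for real rows.
{code}
// Sketch only: recursive, content-based hashing for fields that may contain
// arrays, sequences, or maps. Not Spark's actual InternalRow implementation.
import scala.util.hashing.MurmurHash3

object RowHashSketch {
  // Hash one field value, descending into complex types so structurally equal
  // values produce the same hash (Array.hashCode alone is identity-based).
  def hashValue(value: Any): Int = value match {
    case null         => 0
    case a: Array[_]  => MurmurHash3.orderedHash(a.toSeq.map(hashValue))
    case s: Seq[_]    => MurmurHash3.orderedHash(s.map(hashValue))
    case m: Map[_, _] => MurmurHash3.unorderedHash(m.toSeq.map { case (k, v) => (hashValue(k), hashValue(v)) })
    case other        => other.hashCode()
  }

  // Hash a whole "row" (here just a Seq of field values) by mixing field hashes in order.
  def hashRow(fields: Seq[Any]): Int = MurmurHash3.orderedHash(fields.map(hashValue))

  def main(args: Array[String]): Unit = {
    val r1 = Seq(1, Array("a", "b"), Map("k" -> 2))
    val r2 = Seq(1, Array("a", "b"), Map("k" -> 2))
    // Structurally equal rows hash the same, even though the embedded arrays
    // are different object instances.
    println(hashRow(r1) == hashRow(r2)) // true
  }
}
{code}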
[jira] [Assigned] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8791: --- Assignee: (was: Apache Spark) Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8792: --- Assignee: (was: Apache Spark) Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611832#comment-14611832 ] Vincent Warmerdam commented on SPARK-8596: -- By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611832#comment-14611832 ] Vincent Warmerdam edited comment on SPARK-8596 at 7/2/15 11:55 AM: --- By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install (added rstudio as a module) https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. was (Author: cantdutchthis): By the way, I now have scripts that install RStudio (just ran and confirmed). The code is here: https://github.com/koaning/spark-ec2/tree/rstudio-install https://github.com/koaning/spark/tree/rstudio-install When initializing with this command: ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type=c3.2xlarge --spark-ec2-git-repo=https://github.com/koaning/spark-ec2 --spark-ec2-git-branch=rstudio-install launch mysparkr I can confirm that RStudio is installed and that a correct user is added. There are two concerns: - Should we not force the user to supply the password themselves? Setting a standard password seems like a security vulnerability. - I am not sure if this gets installed on all the slave nodes. I added this module (https://github.com/koaning/spark-ec2/blob/rstudio-install/rstudio/init.sh) and we only need it on the master node. I wonder what the best way is to ensure this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2319) Number of tasks on executors becomes negative after executor failures
[ https://issues.apache.org/jira/browse/SPARK-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-2319: - Attachment: active tasks.png Using 1.4, the number of active tasks becomes negative, and Complete Tasks is larger than Total Tasks. Number of tasks on executors becomes negative after executor failures Key: SPARK-2319 URL: https://issues.apache.org/jira/browse/SPARK-2319 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Andrew Or Fix For: 1.4.0 Attachments: active tasks.png, num active tasks become negative (-16).jpg See attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
[ https://issues.apache.org/jira/browse/SPARK-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611871#comment-14611871 ] Apache Spark commented on SPARK-8789: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/7188 improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8746: - Assignee: Christian Kadner Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Assignee: Christian Kadner Priority: Trivial Labels: documentation, test Fix For: 1.5.0, 1.4.2 Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, and none of the linked mirror sites still has the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8746. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7144 [https://github.com/apache/spark/pull/7144] Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Priority: Trivial Labels: documentation, test Fix For: 1.4.2, 1.5.0 Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, and none of the linked mirror sites still has the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8790) BlockManager.reregister causes OOM
Patrick Liu created SPARK-8790: -- Summary: BlockManager.reregister causes OOM Key: SPARK-8790 URL: https://issues.apache.org/jira/browse/SPARK-8790 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Liu We run Spark SQL 1.2.1 on YARN. A SQL query consists of 100 tasks; most of them finish in 10s, but one lasts for 16m. The web UI shows that the executor has been running GC for 15m until it OOMs. The log shows that the executor first tries to connect to the master to report a broadcast value, but the network is not available, so the executor cannot contact the master. The executor then loses its connection with the master, and the master requires the executor to re-register. When the executor reports all of its blocks (reportAllBlocks) to the master, the network is still unstable, so the requests sometimes time out. Finally, the executor OOMs. Please take a look. Attached is the detailed log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8789) improve SQLQuerySuite resilience by dropping tables in setup
Steve Loughran created SPARK-8789: - Summary: improve SQLQuerySuite resilience by dropping tables in setup Key: SPARK-8789 URL: https://issues.apache.org/jira/browse/SPARK-8789 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.0 Reporter: Steve Loughran Priority: Minor When some of the tests in {{SQLQuerySuite}} have problems, follow-up test runs fail because the tables are still present. This can be addressed by dropping the tables at startup and by adding try/finally clauses. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
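For illustration, here is a minimal, self-contained Scala sketch of the two clean-up measures the issue proposes: dropping possibly-leftover tables during setup, and wrapping each test's table usage in try/finally. It is not the actual Spark patch; the table names ("test", "jt") are made up, and the in-memory createTable/dropTableIfExists helpers stand in for sqlContext.sql("CREATE TABLE ...") and sqlContext.sql("DROP TABLE IF EXISTS ...") in the real suite.
{code}
// Sketch of the cleanup pattern: setup-time drops plus per-test try/finally.
object SqlQuerySuiteCleanupSketch {
  // In-memory stand-in for the SQL catalog used by the real test suite.
  private val existingTables = scala.collection.mutable.Set.empty[String]

  private def createTable(name: String): Unit = existingTables += name
  private def dropTableIfExists(name: String): Unit = existingTables -= name

  // Run `body`, then always drop `table`, even if `body` throws.
  private def withTable(table: String)(body: => Unit): Unit =
    try body
    finally dropTableIfExists(table)

  def main(args: Array[String]): Unit = {
    // Simulate a previous run that died mid-test and left a table behind.
    existingTables += "test"

    // Setup-time cleanup: drop anything a failed earlier run may have left.
    Seq("test", "jt").foreach(dropTableIfExists)

    // Per-test cleanup: the table is dropped even though the test body fails.
    try {
      withTable("test") {
        createTable("test")
        sys.error("simulated test failure")
      }
    } catch {
      case _: RuntimeException => () // normally reported by the test framework
    }

    assert(existingTables.isEmpty, "no tables should be left behind")
    println("cleanup pattern verified: no leftover tables")
  }
}
{code}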
[jira] [Assigned] (SPARK-8791) Make a better hashcode for InternalRow
[ https://issues.apache.org/jira/browse/SPARK-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8791: --- Assignee: Apache Spark Make a better hashcode for InternalRow -- Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Assignee: Apache Spark Priority: Minor Currently, InternalRow does not handle complex data types well when computing its hashCode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8792) Add Python API for PCA transformer
Yanbo Liang created SPARK-8792: -- Summary: Add Python API for PCA transformer Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
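For reference, the Scala ml.feature.PCA transformer that the proposed Python API would mirror is used roughly as below. This is a sketch against the 1.5-era Scala API, not the Python change itself; the sample vectors and column names are made up, and the Python wrapper would presumably expose the same k / inputCol / outputCol parameters.
{code}
// Sketch: fit a PCA model on a vector column and project it to k components.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

object PcaUsageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pca-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Toy input: three 5-dimensional vectors in a "features" column.
    val data = Seq(
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
      Vectors.dense(1.0, 2.0, 0.0, 3.0, 9.0))
    val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    val model = new PCA()
      .setInputCol("features")     // column holding the input vectors
      .setOutputCol("pcaFeatures") // column that will hold the projected vectors
      .setK(3)                     // keep the top 3 principal components
      .fit(df)

    model.transform(df).select("pcaFeatures").show()
    sc.stop()
  }
}
{code}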
[jira] [Commented] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612002#comment-14612002 ] Apache Spark commented on SPARK-8792: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7190 Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8792) Add Python API for PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8792: --- Assignee: Apache Spark Add Python API for PCA transformer -- Key: SPARK-8792 URL: https://issues.apache.org/jira/browse/SPARK-8792 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Apache Spark Add Python API for PCA transformer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
Diana Carroll created SPARK-8793: Summary: error/warning with pyspark WholeTextFiles.first Key: SPARK-8793 URL: https://issues.apache.org/jira/browse/SPARK-8793 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Diana Carroll Priority: Minor Attachments: wholefilesbug.txt In Spark 1.3.0, calling first() on sc.wholeTextFiles() does not work correctly in PySpark. It works fine in Scala. I created a directory with two tiny, simple text files. This works: {code}sc.wholeTextFiles("testdata").collect(){code} This doesn't: {code}sc.wholeTextFiles("testdata").first(){code} The main error message is: {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main process() File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft while taken < left: ImportError: No module named iter {code} I will attach the full stack trace to the JIRA. I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
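For comparison, the Scala equivalent that the reporter says works fine looks roughly like the sketch below: wholeTextFiles returns an RDD of (fileName, fileContent) pairs, so first() yields one such pair. The "testdata" path is just the reporter's example directory, not a real dataset.
{code}
// Sketch of the working Scala counterpart to the failing PySpark calls.
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files").setMaster("local[2]"))

    val files = sc.wholeTextFiles("testdata")   // RDD[(String, String)]: (path, content)
    println(files.collect().length)             // number of files in the directory
    val (firstPath, firstContent) = files.first()
    println(s"$firstPath -> ${firstContent.take(40)}")

    sc.stop()
  }
}
{code}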
[jira] [Updated] (SPARK-8793) error/warning with pyspark WholeTextFiles.first
[ https://issues.apache.org/jira/browse/SPARK-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-8793: - Attachment: wholefilesbug.txt error/warning with pyspark WholeTextFiles.first --- Key: SPARK-8793 URL: https://issues.apache.org/jira/browse/SPARK-8793 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Diana Carroll Priority: Minor Attachments: wholefilesbug.txt In Spark 1.3.0, calling first() on sc.wholeTextFiles() does not work correctly in PySpark. It works fine in Scala. I created a directory with two tiny, simple text files. This works: {code}sc.wholeTextFiles("testdata").collect(){code} This doesn't: {code}sc.wholeTextFiles("testdata").first(){code} The main error message is: {code}15/07/02 08:01:38 ERROR executor.Executor: Exception in task 0.0 in stage 12.0 (TID 12) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/lib/spark/python/pyspark/worker.py", line 101, in main process() File "/usr/lib/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/lib/spark/python/pyspark/serializers.py", line 236, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/usr/lib/spark/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft while taken < left: ImportError: No module named iter {code} I will attach the full stack trace to the JIRA. I'm using CentOS 6.6 with CDH 5.4.3 (Spark 1.3.0). Tested in both Python 2.6 and 2.7, same results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org