[jira] [Commented] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602464#comment-14602464
 ] 

Apache Spark commented on SPARK-8652:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7032

 PySpark tests sometimes forget to check return status of doctest.testmod(), 
 masking failing tests
 -

 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 Several PySpark files call {{doctest.testmod()}} in order to run doctests, 
 but forget to check its return status. As a result, failures will not be 
 automatically detected by our test runner script, creating the potential for 
 bugs to slip through.
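
For reference, the kind of check involved is small; a minimal sketch of the pattern (illustrative only, not the contents of the linked pull request):

{code}
import doctest
import sys

if __name__ == "__main__":
    # testmod() returns (failure_count, test_count); exiting non-zero on
    # failure lets the test runner script detect doctest failures.
    (failure_count, test_count) = doctest.testmod()
    if failure_count:
        sys.exit(-1)
{code}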






[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-26 Thread Tao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602468#comment-14602468
 ] 

Tao Li commented on SPARK-8332:
---

I found that BigDecimalDeserializer extends StdDeserializer from the 
jackson-databind project. From jackson-databind 2.3 onwards, StdDeserializer 
has the method handledType(), but in jackson-databind 2.2 and earlier it does 
not.

In my environment (hadoop 2.3.0-cdh5.0.0), the jackson-databind on the 
classpath is /usr/lib/hadoop-mapreduce/.//jackson-databind-2.2.3.jar, so 
handledType() is missing and the NoSuchMethodError is thrown.

 NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 --

 Key: SPARK-8332
 URL: https://issues.apache.org/jira/browse/SPARK-8332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: spark 1.4  hadoop 2.3.0-cdh5.0.0
Reporter: Tao Li
Priority: Critical
  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson

 I compiled the new Spark 1.4.0 release.
 But when I run a simple WordCount demo, it throws a NoSuchMethodError:
 {code}
 java.lang.NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 {code}
 I found out that the default fasterxml.jackson.version is 2.4.4.
 Is there something wrong or a conflict with the jackson version?
 Or does some project Maven dependency pull in the wrong version of jackson?






[jira] [Created] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8652:
-

 Summary: PySpark tests sometimes forget to check return status of 
doctest.testmod(), masking failing tests
 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker


Several PySpark files call {{doctest.testmod()}} in order to run doctests, but 
forget to check its return status. As a result, failures will not be 
automatically detected by our test runner script, creating the potential for 
bugs to slip through.






[jira] [Assigned] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8652:
---

Assignee: Apache Spark  (was: Josh Rosen)

 PySpark tests sometimes forget to check return status of doctest.testmod(), 
 masking failing tests
 -

 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Blocker

 Several PySpark files call {{doctest.testmod()}} in order to run doctests, 
 but forget to check its return status. As a result, failures will not be 
 automatically detected by our test runner script, creating the potential for 
 bugs to slip through.






[jira] [Assigned] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8652:
---

Assignee: Josh Rosen  (was: Apache Spark)

 PySpark tests sometimes forget to check return status of doctest.testmod(), 
 masking failing tests
 -

 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 Several PySpark files call {{doctest.testmod()}} in order to run doctests, 
 but forget to check its return status. As a result, failures will not be 
 automatically detected by our test runner script, creating the potential for 
 bugs to slip through.






[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-06-26 Thread Olivier Girardot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602477#comment-14602477
 ] 

Olivier Girardot commented on SPARK-8332:
-

OK, can you post the command line you're using to submit your job?
My conclusion was that it's difficult to build a classpath compatible between 
hive/hadoop from CDH5.x and Spark if you want to use Hive tables...
I'd like to see your classpath in order to better understand what is going on.

 NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 --

 Key: SPARK-8332
 URL: https://issues.apache.org/jira/browse/SPARK-8332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: spark 1.4  hadoop 2.3.0-cdh5.0.0
Reporter: Tao Li
Priority: Critical
  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson

 I compiled the new Spark 1.4.0 release.
 But when I run a simple WordCount demo, it throws a NoSuchMethodError:
 {code}
 java.lang.NoSuchMethodError: 
 com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
 {code}
 I found out that the default fasterxml.jackson.version is 2.4.4.
 Is there something wrong or a conflict with the jackson version?
 Or does some project Maven dependency pull in the wrong version of jackson?






[jira] [Resolved] (SPARK-8648) Documented command not working

2015-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8648.
--
Resolution: Not A Problem

I think it's easier if you put your text inline rather than in an RTF 
attachment, and generally you would use a PR to express the diff. Your JIRA 
title needs to be better as you have no detail at all in the JIRA itself. 
Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a JIRA.

However, this text is not present in {{master}}, so it is already fixed. You'll 
want to look at the latest doc source first.

 Documented command  not working
 ---

 Key: SPARK-8648
 URL: https://issues.apache.org/jira/browse/SPARK-8648
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
 Environment: Mac
Reporter: Sudhakar Thota
Priority: Trivial
 Attachments: SPARK-8648-1.rtf

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602485#comment-14602485
 ] 

Josh Rosen commented on SPARK-8623:
---

There's another report of this issue at 
https://github.com/apache/spark/pull/6679#issuecomment-115546773

 Some queries in spark-sql lead to NullPointerException when using Yarn
 --

 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin

 The following query was executed using spark-sql --master yarn-client on 
 1.5.0-SNAPSHOT:
 select * from wcs.geolite_city limit 10;
 This led to the following error:
 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:693)
   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442)
   at org.apache.hadoop.mapreduce.Job.<init>(Job.java:131)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:127)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 This does not happen in every case, i.e. some queries execute fine, and it is 
 unclear why.
 Using just spark-sql the query executes fine as well, and thus the issue 
 seems to lie in the communication with Yarn. Also, the query executes fine 
 (with yarn) in spark-shell.






[jira] [Resolved] (SPARK-8620) cleanup CodeGenContext

2015-06-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8620.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

 cleanup CodeGenContext
 --

 Key: SPARK-8620
 URL: https://issues.apache.org/jira/browse/SPARK-8620
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
 Fix For: 1.5.0









[jira] [Created] (SPARK-8653) Add constraint for Children expression for data type

2015-06-26 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8653:


 Summary: Add constraint for Children expression for data type
 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Currently, we have traits in Expression such as `ExpectsInputTypes` and also 
`checkInputDataTypes`, but we cannot convert the children expressions 
automatically unless we write new rules in `HiveTypeCoercion`.






[jira] [Assigned] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8405:
---

Assignee: Apache Spark

 Show executor logs on Web UI when Yarn log aggregation is enabled
 -

 Key: SPARK-8405
 URL: https://issues.apache.org/jira/browse/SPARK-8405
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Apache Spark
 Attachments: SparkLogError.png


 When running a Spark application in Yarn mode with Yarn log aggregation 
 enabled, the user is not able to view executor logs on the history server Web 
 UI. The only way to view the logs is through the Yarn command 
 yarn logs -applicationId appId.
 A screenshot of the error is attached. When you click an executor’s log link 
 on the Spark history server, you’ll see the error if Yarn log aggregation is 
 enabled. The log URL redirects the user to the node manager’s UI, which works 
 if the logs are still located on that node. But since log aggregation is 
 enabled, the local logs are deleted once aggregation completes.
 The logs should be available through the web UIs just like for other Hadoop 
 components such as MapReduce. For security reasons, end users may not be able 
 to log into the nodes and run the yarn logs -applicationId command. The web 
 UIs can be made viewable and exposed through the firewall if necessary.






[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602496#comment-14602496
 ] 

Apache Spark commented on SPARK-8405:
-

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7033

 Show executor logs on Web UI when Yarn log aggregation is enabled
 -

 Key: SPARK-8405
 URL: https://issues.apache.org/jira/browse/SPARK-8405
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
 Attachments: SparkLogError.png


 When running a Spark application in Yarn mode with Yarn log aggregation 
 enabled, the user is not able to view executor logs on the history server Web 
 UI. The only way to view the logs is through the Yarn command 
 yarn logs -applicationId appId.
 A screenshot of the error is attached. When you click an executor’s log link 
 on the Spark history server, you’ll see the error if Yarn log aggregation is 
 enabled. The log URL redirects the user to the node manager’s UI, which works 
 if the logs are still located on that node. But since log aggregation is 
 enabled, the local logs are deleted once aggregation completes.
 The logs should be available through the web UIs just like for other Hadoop 
 components such as MapReduce. For security reasons, end users may not be able 
 to log into the nodes and run the yarn logs -applicationId command. The web 
 UIs can be made viewable and exposed through the firewall if necessary.






[jira] [Assigned] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8405:
---

Assignee: (was: Apache Spark)

 Show executor logs on Web UI when Yarn log aggregation is enabled
 -

 Key: SPARK-8405
 URL: https://issues.apache.org/jira/browse/SPARK-8405
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
 Attachments: SparkLogError.png


 When running a Spark application in Yarn mode with Yarn log aggregation 
 enabled, the user is not able to view executor logs on the history server Web 
 UI. The only way to view the logs is through the Yarn command 
 yarn logs -applicationId appId.
 A screenshot of the error is attached. When you click an executor’s log link 
 on the Spark history server, you’ll see the error if Yarn log aggregation is 
 enabled. The log URL redirects the user to the node manager’s UI, which works 
 if the logs are still located on that node. But since log aggregation is 
 enabled, the local logs are deleted once aggregation completes.
 The logs should be available through the web UIs just like for other Hadoop 
 components such as MapReduce. For security reasons, end users may not be able 
 to log into the nodes and run the yarn logs -applicationId command. The web 
 UIs can be made viewable and exposed through the firewall if necessary.






[jira] [Commented] (SPARK-8600) Naive Bayes API for spark.ml Pipelines

2015-06-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602498#comment-14602498
 ] 

Yanbo Liang commented on SPARK-8600:


[~mengxr] Can you assign this to me?

 Naive Bayes API for spark.ml Pipelines
 --

 Key: SPARK-8600
 URL: https://issues.apache.org/jira/browse/SPARK-8600
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng

 Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the 
 existing NaiveBayes implementation under spark.mllib package. Should also 
 keep the parameter names consistent. The output columns could include both 
 the prediction and confidence scores.






[jira] [Assigned] (SPARK-8245) string function: format_number

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8245:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: format_number
 --

 Key: SPARK-8245
 URL: https://issues.apache.org/jira/browse/SPARK-8245
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 format_number(number x, int d): string
 Formats the number X to a format like '#,###,###.##', rounded to D decimal 
 places, and returns the result as a string. If D is 0, the result has no 
 decimal point or fractional part. (As of Hive 0.10.0; bug with float types 
 fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0)






[jira] [Commented] (SPARK-8245) string function: format_number

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602503#comment-14602503
 ] 

Apache Spark commented on SPARK-8245:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7034

 string function: format_number
 --

 Key: SPARK-8245
 URL: https://issues.apache.org/jira/browse/SPARK-8245
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 format_number(number x, int d): string
 Formats the number X to a format like '#,###,###.##', rounded to D decimal 
 places, and returns the result as a string. If D is 0, the result has no 
 decimal point or fractional part. (As of Hive 0.10.0; bug with float types 
 fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0)






[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8653:
---

Assignee: Apache Spark

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 Currently, we have traits in Expression such as `ExpectsInputTypes` and also 
 `checkInputDataTypes`, but we cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.






[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602502#comment-14602502
 ] 

Apache Spark commented on SPARK-8653:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7034

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, we have traits in Expression such as `ExpectsInputTypes` and also 
 `checkInputDataTypes`, but we cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.






[jira] [Assigned] (SPARK-8245) string function: format_number

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8245:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: format_number
 --

 Key: SPARK-8245
 URL: https://issues.apache.org/jira/browse/SPARK-8245
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 format_number(number x, int d): string
 Formats the number X to a format like '#,###,###.##', rounded to D decimal 
 places, and returns the result as a string. If D is 0, the result has no 
 decimal point or fractional part. (As of Hive 0.10.0; bug with float types 
 fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0)






[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8653:
---

Assignee: (was: Apache Spark)

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, we have traits in Expression such as `ExpectsInputTypes` and also 
 `checkInputDataTypes`, but we cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.






[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-26 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602520#comment-14602520
 ] 

Santiago M. Mola commented on SPARK-8636:
-

[~animeshbaranawal] Yes, I think so.

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
   private def equalNullSafe(l: Any, r: Any) = {
     if (l == null && r == null) {
       true
     } else if (l == null || r == null) {
       false
     } else {
       l == r
     }
   }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.
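
For clarity, here are the intended semantics sketched in Python, with None standing in for SQL NULL (an illustration of the expected behaviour, not the actual fix):

{code}
def case_key_matches(key, when_value):
    # Under SQL semantics, NULL is neither equal nor unequal to anything,
    # so a NULL CASE key (or a NULL WHEN value) must never match.
    if key is None or when_value is None:
        return False
    return key == when_value
{code}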






[jira] [Commented] (SPARK-8644) SparkException thrown due to Executor exceptions should include caller site in stack trace

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602531#comment-14602531
 ] 

Sean Owen commented on SPARK-8644:
--

This seems related to SPARK-8625 which asks to return the whole exception. 
Would that subsume this?

 SparkException thrown due to Executor exceptions should include caller site 
 in stack trace
 --

 Key: SPARK-8644
 URL: https://issues.apache.org/jira/browse/SPARK-8644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 Currently, when a job fails due to executor (or other) issues, the exception 
 thrown by Spark has a stack trace that stops at the DAGScheduler EventLoop, 
 which makes it hard to trace back to the user code that submitted the job. 
 It should try to include the user submission stack trace.
 Example exception today:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.lang.RuntimeException: uh-oh!
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1637)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1285)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1276)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1275)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1275)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:749)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1486)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
 {code}
 Here is the part I want to include:
 {code}
   at org.apache.spark.rdd.RDD.count(RDD.scala:1095)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply$mcJ$sp(DAGSchedulerSuite.scala:851)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851)
   at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply$mcV$sp(DAGSchedulerSuite.scala:850)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849)
   at 
 org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   

[jira] [Commented] (SPARK-8608) After initializing a DataFrame with random columns and a seed, df.show should return same value

2015-06-26 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602543#comment-14602543
 ] 

Akhil Thatipamula commented on SPARK-8608:
--

[~brkyvz] more description would be helpful.

 After initializing a DataFrame with random columns and a seed, df.show should 
 return same value
 ---

 Key: SPARK-8608
 URL: https://issues.apache.org/jira/browse/SPARK-8608
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: Burak Yavuz
Priority: Critical








[jira] [Commented] (SPARK-6945) Provide SQL tab in the Spark UI

2015-06-26 Thread an lin zeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602545#comment-14602545
 ] 

an lin zeng commented on SPARK-6945:


Hi, what content will be shown on this SQL tab? Could you give more 
information about this?

 Provide SQL tab in the Spark UI
 ---

 Key: SPARK-6945
 URL: https://issues.apache.org/jira/browse/SPARK-6945
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Web UI
Reporter: Patrick Wendell
Assignee: Andrew Or








[jira] [Assigned] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-8522:
--

Assignee: DB Tsai  (was: holdenk)

 Disable feature scaling in Linear and Logistic Regression
 -

 Key: SPARK-8522
 URL: https://issues.apache.org/jira/browse/SPARK-8522
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: DB Tsai
 Fix For: 1.5.0


 All compressed sensing applications, and some regression use cases, will get 
 better results by turning feature scaling off. However, if we implement this 
 naively by training on the dataset without any standardization, the rate of 
 convergence will not be good. Instead, this can be implemented by still 
 standardizing the training dataset but penalizing each component differently, 
 which yields effectively the same objective function but a better-conditioned 
 numerical problem. As a result, columns with high variance are penalized 
 less, and vice versa. Without this, since all the features are standardized, 
 they would all be penalized the same.
 In R, there is an option for this:
 `standardize` 
 Logical flag for x variable standardization, prior to fitting the model 
 sequence. The coefficients are always returned on the original scale. Default 
 is standardize=TRUE. If variables are in the same units already, you might 
 not wish to standardize. See details below for y standardization with 
 family=gaussian.
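
In symbols, a sketch of the equivalence described above (my own notation, not taken from the patch): with per-feature standard deviations sigma_j, training on the standardized data with a rescaled penalty reproduces the unstandardized objective:

{code}
\min_w \; L(w; \tilde{X}) + \lambda \sum_j \frac{|w_j|}{\sigma_j},
\qquad \tilde{x}_{ij} = x_{ij} / \sigma_j, \quad \beta_j = w_j / \sigma_j
\;\;\equiv\;\;
\min_\beta \; L(\beta; X) + \lambda \sum_j |\beta_j|
{code}

so columns with larger sigma_j get a smaller penalty on w_j, which is exactly the "penalize each component differently" trick.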






[jira] [Assigned] (SPARK-8226) math function: shiftrightunsigned

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8226:
---

Assignee: zhichao-li  (was: Apache Spark)

 math function: shiftrightunsigned
 -

 Key: SPARK-8226
 URL: https://issues.apache.org/jira/browse/SPARK-8226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li

 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.
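
As a worked example of these semantics (plain Python treating the argument as a 32-bit value; illustrative only, not the Spark implementation):

{code}
def shiftrightunsigned32(a, n):
    # Reinterpret the (possibly negative) 32-bit value as unsigned,
    # then perform an ordinary right shift.
    return (a & 0xFFFFFFFF) >> n

print(shiftrightunsigned32(-8, 1))   # 2147483644, same as Java's -8 >>> 1
print(shiftrightunsigned32(16, 2))   # 4
{code}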






[jira] [Commented] (SPARK-8226) math function: shiftrightunsigned

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602537#comment-14602537
 ] 

Apache Spark commented on SPARK-8226:
-

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/7035

 math function: shiftrightunsigned
 -

 Key: SPARK-8226
 URL: https://issues.apache.org/jira/browse/SPARK-8226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: zhichao-li

 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.






[jira] [Assigned] (SPARK-8226) math function: shiftrightunsigned

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8226:
---

Assignee: Apache Spark  (was: zhichao-li)

 math function: shiftrightunsigned
 -

 Key: SPARK-8226
 URL: https://issues.apache.org/jira/browse/SPARK-8226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a)   
 Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, 
 smallint and int a. Returns bigint for bigint a.






[jira] [Created] (SPARK-8654) Analysis exception when using NULL IN (...): invalid cast

2015-06-26 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-8654:
---

 Summary: Analysis exception when using NULL IN (...): invalid 
cast
 Key: SPARK-8654
 URL: https://issues.apache.org/jira/browse/SPARK-8654
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Santiago M. Mola
Priority: Minor


The following query throws an analysis exception:

{code}
SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
{code}

The exception is:

{code}
org.apache.spark.sql.AnalysisException: invalid cast from int to null;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
{code}

Here is a test that can be added to AnalysisSuite to check the issue:

{code}
  test("SPARK- regression test") {
    val plan = Project(
      Alias(In(Literal(null), Seq(Literal(1), Literal(2))), "a")() :: Nil,
      LocalRelation())
    caseInsensitiveAnalyze(plan)
  }
{code}

Note that this kind of query is a corner case, but it is still valid SQL. An 
expression such as NULL IN (...) or NULL NOT IN (...) always gives NULL as 
a result, even if the list contains NULL. So it is safe to translate these 
expressions to Literal(null) during analysis.
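
The claim can be checked against SQL's three-valued logic for IN, sketched here in Python with None standing in for NULL (illustration only):

{code}
def sql_in(value, candidates):
    # IN is TRUE if any comparison is TRUE, FALSE if all comparisons are
    # FALSE, and NULL (None) otherwise.
    if value is None:
        return None                      # NULL IN (...) is always NULL
    if any(c is not None and c == value for c in candidates):
        return True
    return None if any(c is None for c in candidates) else False

print(sql_in(None, [1, 2, 3]))       # None: never TRUE or FALSE
print(sql_in(None, [1, None, 3]))    # None, even when the list contains NULL
{code}

NOT IN simply negates this result, and the SQL NOT of NULL is still NULL, so rewriting both forms to Literal(null) preserves the semantics.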






[jira] [Resolved] (SPARK-8613) Add a param for disabling of feature scaling, default to true

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-8613.

  Resolution: Fixed
Assignee: holdenk
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

Issue resolved by pull request 7024
https://github.com/apache/spark/pull/7024


 Add a param for disabling of feature scaling, default to true
 -

 Key: SPARK-8613
 URL: https://issues.apache.org/jira/browse/SPARK-8613
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: holdenk
Assignee: holdenk
 Fix For: 1.5.0


 Add a param to disable feature scaling. Do this separately from disabling 
 scaling in any particular algorithm, in case someone wants to work on 
 logistic regression while work on linear regression is in progress.






[jira] [Issue Comment Deleted] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-8522:
---
Comment: was deleted

(was: Issue resolved by pull request 7024
[https://github.com/apache/spark/pull/7024])

 Disable feature scaling in Linear and Logistic Regression
 -

 Key: SPARK-8522
 URL: https://issues.apache.org/jira/browse/SPARK-8522
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk
 Fix For: 1.5.0


 All compressed sensing applications, and some regression use cases, will get 
 better results by turning feature scaling off. However, if we implement this 
 naively by training on the dataset without any standardization, the rate of 
 convergence will not be good. Instead, this can be implemented by still 
 standardizing the training dataset but penalizing each component differently, 
 which yields effectively the same objective function but a better-conditioned 
 numerical problem. As a result, columns with high variance are penalized 
 less, and vice versa. Without this, since all the features are standardized, 
 they would all be penalized the same.
 In R, there is an option for this:
 `standardize` 
 Logical flag for x variable standardization, prior to fitting the model 
 sequence. The coefficients are always returned on the original scale. Default 
 is standardize=TRUE. If variables are in the same units already, you might 
 not wish to standardize. See details below for y standardization with 
 family=gaussian.






[jira] [Updated] (SPARK-8601) Disable feature scaling in Linear Regression

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-8601:
---
Assignee: holdenk

 Disable feature scaling in Linear Regression
 

 Key: SPARK-8601
 URL: https://issues.apache.org/jira/browse/SPARK-8601
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: holdenk
Assignee: holdenk

 See parent task for more details.






[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602585#comment-14602585
 ] 

Sean Owen commented on SPARK-8646:
--

You're saying it doesn't work at all on YARN? I'd hope there are some unit 
tests for this, but I am not sure whether they cover this case. Do we know more 
about the likely issue here -- is something not packaging pyspark, or not 
unpacking it? CC [~lianhuiwang]

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set.log


 Running pyspark jobs results in a "no module named pyspark" error when run 
 in yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
 This is not a backwards-compatible change to Spark. Scripts that worked on 
 previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.






[jira] [Resolved] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-8522.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7024
[https://github.com/apache/spark/pull/7024]

 Disable feature scaling in Linear and Logistic Regression
 -

 Key: SPARK-8522
 URL: https://issues.apache.org/jira/browse/SPARK-8522
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk
 Fix For: 1.5.0


 All compressed sensing applications, and some regression use cases, will get 
 better results by turning feature scaling off. However, if we implement this 
 naively by training on the dataset without any standardization, the rate of 
 convergence will not be good. Instead, this can be implemented by still 
 standardizing the training dataset but penalizing each component differently, 
 which yields effectively the same objective function but a better-conditioned 
 numerical problem. As a result, columns with high variance are penalized 
 less, and vice versa. Without this, since all the features are standardized, 
 they would all be penalized the same.
 In R, there is an option for this:
 `standardize` 
 Logical flag for x variable standardization, prior to fitting the model 
 sequence. The coefficients are always returned on the original scale. Default 
 is standardize=TRUE. If variables are in the same units already, you might 
 not wish to standardize. See details below for y standardization with 
 family=gaussian.






[jira] [Reopened] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression

2015-06-26 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reopened SPARK-8522:


 Disable feature scaling in Linear and Logistic Regression
 -

 Key: SPARK-8522
 URL: https://issues.apache.org/jira/browse/SPARK-8522
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: DB Tsai
Assignee: holdenk
 Fix For: 1.5.0


 All compressed sensing applications, and some regression use cases, will get 
 better results by turning feature scaling off. However, if we implement this 
 naively by training on the dataset without any standardization, the rate of 
 convergence will not be good. Instead, this can be implemented by still 
 standardizing the training dataset but penalizing each component differently, 
 which yields effectively the same objective function but a better-conditioned 
 numerical problem. As a result, columns with high variance are penalized 
 less, and vice versa. Without this, since all the features are standardized, 
 they would all be penalized the same.
 In R, there is an option for this:
 `standardize` 
 Logical flag for x variable standardization, prior to fitting the model 
 sequence. The coefficients are always returned on the original scale. Default 
 is standardize=TRUE. If variables are in the same units already, you might 
 not wish to standardize. See details below for y standardization with 
 family=gaussian.






[jira] [Resolved] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed

2015-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8383.
--
Resolution: Cannot Reproduce

 Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
 application completed 
 -

 Key: SPARK-8383
 URL: https://issues.apache.org/jira/browse/SPARK-8383
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.3.1
 Environment: Spark1.3.1.2.3
Reporter: Irina Easterling
 Attachments: Spark_WrongLastUpdatedDate.png, 
 YARN_SparkJobCompleted.PNG


 Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
 application completed and Started Date is 2015/06/10 






[jira] [Updated] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs

2015-06-26 Thread Glenn Strycker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glenn Strycker updated SPARK-8666:
--
Description: 
I have been noticing that when checkpointing RDDs, all operations are occurring 
TWICE.

For example, when I run the following code and watch the stages...

{noformat}
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
newRDD.checkpoint
print(newRDD.count())
{noformat}

I see distinct and count operations appearing TWICE, and shuffle disk writes 
and reads (from the distinct) occurring TWICE.

My newRDD is persisted to memory, why can't the checkpoint simply save those 
partitions to disk when the first operations have completed?

  was:
I have been noticing that when checkpointing RDDs, all operations are occurring 
TWICE.

For example, when I run the following code and watch the stages...

{noformat}
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
newRDD.checkpoint
print(newRDD.count())
{noformat}

I see distinct and count operations appearing TWICE, and shuffle disk writes 
and reads (from the distinct) occurring TWICE.

My newRDD is persisted to memory, why can't the checkpoint simply save those 
partitions to disk when the first operations have completed?


 checkpointing does not take advantage of persisted/cached RDDs
 --

 Key: SPARK-8666
 URL: https://issues.apache.org/jira/browse/SPARK-8666
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker

 I have been noticing that when checkpointing RDDs, all operations are 
 occurring TWICE.
 For example, when I run the following code and watch the stages...
 {noformat}
 val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
 newRDD.checkpoint
 print(newRDD.count())
 {noformat}
 I see distinct and count operations appearing TWICE, and shuffle disk writes 
 and reads (from the distinct) occurring TWICE.
 My newRDD is persisted to memory, why can't the checkpoint simply save those 
 partitions to disk when the first operations have completed?






[jira] [Commented] (SPARK-7756) Ensure Spark runs clean on IBM Java implementation

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602899#comment-14602899
 ] 

Sean Owen commented on SPARK-7756:
--

I agree, https://github.com/apache/spark/pull/6740 should have been separate. 
The other two PRs are logically related.

 Ensure Spark runs clean on IBM Java implementation
 --

 Key: SPARK-7756
 URL: https://issues.apache.org/jira/browse/SPARK-7756
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Tim Ellison
Assignee: Tim Ellison
Priority: Minor
 Fix For: 1.4.0


 Spark should run successfully on the IBM Java implementation.  This issue is 
 to gather any minor issues seen running the tests and examples that are 
 attributable to differences in Java vendor.






[jira] [Created] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)
Premchandra Preetham Kukillaya created SPARK-8659:
-

 Summary: SQL Standard Based Hive Authorisation of Hive.13 does not 
work while pointing JDBC Application to Spark Thrift Server. 
 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya


It seems that when pointing a JDBC/ODBC driver to the Spark SQL Thrift 
service, Hive's SQL standard based authorization is not enforced, whereas SQL 
based authorization works when I point the JDBC driver to the ThriftCLIService 
provided by HiveServer2.

The problem is that user X can run a select on a table belonging to user Y.

I am using Hive 0.13.1 and Spark 1.3.1.

The Thrift server is started with:


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true






[jira] [Created] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite

2015-06-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8661:


 Summary: Update comments that contain R statements in 
ml.LinearRegressionSuite
 Key: SPARK-8661
 URL: https://issues.apache.org/jira/browse/SPARK-8661
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng


Similar to SPARK-8660, but for ml.LinearRegressionSuite: 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala.






[jira] [Created] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2

2015-06-26 Thread Chris Freeman (JIRA)
Chris Freeman created SPARK-8662:


 Summary: [SparkR] SparkSQL tests fail in R 3.2
 Key: SPARK-8662
 URL: https://issues.apache.org/jira/browse/SPARK-8662
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Chris Freeman
 Fix For: 1.4.0


SparkR tests for equality using `all.equal` on environments fail in R 3.2.

This is due to a change in how equality between environments is handled in the 
new version of R.

This should most likely not be a huge problem; we'll just have to rewrite some 
of the tests to be more fine-grained instead of testing equality on entire 
environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests

2015-06-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8652.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7032
[https://github.com/apache/spark/pull/7032]

 PySpark tests sometimes forget to check return status of doctest.testmod(), 
 masking failing tests
 -

 Key: SPARK-8652
 URL: https://issues.apache.org/jira/browse/SPARK-8652
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Fix For: 1.5.0


 Several PySpark files call {{doctest.testmod()}} in order to run doctests, 
 but forget to check its return status. As a result, failures will not be 
 automatically detected by our test runner script, creating the potential for 
 bugs to slip through.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8655) DataFrameReader#option supports more than String as value

2015-06-26 Thread Michael Nitschinger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Nitschinger updated SPARK-8655:
---
Description: 
I'm working on a custom data source, porting it from 1.3 to 1.4.

On 1.3 I could easily extend the SparkSQL imports and get access to it, which 
meant I could use custom options right away. One of those is I pass a Filter 
down to my Relation for tighter schema inference against a schemaless database.

So I would have something like:

n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String = 
null)

Since I want to move my API behind the DataFrameReader, the SQLContext is not 
available anymore, only through the RelationProvider, which I've implemented 
and it works nicely.

The only problem I have now is that while I can pass in custom options, they 
are all String typed. So I have no way to pass down my optional Filter anymore 
(since parameters is a Map[String, String]).

Would it be possible to extend the options so that more than just Strings can 
be passed in? Right now I probably need to work around that by documenting how 
people can pass in a string which I turn into a Filter, but that's somewhat 
hacky.

Note that built-in implementations like JSON or JDBC have no such problem: because 
they can access the (private) SQLContext directly, they don't need to go through 
the decoupling of the RelationProvider and can take any custom arguments they want 
on their methods.

  was:
I'm working on a custom data source, porting it from 1.3 to 1.4.

On 1.3 I could easily extend the SparkSQL imports and get access to it, which 
meant I could use custom options right away. One of those is I pass a Filter 
down to my Relation for tighter schema inference against a schemaless database.

So I would have something like:

n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String = 
null)

Since I want to move my API behind the DataFrameReader, the SQLContext is not 
available anymore, only through the RelationProvider, which I've implemented 
and it works nicely.

The only problem I have now is that while I can pass in custom options, they 
are all String typed. So I have no way to pass down my optional Filter anymore 
(since parameters is a Map[String, String]).

Would it be possible to extend the options so that more than just Strings can 
be passed in? Right now I probably need to work around that by documenting how 
people can pass in a string which I turn into a Filter, but that's somewhat 
hacky.


 DataFrameReader#option supports more than String as value
 -

 Key: SPARK-8655
 URL: https://issues.apache.org/jira/browse/SPARK-8655
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Michael Nitschinger

 I'm working on a custom data source, porting it from 1.3 to 1.4.
 On 1.3 I could easily extend the SparkSQL imports and get access to it, which 
 meant I could use custom options right away. One of those is I pass a Filter 
 down to my Relation for tighter schema inference against a schemaless 
 database.
 So I would have something like:
 n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String 
 = null)
 Since I want to move my API behind the DataFrameReader, the SQLContext is not 
 available anymore, only through the RelationProvider, which I've implemented 
 and it works nicely.
 The only problem I have now is that while I can pass in custom options, they 
 are all String typed. So I have no way to pass down my optional Filter 
 anymore (since parameters is a Map[String, String]).
 Would it be possible to extend the options so that more than just Strings can 
 be passed in? Right now I probably need to work around that by documenting 
 how people can pass in a string which I turn into a Filter, but that's 
 somewhat hacky.
 Note that built-in implementations like JSON or JDBC have no such problem: 
 because they can access the (private) SQLContext directly, they don't need to go 
 through the decoupling of the RelationProvider and can take any custom 
 arguments they want on their methods.
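
A minimal sketch of the string-only workaround described above; the com.example.n1ql source name, the option keys, and the parseFilter helper are all illustrative assumptions, not names from the ticket:

{code}
// Caller side: with the Spark 1.4 DataFrameReader every option must be a String.
val df = sqlContext.read
  .format("com.example.n1ql")            // hypothetical data source package
  .option("bucketName", "myBucket")      // hypothetical bucket name
  .option("filter", "type = 'user'")     // Filter serialized as a String
  .load()

// Provider side: parameters arrive as Map[String, String], so the Filter has to
// be re-parsed from its String form (parseFilter is a hypothetical helper).
class DefaultSource extends org.apache.spark.sql.sources.RelationProvider {
  override def createRelation(
      sqlContext: org.apache.spark.sql.SQLContext,
      parameters: Map[String, String]): org.apache.spark.sql.sources.BaseRelation = {
    val filter = parameters.get("filter").map(parseFilter)   // Option[Filter]
    val bucket = parameters.getOrElse("bucketName", "default")
    ???  // build and return the custom BaseRelation here
  }
  private def parseFilter(s: String): org.apache.spark.sql.sources.Filter = ???
}
{code}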



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8656) Spark Standalone master json API's worker number is not match web UI number

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8656:
---

Assignee: (was: Apache Spark)

 Spark Standalone master json API's worker number is not match web UI number
 ---

 Key: SPARK-8656
 URL: https://issues.apache.org/jira/browse/SPARK-8656
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: thegiive
Priority: Minor

 The Spark standalone master web UI shows the number of alive workers, the alive 
 workers' total and used cores, and the alive workers' total and used memory. 
 But the JSON API page (http://MASTERURL:8088/json) reports the worker count, 
 cores, and memory across all workers. 
 So the web UI data is not in sync with the JSON API. 
 The proper fix is to make the web UI and the JSON API report the same numbers. 
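
A sketch of the kind of aggregation implied, imagined inside the master's JSON protocol code where the WorkerInfo/WorkerState types are in scope (those names come from Spark's deploy module, not from this ticket):

{code}
// Hypothetical helper, assumed to live in org.apache.spark.deploy.master where
// WorkerInfo and WorkerState are visible; field names follow the web UI columns.
def aliveWorkerSummary(workers: Seq[WorkerInfo]): (Int, Int, Int, Int, Int) = {
  val alive = workers.filter(_.state == WorkerState.ALIVE)  // mirror the web UI
  (alive.size,                    // worker count
   alive.map(_.cores).sum,        // total cores
   alive.map(_.coresUsed).sum,    // used cores
   alive.map(_.memory).sum,       // total memory (MB)
   alive.map(_.memoryUsed).sum)   // used memory (MB)
}
{code}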



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8656) Spark Standalone master json API's worker number is not match web UI number

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8656:
---

Assignee: Apache Spark

 Spark Standalone master json API's worker number is not match web UI number
 ---

 Key: SPARK-8656
 URL: https://issues.apache.org/jira/browse/SPARK-8656
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: thegiive
Assignee: Apache Spark
Priority: Minor

 The Spark standalone master web UI shows the number of alive workers, the alive 
 workers' total and used cores, and the alive workers' total and used memory. 
 But the JSON API page (http://MASTERURL:8088/json) reports the worker count, 
 cores, and memory across all workers. 
 So the web UI data is not in sync with the JSON API. 
 The proper fix is to make the web UI and the JSON API report the same numbers. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8615) sql programming guide recommends deprecated code

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602695#comment-14602695
 ] 

Apache Spark commented on SPARK-8615:
-

User 'tijoparacka' has created a pull request for this issue:
https://github.com/apache/spark/pull/7039

 sql programming guide recommends deprecated code
 

 Key: SPARK-8615
 URL: https://issues.apache.org/jira/browse/SPARK-8615
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Gergely Svigruha
Priority: Minor

 The Spark 1.4 SQL programming guide has example code on how to use JDBC 
 tables:
 https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
 sqlContext.load("jdbc", Map(...))
 However, this code compiles with a deprecation warning that recommends doing this 
 instead:
  sqlContext.read.format("jdbc").options(Map(...)).load()
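
For reference, a self-contained form of the recommended call; the url and dbtable values below are placeholders in the spirit of the guide's example, not values taken from this ticket:

{code}
// Spark 1.4: the non-deprecated JDBC read via DataFrameReader.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url"     -> "jdbc:postgresql:dbserver",  // placeholder JDBC URL
    "dbtable" -> "schema.tablename"))         // placeholder table name
  .load()
{code}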



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8657) Fail to upload conf archive to viewfs

2015-06-26 Thread Tao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li updated SPARK-8657:
--
Description: 
When I run spark-1.4 in yarn-client mode, it throws the following exception when 
trying to upload the conf archive to viewfs:

15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
.zip -> 
viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1434370929997_191242
15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
at 
org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.init(SparkContext.scala:497)
at 
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $line3.$read$$iwC$$iwC.init(console:9)
at $line3.$read$$iwC.init(console:18)
at $line3.$read.init(console:20)
at $line3.$read$.init(console:24)
at $line3.$read$.clinit(console)
at $line3.$eval$.init(console:7)
at $line3.$eval$.clinit(console)
at $line3.$eval.$print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)

The bug is easy to fix: we should pass the correct file system object to 
addResource. A similar issue is: https://github.com/apache/spark/pull/1483. 
I will attach my bug-fix PR very soon.

The code in Client.scala that needs to be fixed:

  was:
When I run in spark-1.4 yarn-client mode, I throws the following Exception when 
trying to upload conf archive to viewfs:

15/06/26 17:56:37 INFO yarn.Client: Uploading resource 
file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661
.zip - 
viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip
15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory 
.sparkStaging/application_1434370929997_191242
15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had
oop_conf__8436284925771788661.zip, expected: viewfs://nsX/
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346)
at 
org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341)
at 
org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338)
at 

[jira] [Created] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType

2015-06-26 Thread Antonio Jesus Navarro (JIRA)
Antonio Jesus Navarro created SPARK-8658:


 Summary: AttributeReference equals method only compare name, 
exprId and dataType
 Key: SPARK-8658
 URL: https://issues.apache.org/jira/browse/SPARK-8658
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.3.1, 1.3.0
Reporter: Antonio Jesus Navarro


The AttributeReference equals method only treats objects as different when they 
differ in name, expression id or dataType. With this behavior, when I tried to do 
a transformExpressionsDown and transform the qualifiers inside 
AttributeReferences, these objects were not replaced, because the transformer 
considers them equal.

I propose that the equals method compare these fields:

name, dataType, nullable, metadata, exprId, qualifiers
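
A sketch of the proposed comparison over those fields, written as a standalone helper; the sameAttribute name is made up here, and in Catalyst the change would live in AttributeReference.equals itself:

{code}
import org.apache.spark.sql.catalyst.expressions.AttributeReference

// Hypothetical helper comparing all of the fields listed above.
def sameAttribute(a: AttributeReference, b: AttributeReference): Boolean =
  a.name == b.name &&
    a.dataType == b.dataType &&
    a.nullable == b.nullable &&
    a.metadata == b.metadata &&
    a.exprId == b.exprId &&
    a.qualifiers == b.qualifiers
{code}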



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-26 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602819#comment-14602819
 ] 

Kousuke Saruta commented on SPARK-5768:
---

[~srowen] Oh, I see. Thanks for letting me know. If 1.4.1 is released without 
another RC, I'll modify the fix version.

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Assignee: Rekha Joshi
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8302) Support heterogeneous cluster nodes on YARN

2015-06-26 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-8302.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6752
[https://github.com/apache/spark/pull/6752]

 Support heterogeneous cluster nodes on YARN
 ---

 Key: SPARK-8302
 URL: https://issues.apache.org/jira/browse/SPARK-8302
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 1.5.0
Reporter: Marcelo Vanzin
 Fix For: 1.5.0


 Some of our customers install Hadoop on different paths across the cluster. 
 When running a Spark app, this leads to a few complications because of how we 
 try to reuse the rest of Hadoop.
 Since all configuration for a Spark-on-YARN application is local, the code 
 does not have enough information about how to run things on the rest of the 
 cluster in such cases.
 To illustrate: let's say that a node's configuration says that 
 {{SPARK_DIST_CLASSPATH=/disk1/hadoop/lib/*}}. If I launch a Spark app from 
 that machine, but there's a machine on the cluster where Hadoop is actually 
 installed in {{/disk2/hadoop/lib}}, then any container launched on that node 
 will fail.
 The problem does not exist (or is much less pronounced) on standalone and 
 mesos since they require a local Spark installation and configuration.
 It would be nice if we could easily support this use case on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603094#comment-14603094
 ] 

Shivaram Venkataraman commented on SPARK-8409:
--

If you open those links it says at the top 'This post has NOT been accepted by 
the mailing list yet' -- I don't use nabble, so I can't comment on why that is 
happening. 
If you can't get the mailing lists to work, please post to Stack Overflow with the 
tags apache-spark and sparkr. JIRA is not something we use for supporting 
users in the Spark project.

  In windows cant able to read .csv or .json files using read.df()
 -

 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Windows
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical

 Hi, 
 In SparkR shell, I invoke: 
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
  header="false") 
 I have tried various filetypes (csv, txt); all fail.   
  (in sparkR of spark 1.4, e.g.) df_1 <- read.df(sqlContext, 
 "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
  source = "csv")
 RESPONSE: ERROR RBackendHandler: load on 1 failed 
 BELOW THE WHOLE RESPONSE: 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
 curMem=0, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 173.4 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
 curMem=177600, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.2 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
 NativeMethodAccessorImpl.java:-2 
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded. 
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
 java.lang.reflect.InvocationTargetException 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
 at java.lang.reflect.Method.invoke(Method.java:606) 
 at 
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
  
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
 at 
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
  
 at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
  
 at 
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  
 at java.lang.Thread.run(Thread.java:745) 
 

[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started

2015-06-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603134#comment-14603134
 ] 

Marcelo Vanzin commented on SPARK-8372:
---

bq. The log path name may also end with an attempt id

I'm not saying the single-line patch I posted is the answer; I was just 
pointing out that the current patch in master caused a regression.

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Carson Wang
Priority: Minor
 Fix For: 1.4.1, 1.5.0

 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID for an incomplete application 
 like App ID.inprogress. This app info will never disappear even after the 
 app is completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs

2015-06-26 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603175#comment-14603175
 ] 

Glenn Strycker commented on SPARK-8666:
---

I added a stackoverflow question to parallel this ticket:  
http://stackoverflow.com/questions/31078350/spark-rdd-checkpoint-on-persisted-cached-rdds-are-performing-the-dag-twice

One idea I had is that maybe I have to materialize twice?

{noformat}
// this will create the RDD and cache, when materialized
val newRDD = prevRDD.map(a => (a._1, 
1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
print(newRDD.count())

// will this now checkpoint FROM THE EXISTING CACHE IN MEMORY?
newRDD.checkpoint
print(newRDD.count())
{noformat}

 checkpointing does not take advantage of persisted/cached RDDs
 --

 Key: SPARK-8666
 URL: https://issues.apache.org/jira/browse/SPARK-8666
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker

 I have been noticing that when checkpointing RDDs, all operations are 
 occurring TWICE.
 For example, when I run the following code and watch the stages...
 {noformat}
 val newRDD = prevRDD.map(a => (a._1, 
 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
 newRDD.checkpoint
 print(newRDD.count())
 {noformat}
 I see distinct and count operations appearing TWICE, and shuffle disk writes 
 and reads (from the distinct) occurring TWICE.
 My newRDD is persisted to memory; why can't the checkpoint simply save those 
 partitions to disk when the first operations have completed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs

2015-06-26 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603189#comment-14603189
 ] 

Glenn Strycker commented on SPARK-8666:
---

Looks like this is ticket is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-8582

 checkpointing does not take advantage of persisted/cached RDDs
 --

 Key: SPARK-8666
 URL: https://issues.apache.org/jira/browse/SPARK-8666
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker

 I have been noticing that when checkpointing RDDs, all operations are 
 occurring TWICE.
 For example, when I run the following code and watch the stages...
 {noformat}
 val newRDD = prevRDD.map(a => (a._1, 
 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER)
 newRDD.checkpoint
 print(newRDD.count())
 {noformat}
 I see distinct and count operations appearing TWICE, and shuffle disk writes 
 and reads (from the distinct) occurring TWICE.
 My newRDD is persisted to memory; why can't the checkpoint simply save those 
 partitions to disk when the first operations have completed?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs

2015-06-26 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-8666:
-

 Summary: checkpointing does not take advantage of persisted/cached 
RDDs
 Key: SPARK-8666
 URL: https://issues.apache.org/jira/browse/SPARK-8666
 Project: Spark
  Issue Type: New Feature
Reporter: Glenn Strycker


I have been noticing that when checkpointing RDDs, all operations are occurring 
TWICE.

For example, when I run the following code and watch the stages...

{noformat}
val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist()
newRDD.checkpoint
print(newRDD.count())
{noformat}

I see distinct and count operations appearing TWICE, and shuffle disk writes 
and reads (from the distinct) occurring TWICE.

My newRDD is persisted to memory; why can't the checkpoint simply save those 
partitions to disk when the first operations have completed?
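
For reference, a minimal sketch of the ordering the RDD.checkpoint scaladoc recommends (persist first, call checkpoint before the first action); prevRDD is the reporter's own RDD from the snippet above, and whether the follow-up checkpoint job really reads the cached partitions is exactly what this ticket questions:

{code}
// A sketch of the documented pattern, not a fix: persist, then mark for
// checkpointing *before* the first action, then trigger a single action.
import org.apache.spark.storage.StorageLevel

val newRDD = prevRDD.map(a => (a._1, 1L))
  .distinct()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

newRDD.checkpoint()      // only marks the RDD for checkpointing; nothing runs yet
println(newRDD.count())  // job 1 computes and caches; the checkpoint job that
                         // follows is expected to read the cached partitions
{code}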



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2015-06-26 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603166#comment-14603166
 ] 

Zhan Zhang commented on SPARK-2883:
---

[~philclaridge] Please refer to the test case on trunk for how to use it. 
saveAsOrcFile/orcFile has been removed upstream.

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Critical
 Fix For: 1.4.0

 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Premchandra Preetham Kukillaya updated SPARK-8659:
--
Description: 
It seems that while pointing a JDBC/ODBC driver to the Spark SQL Thrift Service, 
Hive's SQL-standard-based authorisation feature is not working, whereas SQL-based 
authorisation works when I point the JDBC driver to the ThriftCLIService provided 
by HiveServer2. But we need to use the Spark SQL Thrift Service because we need 
to use Spark SQL with Tableau.

The problem is that user X can run select on a table belonging to user Y, even 
though permission for the table is explicitly defined.

I am using Hive 0.13.1 and Spark 1.3.1, and here are the arguments passed to Spark:


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true

  was:
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working whereas SQL 
based Authorisation works when i am pointing the JDBC Driver to 
ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift 
Service as we require  to use Spark SQL with Tableau

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true


 SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing 
 JDBC Application to Spark Thrift Server. 
 ---

 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya

 It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
 Hive's security  feature SQL based authorisation is not working whereas SQL 
 based Authorisation works when i am pointing the JDBC Driver to 
 ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift 
 Service as we require  to use Spark SQL with Tableau
 The problem is user X can do select on table belonging to user Y, though 
 permission for table is explicitly defined
 I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to 
 Spark 
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
 hostname.compute.amazonaws.com --hiveconf 
 hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  --hiveconf 
 hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
  --hiveconf hive.server2.enable.doAs=false --hiveconf 
 

[jira] [Updated] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8660:
-
Remaining Estimate: 20m
 Original Estimate: 20m

 Update comments that contain R statements in ml.logisticRegressionSuite
 ---

 Key: SPARK-8660
 URL: https://issues.apache.org/jira/browse/SPARK-8660
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Trivial
  Labels: starter
   Original Estimate: 20m
  Remaining Estimate: 20m

 We put R statements as comments in unit test. However, there are two issues:
 1. JavaDoc style /** ... */ is used instead of normal multiline comment /* 
 ... */.
 2. We put a leading * on each line. It is hard to copy & paste the commands 
 to/from R and verify the result.
 For example, in 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504
 {code}
 /**
  * Using the following R code to load the data and train the model using 
 glmnet package.
  *
  *  library(glmnet)
 *  data <- read.csv("path", header=FALSE)
 *  label = factor(data$V1)
 *  features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
 *  weights = coef(glmnet(features,label, family="binomial", alpha = 
 1.0, lambda = 6.0))
  *  weights
  * 5 x 1 sparse Matrix of class dgCMatrix
  *  s0
  * (Intercept) -0.2480643
  * data.V2  0.000
  * data.V3   .
  * data.V4   .
  * data.V5   .
  */
 {code}
 should change to
 {code}
 /*
   Using the following R code to load the data and train the model using 
 glmnet package.
  
   library(glmnet)
   data <- read.csv("path", header=FALSE)
   label = factor(data$V1)
   features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
   weights = coef(glmnet(features,label, family="binomial", alpha = 1.0, 
  lambda = 6.0))
   weights
   5 x 1 sparse Matrix of class dgCMatrix
s0
   (Intercept) -0.2480643
   data.V2  0.000
   data.V3   .
   data.V4   .
   data.V5   .
 */
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite

2015-06-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8660:


 Summary: Update comments that contain R statements in 
ml.logisticRegressionSuite
 Key: SPARK-8660
 URL: https://issues.apache.org/jira/browse/SPARK-8660
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Trivial


We put R statements as comments in unit test. However, there are two issues:

1. JavaDoc style /** ... */ is used instead of normal multiline comment /* 
... */.
2. We put a leading * on each line. It is hard to copy & paste the commands 
to/from R and verify the result.

For example, in 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504

{code}
/**
 * Using the following R code to load the data and train the model using 
glmnet package.
 *
 *  library(glmnet)
 *  data <- read.csv("path", header=FALSE)
 *  label = factor(data$V1)
 *  features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
 *  weights = coef(glmnet(features,label, family="binomial", alpha = 1.0, 
lambda = 6.0))
 *  weights
 * 5 x 1 sparse Matrix of class dgCMatrix
 *  s0
 * (Intercept) -0.2480643
 * data.V2  0.000
 * data.V3   .
 * data.V4   .
 * data.V5   .
 */
{code}

should change to

{code}
/*
  Using the following R code to load the data and train the model using 
glmnet package.
 
  library(glmnet)
  data <- read.csv("path", header=FALSE)
  label = factor(data$V1)
  features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
  weights = coef(glmnet(features,label, family="binomial", alpha = 1.0, 
lambda = 6.0))
  weights

  5 x 1 sparse Matrix of class dgCMatrix
   s0
  (Intercept) -0.2480643
  data.V2  0.000
  data.V3   .
  data.V4   .
  data.V5   .
*/
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Premchandra Preetham Kukillaya updated SPARK-8659:
--
Description: 
It seems that while pointing a JDBC/ODBC driver to the Spark SQL Thrift Service, 
Hive's SQL-standard-based authorisation feature is not working. It ignores the 
security settings passed through the command line. The command-line arguments 
are given below for reference.

The problem is that user X can run select on a table belonging to user Y, even 
though permission for the table is explicitly defined, and this is a data 
security risk.

I am using Hive 0.13.1 and Spark 1.3.1, and here is the list of arguments passed 
to the Spark SQL Thrift Server.


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive

  was:
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working. It ignores the 
security settings passed through the command line. The arguments for command 
line is given below for reference

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark 


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive


 SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing 
 JDBC Application to Spark Thrift Server. 
 ---

 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya

 It seems that while pointing a JDBC/ODBC driver to the Spark SQL Thrift Service, 
 Hive's SQL-standard-based authorisation feature is not working. It ignores 
 the security settings passed through the command line. The command-line 
 arguments are given below for reference.
 The problem is that user X can run select on a table belonging to user Y, even 
 though permission for the table is explicitly defined, and this is a data 
 security risk.
 I am using Hive 0.13.1 and Spark 1.3.1, and here is the list of arguments passed 
 to the Spark SQL Thrift Server.
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
 hostname.compute.amazonaws.com --hiveconf 
 hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  --hiveconf 
 hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
  --hiveconf hive.server2.enable.doAs=false --hiveconf 
 hive.security.authorization.enabled=true --hiveconf 
 javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
  --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
 --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
 javax.jdo.option.ConnectionPassword=hive



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8661:
-
Remaining Estimate: 20m
 Original Estimate: 20m

 Update comments that contain R statements in ml.LinearRegressionSuite
 -

 Key: SPARK-8661
 URL: https://issues.apache.org/jira/browse/SPARK-8661
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
  Labels: starter
   Original Estimate: 20m
  Remaining Estimate: 20m

 Similar to SPARK-8660, but for ml.LinearRegressionSuite: 
 https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8559) Support association rule generation in FPGrowth

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8559:
-
Shepherd: Xiangrui Meng

 Support association rule generation in FPGrowth
 ---

 Key: SPARK-8559
 URL: https://issues.apache.org/jira/browse/SPARK-8559
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guangwen Liu
Assignee: Feynman Liang

 It will be more useful and practical to include the association rule 
 generation part for real applications, even though it is not hard for a user to 
 derive association rules from the frequent itemsets (with their frequencies) 
 output by FP-growth.
 However, how to generate association rules in an efficient way is not widely 
 reported.
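
For context, a minimal, non-distributed sketch of what rule generation from FP-growth output can look like, restricted to single-item consequents; the generateRules name and the tiny freq example are illustrative assumptions, not the MLlib API this ticket proposes:

{code}
// freq maps each frequent itemset to its absolute support count (as produced by
// FP-growth); a rule antecedent => consequent is kept if its confidence
// support(itemset) / support(antecedent) meets the threshold.
def generateRules(
    freq: Map[Set[String], Long],
    minConfidence: Double): Seq[(Set[String], String, Double)] =
  for {
    (itemset, support) <- freq.toSeq
    if itemset.size > 1
    consequent <- itemset.toSeq
    antecedent = itemset - consequent
    antecedentSupport <- freq.get(antecedent).toSeq
    confidence = support.toDouble / antecedentSupport
    if confidence >= minConfidence
  } yield (antecedent, consequent, confidence)

// Example: {a}:4, {b}:3, {a,b}:3 yields {a} -> b (0.75) and {b} -> a (1.0).
val freq = Map(Set("a") -> 4L, Set("b") -> 3L, Set("a", "b") -> 3L)
generateRules(freq, 0.7).foreach(println)
{code}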



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8664) Add PCA transformer

2015-06-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603043#comment-14603043
 ] 

Yanbo Liang commented on SPARK-8664:


[~mengxr] I am already working on it, please assign it to me.

 Add PCA transformer
 ---

 Key: SPARK-8664
 URL: https://issues.apache.org/jira/browse/SPARK-8664
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.5.0
Reporter: Yanbo Liang

 Add PCA transformer for ML pipeline



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-26 Thread Arun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603053#comment-14603053
 ] 

Arun commented on SPARK-8409:
-

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-row-bind-two-data-frames-in-SparkR-td23502.html
  
 This is the link I posted

  In windows cant able to read .csv or .json files using read.df()
 -

 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Windows
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical

 Hi, 
 In SparkR shell, I invoke: 
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
  header="false") 
 I have tried various filetypes (csv, txt); all fail.   
  (in sparkR of spark 1.4, e.g.) df_1 <- read.df(sqlContext, 
 "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
  source = "csv")
 RESPONSE: ERROR RBackendHandler: load on 1 failed 
 BELOW THE WHOLE RESPONSE: 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
 curMem=0, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 173.4 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
 curMem=177600, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.2 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
 NativeMethodAccessorImpl.java:-2 
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded. 
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
 java.lang.reflect.InvocationTargetException 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
 at java.lang.reflect.Method.invoke(Method.java:606) 
 at 
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
  
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
 at 
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
  
 at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
  
 at 
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  
 at java.lang.Thread.run(Thread.java:745) 
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
 at 
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
  
 

[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException

2015-06-26 Thread Josiah Samuel Sathiadass (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602993#comment-14602993
 ] 

Josiah Samuel Sathiadass commented on SPARK-8410:
-

I captured some logs from two servers (server 1 with the issue, server 2 without 
the issue).

The text below is collected from the file 
~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml

Server 1 (looks for the respective jars in the local-m2-cache):

 entries from a machine where the problem is 

<module organisation="org.codehaus.groovy" name="groovy-all">
  <revision name="2.1.6" status="release"
    pubdate="20150618023837" resolver="local-m2-cache" artresolver="local-m2-cache"
    homepage="http://groovy.codehaus.org/" downloaded="false" searched="false"
    default="false" conf="compile, master(*), runtime, compile(*), runtime(*), master"
    position="52">
    <license name="The Apache Software License, Version 2.0"
      url="http://www.apache.org/licenses/LICENSE-2.0.txt"/>
    <metadata-artifact status="no" details="" size="5591" time="0"
      location="/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml"
      searched="false"
      original-local-location="/home/joe/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom"
      origin-is-local="true"
      origin-location="file:/home/joe/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom"/>
    <caller organisation="org.apache.hive" name="hive-exec"
      conf="default, compile, runtime, master" rev="2.1.6"
      rev-constraint-default="2.1.6" rev-constraint-dynamic="2.1.6" callerrev="0.13.1"/>
    <artifacts>
      <artifact name="groovy-all" type="jar" ext="jar" status="failed"
        details="missing artifact" size="0" time="0"/>
    </artifacts>
  </revision>
</module>

 entries from a machine where the problem is 




Server 2 (looks for the respective jars in central):

 entries from a machine where it works 
<module organisation="org.codehaus.groovy" name="groovy-all">
  <revision name="2.1.6" status="release"
    pubdate="20130709121712" resolver="central" artresolver="central"
    homepage="http://groovy.codehaus.org/" downloaded="false" searched="false"
    default="false" conf="compile, master(*), runtime, compile(*), runtime(*), master"
    position="52">
    <license name="The Apache Software License, Version 2.0"
      url="http://www.apache.org/licenses/LICENSE-2.0.txt"/>
    <metadata-artifact status="no" details="" size="5591" time="0"
      location="/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml"
      searched="false"
      original-local-location="/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml.original"
      origin-is-local="false"
      origin-location="https://repo1.maven.org/maven2/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom"/>
    <caller organisation="org.apache.hive" name="hive-exec"
      conf="default, compile, runtime, master" rev="2.1.6"
      rev-constraint-default="2.1.6" rev-constraint-dynamic="2.1.6" callerrev="0.13.1"/>
    <artifacts>
      <artifact name="groovy-all" type="jar" ext="jar" status="no" details=""
        size="6377448" time="0"
        location="/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/jars/groovy-all-2.1.6.jar">
        <origin-location is-local="false"
          location="https://repo1.maven.org/maven2/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar"/>
      </artifact>
    </artifacts>
  </revision>
</module>
 entries from a machine where it works 



I need some help from the community to identify where Ivy picks up the settings 
used to populate this file 
(~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml) so that I can 
narrow down the problem.

Thanks,
Joe.

 Hive VersionsSuite RuntimeException
 ---

 Key: SPARK-8410
 URL: https://issues.apache.org/jira/browse/SPARK-8410
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
 Environment: IBM Power system - P7
 running Ubuntu 14.04LE
Reporter: Josiah Samuel Sathiadass
Assignee: Burak Yavuz
Priority: Minor

 While testing Spark Project Hive, there are RuntimeExceptions as follows,
 VersionsSuite:
 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed: 
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: 
 org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: 
 asm#asm;3.2!asm.jar]
   at 
 

[jira] [Comment Edited] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602987#comment-14602987
 ] 

yuemeng edited comment on SPARK-8663 at 6/26/15 2:53 PM:
-

The driver log looks like:
15/06/25 23:16:16 INFO DAGScheduler: Executor lost: 1 (epoch 1)
15/06/25 23:16:16 INFO BlockManagerMasterActor: Trying to remove executor 1 
from BlockManagerMaster.
15/06/25 23:16:16 INFO BlockManagerMasterActor: Removing block manager 
BlockManagerId(1, 9.96.1.223, 23577)
15/06/25 23:16:16 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
15/06/25 23:16:45 ERROR ContextCleaner: Error cleaning broadcast 3512
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:199)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:159)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:150)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:150)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
15/06/25 23:16:45 INFO DAGScheduler: Stopping DAGScheduler
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Shutting down all executors
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
15/06/25 23:16:45 INFO DAGScheduler: Job 3555 failed: count at console:18, 
took 29.811052 s
15/06/25 23:16:45 INFO DAGScheduler: Job 3539 failed: count at console:18, 
took 30.089501 s
15/06/25 23:16:45 INFO DAGScheduler: Job 3553 failed: count at console:18, 
took 29.842839 s
15/06/25 23:16:45 WARN BlockManagerMaster: Failed to remove broadcast 3512 with 
removeFromMaster = true - Ask timed out on 
[Actor[akka.tcp://sparkExecutor@DS-222:23604/user/BlockManagerActor1#1981879442]]
 after [3 ms]}
calcFunc start
calcFunc start
15/06/25 23:16:45 INFO DAGScheduler: Job 3554 failed: count at console:18, 
took 29.827635 s
15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18
15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Stopped
15/06/25 23:16:45 WARN Remoting: Tried to associate with unreachable remote 
address [akka.tcp://sparkYarnAM@DS-222:23129]. Address is now gated for 5000 
ms, all messages to this address will be delivered to dead letters. Reason: 
Connection refused: DS-222/9.96.1.222:23129
15/06/25 23:16:46 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
stopped!
15/06/25 23:16:46 INFO MemoryStore: MemoryStore cleared
15/06/25 23:16:46 INFO BlockManager: BlockManager stopped
15/06/25 23:16:46 INFO BlockManagerMaster: BlockManagerMaster stopped
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/06/25 23:16:46 INFO SparkContext: Successfully stopped SparkContext
And the driver thread dump looks like this:
ForkJoinPool-3-worker-3 daemon prio=10 tid=0x00991000 nid=0x3dab 
waiting on condition [0x7fc9507dd000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to 

[jira] [Assigned] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8662:
---

Assignee: Apache Spark

 [SparkR] SparkSQL tests fail in R 3.2
 -

 Key: SPARK-8662
 URL: https://issues.apache.org/jira/browse/SPARK-8662
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Chris Freeman
Assignee: Apache Spark
 Fix For: 1.4.0


 SparkR tests for equality using `all.equal` on environments fail in R 3.2.
 This is due to a change in how equality between environments is handled in 
 the new version of R.
 This should most likely not be a huge problem, we'll just have to rewrite 
 some of the tests to be more fine-grained instead of testing equality on 
 entire environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603030#comment-14603030
 ] 

Shivaram Venkataraman commented on SPARK-8409:
--

I don't see your email in the Spark user mailing list. I think one needs to 
subscribe to the list first to be able to post. 
You can send an email to user-subscr...@spark.apache.org to subscribe (See 
http://www.apache.org/foundation/mailinglists.html for more details).

  In windows cant able to read .csv or .json files using read.df()
 -

 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Windows
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical

 Hi, 
 In the SparkR shell, I invoke: 
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
  header="false") 
 I have tried various file types (csv, txt); all fail. 
  (in SparkR of Spark 1.4, e.g.) df_1 <- read.df(sqlContext, 
 "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
  source = "csv")
 RESPONSE: ERROR RBackendHandler: load on 1 failed 
 BELOW IS THE WHOLE RESPONSE: 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
 curMem=0, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 173.4 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
 curMem=177600, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.2 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
 NativeMethodAccessorImpl.java:-2 
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded. 
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
 java.lang.reflect.InvocationTargetException 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
 at java.lang.reflect.Method.invoke(Method.java:606) 
 at 
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
  
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
 at 
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
  
 at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
  
 at 
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  
 at java.lang.Thread.run(Thread.java:745) 
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
 not 

[jira] [Resolved] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-26 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8409.
--
Resolution: Not A Problem

  In windows cant able to read .csv or .json files using read.df()
 -

 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Windows
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical

 Hi, 
 In the SparkR shell, I invoke: 
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
  header="false") 
 I have tried various file types (csv, txt); all fail. 
  (in SparkR of Spark 1.4, e.g.) df_1 <- read.df(sqlContext, 
 "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
  source = "csv")
 RESPONSE: ERROR RBackendHandler: load on 1 failed 
 BELOW IS THE WHOLE RESPONSE: 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
 curMem=0, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 173.4 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
 curMem=177600, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.2 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
 NativeMethodAccessorImpl.java:-2 
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded. 
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
 java.lang.reflect.InvocationTargetException 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
 at java.lang.reflect.Method.invoke(Method.java:606) 
 at 
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
  
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
 at 
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
  
 at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
  
 at 
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  
 at java.lang.Thread.run(Thread.java:745) 
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
 at 
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
  
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
 at 
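
For reference, the root cause in the trace above is that the path is resolved against HDFS 
(hdfs://smalldata13.hdp:8020/...) rather than the local file system, and "csv" is not a 
built-in data source in Spark 1.4 (it needs the external spark-csv package). Below is a 
hedged sketch of the equivalent reads through the Scala DataFrame API, assuming spark-csv 
is on the classpath and using explicit file:// URIs; the paths and package version are 
illustrative only.

{code:scala}
// Hedged sketch only: the same reads via the Scala DataFrame API (Spark 1.4).
// Assumes the external com.databricks:spark-csv package is on the classpath
// (e.g. started with --packages com.databricks:spark-csv_2.10:1.0.3); paths are illustrative.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ReadLocalFiles"))
val sqlContext = new SQLContext(sc)

// JSON is built in; an explicit file:// URI keeps the path from being resolved against HDFS.
val jsonDf = sqlContext.read.format("json").load("file:///home/esten/ami/usaf.json")

// CSV needs the external spark-csv data source in Spark 1.4.
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load("file:///tmp/nycflights13.csv")

jsonDf.printSchema()
csvDf.show(5)
{code}

The same two points should carry over to read.df in SparkR: pass a file:// URI (or an HDFS 
path that actually exists) and load CSV through the spark-csv source.
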
 

[jira] [Commented] (SPARK-8521) Feature Transformers in 1.5

2015-06-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603036#comment-14603036
 ] 

Yanbo Liang commented on SPARK-8521:


Yes, I agree. I will open a jira and work on it.

 Feature Transformers in 1.5
 ---

 Key: SPARK-8521
 URL: https://issues.apache.org/jira/browse/SPARK-8521
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 This is a list of feature transformers we plan to add in Spark 1.5. Feel free 
 to propose useful transformers that are not on the list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8665) Update ALS documentation to include performance tips

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8665:
-
Remaining Estimate: 1h
 Original Estimate: 1h

 Update ALS documentation to include performance tips
 

 Key: SPARK-8665
 URL: https://issues.apache.org/jira/browse/SPARK-8665
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the new ALS implementation, users still need to deal with 
 computation/communication trade-offs. It would be nice to document this 
 clearly based on the issues on the mailing list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8665) Update ALS documentation to include performance tips

2015-06-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8665:


 Summary: Update ALS documentation to include performance tips
 Key: SPARK-8665
 URL: https://issues.apache.org/jira/browse/SPARK-8665
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


With the new ALS implementation, users still need to deal with 
computation/communication trade-offs. It would be nice to document this clearly 
based on the issues on the mailing list.
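
As a concrete illustration of those trade-offs, here is a hedged sketch of the knobs such a 
documentation section would likely cover in org.apache.spark.ml.recommendation.ALS; the 
values below are placeholders, not recommendations.

{code:scala}
// Hedged sketch of common ALS tuning knobs (org.apache.spark.ml.recommendation.ALS, Spark 1.4).
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)            // larger rank => more computation and more shuffle data per block
  .setNumUserBlocks(64)   // more blocks => smaller tasks but more communication between them
  .setNumItemBlocks(64)
  .setMaxIter(10)
  .setRegParam(0.1)
// .setCheckpointInterval(5)                      // if available in your version: truncates long lineage
// sc.setCheckpointDir("hdfs:///tmp/als-ckpt")    // checkpointing only takes effect with a checkpoint dir
{code}
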



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Premchandra Preetham Kukillaya updated SPARK-8659:
--
Description: 
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working. It ignores the 
security settings passed through the command line. The arguments for command 
line is given below for reference

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark 


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive

  was:
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working. It ignores the 
security settings passed through the command line. The arguments for command 
line is given below for reference

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark 


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true


 SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing 
 JDBC Application to Spark Thrift Server. 
 ---

 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya

 It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
 Hive's security  feature SQL based authorisation is not working. It ignores 
 the security settings passed through the command line. The arguments for 
 command line is given below for reference
 The problem is user X can do select on table belonging to user Y, though 
 permission for table is explicitly defined
 I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to 
 Spark 
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
 hostname.compute.amazonaws.com --hiveconf 
 hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  --hiveconf 
 hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
  --hiveconf hive.server2.enable.doAs=false --hiveconf 
 hive.security.authorization.enabled=true --hiveconf 
 javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
  --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
 --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
 javax.jdo.option.ConnectionPassword=hive



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Created] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)
yuemeng created SPARK-8663:
--

 Summary: Dirver will be hang if there is a job submit during 
SparkContex stop Interval
 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.1.1, 1.0.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64
Reporter: yuemeng
 Fix For: 1.2.2, 1.1.1, 1.0.0


The driver process will hang if a job is submitted during the SparkContext stop 
interval, i.e. the window between the start and the end of SparkContext.stop().
The probability of this situation is very small, but if it occurs the driver 
process never exits.
Reproduce steps:
1) Modify the source code so that SparkContext.stop() sleeps for 2s
(in my case, I made the DAGScheduler stop method sleep 2s).
2) Submit an application with code like:
object DriverThreadTest {

  def main(args: Array[String]) {

    val sconf = new SparkConf().setAppName("TestJobWaitor")
    val sc = new SparkContext(sconf)

    Thread.sleep(5000)
    val t = new Thread {
      override def run() {
        while (true) {
          try {
            val rdd = sc.parallelize(1 to 1000)
            var i = 0
            println("calcfunc start")
            while (i < 10) {
              i += 1
              rdd.count
            }
            println("calcfunc end")
          } catch {
            case e: Exception =>
              e.printStackTrace()
          }
        }
      }
    }

    t.start()

    val t2 = new Thread {
      override def run() {
        Thread.sleep(2000)
        println("stop sc thread")
        sc.stop()
        println("sc already stopped")
      }
    }
    t2.start()
  }

}
The driver will never exit.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuemeng updated SPARK-8663:
---
Affects Version/s: (was: 1.2.0)
   (was: 1.1.1)
   (was: 1.0.0)
   1.2.2
   1.3.0

 Dirver will be hang if there is a job submit during SparkContex stop Interval
 -

 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2, 1.3.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
Reporter: yuemeng
 Fix For: 1.0.0, 1.1.1, 1.2.2


 The driver process will hang if a job is submitted during the SparkContext stop 
 interval, i.e. the window between the start and the end of SparkContext.stop().
 The probability of this situation is very small, but if it occurs the driver 
 process never exits.
 Reproduce steps:
 1) Modify the source code so that SparkContext.stop() sleeps for 2s
 (in my case, I made the DAGScheduler stop method sleep 2s).
 2) Submit an application with code like:
 object DriverThreadTest {
   def main(args: Array[String]) {
     val sconf = new SparkConf().setAppName("TestJobWaitor")
     val sc = new SparkContext(sconf)
     Thread.sleep(5000)
     val t = new Thread {
       override def run() {
         while (true) {
           try {
             val rdd = sc.parallelize(1 to 1000)
             var i = 0
             println("calcfunc start")
             while (i < 10) {
               i += 1
               rdd.count
             }
             println("calcfunc end")
           } catch {
             case e: Exception =>
               e.printStackTrace()
           }
         }
       }
     }

     t.start()

     val t2 = new Thread {
       override def run() {
         Thread.sleep(2000)
         println("stop sc thread")
         sc.stop()
         println("sc already stopped")
       }
     }
     t2.start()
   }
 }
 The driver will never exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2015-06-26 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603016#comment-14603016
 ] 

Robin East commented on SPARK-3650:
---

What is the status of this issue? A user on the mailing list just ran into it. It looks 
like PR-2495 should fix the issue. Is there a version being targeted for the fix?

 Triangle Count handles reverse edges incorrectly
 

 Key: SPARK-3650
 URL: https://issues.apache.org/jira/browse/SPARK-3650
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.1.0, 1.2.0
Reporter: Joseph E. Gonzalez
Priority: Critical

 The triangle count implementation assumes that edges are aligned in a 
 canonical direction.  As stated in the documentation:
 bq. Note that the input graph should have its edges in canonical direction 
 (i.e. the `sourceId` less than `destId`)
 However, the TriangleCount algorithm does not verify that this condition holds, 
 and indeed even the unit tests exploit this behavior:
 {code:scala}
 val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
   Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
 val rawEdges = sc.parallelize(triangles, 2)
 val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
 val triangleCount = graph.triangleCount()
 val verts = triangleCount.vertices
 verts.collect().foreach { case (vid, count) =>
   if (vid == 0) {
     assert(count === 4)  // <-- Should be 2
   } else {
     assert(count === 2)  // <-- Should be 1
   }
 }
 {code}
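
A hedged workaround sketch, separate from the fix proposed in PR-2495: canonicalize and 
deduplicate the edge tuples before building the graph, so that the srcId < dstId assumption 
actually holds when triangleCount() is called. The helper name below is made up.

{code:scala}
// Hedged workaround sketch: enforce the canonical-direction assumption before triangleCount().
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

def canonicalTriangleCounts(sc: SparkContext, edges: Seq[(Long, Long)]) = {
  val canonical = edges
    .filter { case (a, b) => a != b }                      // drop self-loops
    .map { case (a, b) => if (a < b) (a, b) else (b, a) }  // orient every edge as (smaller, larger)
    .distinct                                              // drop duplicate and reverse copies
  val graph = Graph.fromEdgeTuples(sc.parallelize(canonical), defaultValue = true)
  graph.triangleCount().vertices                           // per-vertex triangle counts
}
{code}

With the test data above, vertex 0 then gets a triangle count of 2 and the other vertices 
get 1, which are the values the comments in the test say should be expected.
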



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8664) Add PCA transformer

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8664:
-
Assignee: Yanbo Liang

 Add PCA transformer
 ---

 Key: SPARK-8664
 URL: https://issues.apache.org/jira/browse/SPARK-8664
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.5.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang

 Add PCA transformer for ML pipeline



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuemeng updated SPARK-8663:
---
Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)  (was: SUSE 
Linux Enterprise Server 11 SP3  (x86_64)

 Dirver will be hang if there is a job submit during SparkContex stop Interval
 -

 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.1, 1.2.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
Reporter: yuemeng
 Fix For: 1.0.0, 1.1.1, 1.2.2


 The driver process will hang if a job is submitted during the SparkContext stop 
 interval, i.e. the window between the start and the end of SparkContext.stop().
 The probability of this situation is very small, but if it occurs the driver 
 process never exits.
 Reproduce steps:
 1) Modify the source code so that SparkContext.stop() sleeps for 2s
 (in my case, I made the DAGScheduler stop method sleep 2s).
 2) Submit an application with code like:
 object DriverThreadTest {
   def main(args: Array[String]) {
     val sconf = new SparkConf().setAppName("TestJobWaitor")
     val sc = new SparkContext(sconf)
     Thread.sleep(5000)
     val t = new Thread {
       override def run() {
         while (true) {
           try {
             val rdd = sc.parallelize(1 to 1000)
             var i = 0
             println("calcfunc start")
             while (i < 10) {
               i += 1
               rdd.count
             }
             println("calcfunc end")
           } catch {
             case e: Exception =>
               e.printStackTrace()
           }
         }
       }
     }

     t.start()

     val t2 = new Thread {
       override def run() {
         Thread.sleep(2000)
         println("stop sc thread")
         sc.stop()
         println("sc already stopped")
       }
     }
     t2.start()
   }
 }
 The driver will never exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2

2015-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602992#comment-14602992
 ] 

Apache Spark commented on SPARK-8662:
-

User 'cafreeman' has created a pull request for this issue:
https://github.com/apache/spark/pull/7045

 [SparkR] SparkSQL tests fail in R 3.2
 -

 Key: SPARK-8662
 URL: https://issues.apache.org/jira/browse/SPARK-8662
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Chris Freeman
 Fix For: 1.4.0


 SparkR tests for equality using `all.equal` on environments fail in R 3.2.
 This is due to a change in how equality between environments is handled in 
 the new version of R.
 This should most likely not be a huge problem, we'll just have to rewrite 
 some of the tests to be more fine-grained instead of testing equality on 
 entire environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2

2015-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8662:
---

Assignee: (was: Apache Spark)

 [SparkR] SparkSQL tests fail in R 3.2
 -

 Key: SPARK-8662
 URL: https://issues.apache.org/jira/browse/SPARK-8662
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Chris Freeman
 Fix For: 1.4.0


 SparkR tests for equality using `all.equal` on environments fail in R 3.2.
 This is due to a change in how equality between environments is handled in 
 the new version of R.
 This should most likely not be a huge problem, we'll just have to rewrite 
 some of the tests to be more fine-grained instead of testing equality on 
 entire environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603001#comment-14603001
 ] 

yuemeng commented on SPARK-8663:


I think the reason is:
1) eventProcessActor ! JobSubmitted(
     jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter,
     properties)
   waiter
   }  // eventProcessActor is already dead, so this message goes to the dead-letter
      // mailbox and the waiter is lost,
2) def awaitResult(): JobResult = synchronized {
     while (!_jobFinished) {
       this.wait()
     }
     return jobResult
   }  // so this loops here, waiting forever
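
To make the failure mode concrete, here is a minimal, self-contained sketch (plain Scala 
with hypothetical names, not Spark internals) of the same pattern: a waiter that blocks 
until an event loop marks its job finished, plus a stop() that shuts the loop down without 
releasing outstanding waiters.

{code:scala}
// Illustrative sketch only (plain Scala, made-up names): the hang pattern described above.
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

class ToyJobWaiter {
  private var finished = false
  def jobFinished(): Unit = synchronized { finished = true; notifyAll() }
  // Blocks forever if nobody ever calls jobFinished() -- the analogue of JobWaiter.awaitResult().
  def awaitResult(): Unit = synchronized { while (!finished) wait() }
}

class ToyEventLoop {
  private val queue = new LinkedBlockingQueue[ToyJobWaiter]()
  @volatile private var stopped = false
  private val worker = new Thread {
    override def run(): Unit =
      while (!stopped)
        Option(queue.poll(100, TimeUnit.MILLISECONDS)).foreach(_.jobFinished())
  }
  worker.start()

  // A submit that races with stop() is silently dropped once the loop exits,
  // much like a JobSubmitted message sent to an actor that has already died.
  def submit(w: ToyJobWaiter): Unit = queue.put(w)

  def stop(failPending: Boolean): Unit = {
    stopped = true
    worker.join()
    if (failPending) {
      // Defensive variant: release every waiter that is still queued.
      var w = queue.poll()
      while (w != null) { w.jobFinished(); w = queue.poll() }
    }
  }
}
{code}

With failPending = false, a waiter submitted while stop() runs never returns from 
awaitResult(), which mirrors the driver hang reported here; draining or failing pending 
jobs on shutdown avoids it.
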


 Dirver will be hang if there is a job submit during SparkContex stop Interval
 -

 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.1, 1.2.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
Reporter: yuemeng
 Fix For: 1.0.0, 1.1.1, 1.2.2


 The driver process will hang if a job is submitted during the SparkContext stop 
 interval, i.e. the window between the start and the end of SparkContext.stop().
 The probability of this situation is very small, but if it occurs the driver 
 process never exits.
 Reproduce steps:
 1) Modify the source code so that SparkContext.stop() sleeps for 2s
 (in my case, I made the DAGScheduler stop method sleep 2s).
 2) Submit an application with code like:
 object DriverThreadTest {
   def main(args: Array[String]) {
     val sconf = new SparkConf().setAppName("TestJobWaitor")
     val sc = new SparkContext(sconf)
     Thread.sleep(5000)
     val t = new Thread {
       override def run() {
         while (true) {
           try {
             val rdd = sc.parallelize(1 to 1000)
             var i = 0
             println("calcfunc start")
             while (i < 10) {
               i += 1
               rdd.count
             }
             println("calcfunc end")
           } catch {
             case e: Exception =>
               e.printStackTrace()
           }
         }
       }
     }

     t.start()

     val t2 = new Thread {
       override def run() {
         Thread.sleep(2000)
         println("stop sc thread")
         sc.stop()
         println("sc already stopped")
       }
     }
     t2.start()
   }
 }
 The driver will never exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuemeng updated SPARK-8663:
---
Fix Version/s: (was: 1.1.1)
   (was: 1.0.0)
   1.3.0

 Dirver will be hang if there is a job submit during SparkContex stop Interval
 -

 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2, 1.3.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
Reporter: yuemeng
 Fix For: 1.2.2, 1.3.0


 The driver process will hang if a job is submitted during the SparkContext stop 
 interval, i.e. the window between the start and the end of SparkContext.stop().
 The probability of this situation is very small, but if it occurs the driver 
 process never exits.
 Reproduce steps:
 1) Modify the source code so that SparkContext.stop() sleeps for 2s
 (in my case, I made the DAGScheduler stop method sleep 2s).
 2) Submit an application with code like:
 object DriverThreadTest {
   def main(args: Array[String]) {
     val sconf = new SparkConf().setAppName("TestJobWaitor")
     val sc = new SparkContext(sconf)
     Thread.sleep(5000)
     val t = new Thread {
       override def run() {
         while (true) {
           try {
             val rdd = sc.parallelize(1 to 1000)
             var i = 0
             println("calcfunc start")
             while (i < 10) {
               i += 1
               rdd.count
             }
             println("calcfunc end")
           } catch {
             case e: Exception =>
               e.printStackTrace()
           }
         }
       }
     }

     t.start()

     val t2 = new Thread {
       override def run() {
         Thread.sleep(2000)
         println("stop sc thread")
         sc.stop()
         println("sc already stopped")
       }
     }
     t2.start()
   }
 }
 The driver will never exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8664) Add PCA transformer

2015-06-26 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-8664:
--

 Summary: Add PCA transformer
 Key: SPARK-8664
 URL: https://issues.apache.org/jira/browse/SPARK-8664
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.5.0
Reporter: Yanbo Liang


Add PCA transformer for ML pipeline
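
A hedged sketch of one possible usage shape for the proposed transformer inside an ML 
pipeline: the PCA class and its setters below are assumptions (they are what this issue 
proposes to add), and internally it could delegate to MLlib's existing PCA support 
(e.g. RowMatrix.computePrincipalComponents).

{code:scala}
// Hedged sketch: a possible usage shape for an ML-pipeline PCA transformer (names are proposals).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, VectorAssembler}

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")

val pca = new PCA()            // the proposed pipeline stage
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)                     // number of principal components to keep

val pipeline = new Pipeline().setStages(Array(assembler, pca))
// val model = pipeline.fit(trainingDf)  // trainingDf is an assumed DataFrame with columns x1, x2, x3
{code}

Fitting would learn the principal components and transform would append the projected 
pcaFeatures column.
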



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Premchandra Preetham Kukillaya updated SPARK-8659:
--
Description: 
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working. It ignores the 
security settings passed through the command line. The arguments for command 
line is given below for reference

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark 


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true

  was:
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working whereas SQL 
based Authorisation works when i am pointing the JDBC Driver to 
ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift 
Service as we require  to use Spark SQL with Tableau

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark 


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true


 SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing 
 JDBC Application to Spark Thrift Server. 
 ---

 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya

 It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
 Hive's security  feature SQL based authorisation is not working. It ignores 
 the security settings passed through the command line. The arguments for 
 command line is given below for reference
 The problem is user X can do select on table belonging to user Y, though 
 permission for table is explicitly defined
 I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to 
 Spark 
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
 hostname.compute.amazonaws.com --hiveconf 
 hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  --hiveconf 
 hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
  --hiveconf hive.server2.enable.doAs=false --hiveconf 
 hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
 --hiveconf mapred.max.split.size=25600 --hiveconf 
 

[jira] [Updated] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuemeng updated SPARK-8663:
---
Target Version/s:   (was: 1.0.0, 1.1.1, 1.2.2)

 Dirver will be hang if there is a job submit during SparkContex stop Interval
 -

 Key: SPARK-8663
 URL: https://issues.apache.org/jira/browse/SPARK-8663
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2, 1.3.0
 Environment: SUSE Linux Enterprise Server 11 SP3  (x86_64)
Reporter: yuemeng
 Fix For: 1.2.2, 1.3.0


 The driver process will hang if a job is submitted during the SparkContext stop 
 interval, i.e. the window between the start and the end of SparkContext.stop().
 The probability of this situation is very small, but if it occurs the driver 
 process never exits.
 Reproduce steps:
 1) Modify the source code so that SparkContext.stop() sleeps for 2s
 (in my case, I made the DAGScheduler stop method sleep 2s).
 2) Submit an application with code like:
 object DriverThreadTest {
   def main(args: Array[String]) {
     val sconf = new SparkConf().setAppName("TestJobWaitor")
     val sc = new SparkContext(sconf)
     Thread.sleep(5000)
     val t = new Thread {
       override def run() {
         while (true) {
           try {
             val rdd = sc.parallelize(1 to 1000)
             var i = 0
             println("calcfunc start")
             while (i < 10) {
               i += 1
               rdd.count
             }
             println("calcfunc end")
           } catch {
             case e: Exception =>
               e.printStackTrace()
           }
         }
       }
     }

     t.start()

     val t2 = new Thread {
       override def run() {
         Thread.sleep(2000)
         println("stop sc thread")
         sc.stop()
         println("sc already stopped")
       }
     }
     t2.start()
   }
 }
 The driver will never exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8664) Add PCA transformer

2015-06-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8664:
-
Remaining Estimate: 24h
 Original Estimate: 24h

 Add PCA transformer
 ---

 Key: SPARK-8664
 URL: https://issues.apache.org/jira/browse/SPARK-8664
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.5.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang
   Original Estimate: 24h
  Remaining Estimate: 24h

 Add PCA transformer for ML pipeline



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.

2015-06-26 Thread Premchandra Preetham Kukillaya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Premchandra Preetham Kukillaya updated SPARK-8659:
--
Description: 
It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
Hive's security  feature SQL based authorisation is not working whereas SQL 
based Authorisation works when i am pointing the JDBC Driver to 
ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift 
Service as we require  to use Spark SQL with Tableau

The problem is user X can do select on table belonging to user Y, though 
permission for table is explicitly defined

I am using Hive .13.1 and Spark 1.3.1


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true

  was:
It seems like while pointing JDBC/ODBC Driver to Spark SQL Thrift Service 
Hive's feature SQL based authorization is not working whereas SQL based 
Authorization works when i am pointing the JDBC Driver to ThriftCLIService 
provided by HiveServer2.

The problem is user X can do select on table belonging to user Y.

I am using Hive .13.1 and Spark 1.3.1


./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
hostname.compute.amazonaws.com --hiveconf 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
 --hiveconf 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
 --hiveconf hive.server2.enable.doAs=false --hiveconf 
hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
--hiveconf mapred.max.split.size=25600 --hiveconf 
hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf 
javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
 --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver 
--hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf 
javax.jdo.option.ConnectionPassword=hive --hiveconf 
hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf 
hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true


 SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing 
 JDBC Application to Spark Thrift Server. 
 ---

 Key: SPARK-8659
 URL: https://issues.apache.org/jira/browse/SPARK-8659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Linux
Reporter: Premchandra Preetham Kukillaya

 It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the 
 Hive's security  feature SQL based authorisation is not working whereas SQL 
 based Authorisation works when i am pointing the JDBC Driver to 
 ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift 
 Service as we require  to use Spark SQL with Tableau
 The problem is user X can do select on table belonging to user Y, though 
 permission for table is explicitly defined
 I am using Hive .13.1 and Spark 1.3.1
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf 
 hostname.compute.amazonaws.com --hiveconf 
 hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
  --hiveconf 
 hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
  --hiveconf hive.server2.enable.doAs=false --hiveconf 
 hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 
 --hiveconf mapred.max.split.size=25600 --hiveconf 
 hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources 
 --hiveconf 
 

[jira] [Commented] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval

2015-06-26 Thread yuemeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602987#comment-14602987
 ] 

yuemeng commented on SPARK-8663:


The driver log looks like this:
15/06/25 23:16:16 INFO DAGScheduler: Executor lost: 1 (epoch 1)
15/06/25 23:16:16 INFO BlockManagerMasterActor: Trying to remove executor 1 
from BlockManagerMaster.
15/06/25 23:16:16 INFO BlockManagerMasterActor: Removing block manager 
BlockManagerId(1, 9.96.1.223, 23577)
15/06/25 23:16:16 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
15/06/25 23:16:45 ERROR ContextCleaner: Error cleaning broadcast 3512
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:199)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:159)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:150)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:150)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
15/06/25 23:16:45 INFO DAGScheduler: Stopping DAGScheduler
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Shutting down all executors
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Asking each executor to shut 
down
15/06/25 23:16:45 INFO DAGScheduler: Job 3555 failed: count at console:18, 
took 29.811052 s
15/06/25 23:16:45 INFO DAGScheduler: Job 3539 failed: count at console:18, 
took 30.089501 s
15/06/25 23:16:45 INFO DAGScheduler: Job 3553 failed: count at console:18, 
took 29.842839 s
15/06/25 23:16:45 WARN BlockManagerMaster: Failed to remove broadcast 3512 with 
removeFromMaster = true - Ask timed out on 
[Actor[akka.tcp://sparkExecutor@DS-222:23604/user/BlockManagerActor1#1981879442]]
 after [3 ms]}
calcFunc start
calcFunc start
15/06/25 23:16:45 INFO DAGScheduler: Job 3554 failed: count at console:18, 
took 29.827635 s
15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18
15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18
15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Stopped
15/06/25 23:16:45 WARN Remoting: Tried to associate with unreachable remote 
address [akka.tcp://sparkYarnAM@DS-222:23129]. Address is now gated for 5000 
ms, all messages to this address will be delivered to dead letters. Reason: 
Connection refused: DS-222/9.96.1.222:23129
15/06/25 23:16:46 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
stopped!
15/06/25 23:16:46 INFO MemoryStore: MemoryStore cleared
15/06/25 23:16:46 INFO BlockManager: BlockManager stopped
15/06/25 23:16:46 INFO BlockManagerMaster: BlockManagerMaster stopped
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/06/25 23:16:46 INFO SparkContext: Successfully stopped SparkContext
and the driver thread dump looks like this:
ForkJoinPool-3-worker-3 daemon prio=10 tid=0x00991000 nid=0x3dab 
waiting on condition [0x7fc9507dd000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0xfe9ea670 (a 

[jira] [Commented] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2

2015-06-26 Thread Chris Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602989#comment-14602989
 ] 

Chris Freeman commented on SPARK-8662:
--

PR here: https://github.com/apache/spark/pull/7045

 [SparkR] SparkSQL tests fail in R 3.2
 -

 Key: SPARK-8662
 URL: https://issues.apache.org/jira/browse/SPARK-8662
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
Reporter: Chris Freeman
 Fix For: 1.4.0


 SparkR tests for equality using `all.equal` on environments fail in R 3.2.
 This is due to a change in how equality between environments is handled in 
 the new version of R.
 This should most likely not be a huge problem, we'll just have to rewrite 
 some of the tests to be more fine-grained instead of testing equality on 
 entire environments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()

2015-06-26 Thread Arun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603055#comment-14603055
 ] 

Arun commented on SPARK-8409:
-

http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-td23489.html

Another link I posted

  In windows cant able to read .csv or .json files using read.df()
 -

 Key: SPARK-8409
 URL: https://issues.apache.org/jira/browse/SPARK-8409
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Windows
Affects Versions: 1.4.0
 Environment: sparkR API
Reporter: Arun
Priority: Critical

 Hi, 
 In the SparkR shell, I invoke: 
  mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", 
  header="false") 
 I have tried various file types (csv, txt); all fail. 
  For example, in SparkR on Spark 1.4: df_1 <- read.df(sqlContext, 
 "E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv",
  source = "csv")
 RESPONSE: ERROR RBackendHandler: load on 1 failed 
 BELOW THE WHOLE RESPONSE: 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with 
 curMem=0, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 173.4 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with 
 curMem=177600, maxMem=278302556 
 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.2 KB, free 265.2 MB) 
 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 
 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at 
 NativeMethodAccessorImpl.java:-2 
 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded. 
 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed 
 java.lang.reflect.InvocationTargetException 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  
 at java.lang.reflect.Method.invoke(Method.java:606) 
 at 
 org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
  
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) 
 at 
 org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) 
 at 
 io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
  
 at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
  
 at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
  
 at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) 
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) 
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
  
 at 
 io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  
 at java.lang.Thread.run(Thread.java:745) 
 Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does 
 not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json 
 at 
 

[jira] [Resolved] (SPARK-4609) Job can not finish if there is one bad slave in clusters

2015-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4609.
--
Resolution: Duplicate

 Job can not finish if there is one bad slave in clusters
 

 Key: SPARK-4609
 URL: https://issues.apache.org/jira/browse/SPARK-4609
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu

 If there is one bad machine in the cluster, its executors will keep dying (for 
 example because the disk is out of space). A task may be scheduled to this machine 
 multiple times, and then the job will fail after several failures of that one task.
 {code}
 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 
 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 
 lost)
 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 
 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 
 lost)
 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 
 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 
 lost)
 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 
 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 
 lost)
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in 
 stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 
 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure 
 (executor 63 lost)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 A task should not be scheduled to the same machine more than once. Also, if a 
 machine fails with an executor lost, it should be put on a blacklist for some 
 time and only tried again later (see the configuration sketch below).
 cc [~kayousterhout] [~matei]
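For reference, the per-task executor retry delay that already exists in the scheduler 
can soften this somewhat today. A minimal Scala sketch, assuming the 
spark.scheduler.executorTaskBlacklistTime setting (undocumented at the time, so verify 
it exists in your version); the key and the 60-second value are assumptions, not a 
recommendation:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: after a task fails on an executor, the scheduler waits the configured
// number of milliseconds before it will re-launch that same task on that same executor.
val conf = new SparkConf()
  .setAppName("executor-blacklist-timeout-sketch")
  .set("spark.scheduler.executorTaskBlacklistTime", "60000") // assumed key; 60s cool-off
val sc = new SparkContext(conf)
{code}

This only delays retries on a per-task, per-executor basis; it does not blacklist the 
whole node, which is what this report is really asking for.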



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation

2015-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603995#comment-14603995
 ] 

Sean Owen commented on SPARK-8424:
--

[~jerryshao] I don't think this umbrella JIRA is useful as its description is 
exactly the union of its two children. Let's close it and leave the children.

 Add blacklist mechanism for task scheduler and Yarn container allocation
 

 Key: SPARK-8424
 URL: https://issues.apache.org/jira/browse/SPARK-8424
 Project: Spark
  Issue Type: New Feature
  Components: Scheduler, YARN
Affects Versions: 1.4.0
Reporter: Saisai Shao

 MapReduce has long had a blacklist and graylist to exclude constantly failing 
 TaskTrackers/nodes; this is important for a large cluster, where the chance of 
 hardware and software failure grows with size. Unfortunately, the current version 
 of Spark lacks such a mechanism for blacklisting constantly failing executors/nodes. 
 The only blacklist mechanism in Spark today avoids relaunching a task on the same 
 executor on which it recently failed, within a configured time window. This proposal 
 adds a blacklist mechanism to Spark and is divided into two sub-tasks (a sketch of 
 the idea follows the description):
 1. Add a heuristic blacklist algorithm that tracks the status of executors from 
 the status of finished tasks, and enable the blacklist in task scheduling.
 2. Enable the blacklist in YARN container allocation (avoid allocating 
 containers on blacklisted hosts).
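To make sub-task 1 concrete, here is a hypothetical Scala sketch (not Spark's scheduler 
code; the class name, thresholds, and expiry are illustrative assumptions) of the kind of 
heuristic host tracker described above; its blacklisted hosts could also be handed to the 
YARN allocator for sub-task 2:

{code}
import scala.collection.mutable

// Hypothetical sketch: count recent task failures per host and treat a host as
// blacklisted once it crosses a threshold, letting the entry expire after a cool-off
// period so the host can be tried again.
class HostBlacklistSketch(maxFailures: Int = 3, expiryMs: Long = 60000L) {
  private val failures = mutable.Map.empty[String, (Int, Long)] // host -> (count, lastFailure)

  def recordFailure(host: String, now: Long): Unit = {
    val (count, _) = failures.getOrElse(host, (0, 0L))
    failures(host) = (count + 1, now)
  }

  def isBlacklisted(host: String, now: Long): Boolean =
    failures.get(host).exists { case (count, last) =>
      count >= maxFailures && (now - last) < expiryMs
    }

  // Hosts to exclude from YARN container requests (sub-task 2).
  def blacklistedHosts(now: Long): Set[String] =
    failures.keySet.filter(isBlacklisted(_, now)).toSet
}
{code}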



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8639) Instructions for executing jekyll in docs/README.md could be slightly more clear, typo in docs/api.md

2015-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8639.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

Issue resolved by pull request 7046
[https://github.com/apache/spark/pull/7046]

 Instructions for executing jekyll in docs/README.md could be slightly more 
 clear, typo in docs/api.md
 -

 Key: SPARK-8639
 URL: https://issues.apache.org/jira/browse/SPARK-8639
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Rosstin Murphy
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 In docs/README.md, the text states around line 31
 Execute 'jekyll' from the 'docs/' directory. Compiling the site with Jekyll 
 will create a directory called '_site' containing index.html as well as the 
 rest of the compiled files.
 It might be more clear if we said
 Execute 'jekyll build' from the 'docs/' directory to compile the site. 
 Compiling the site with Jekyll will create a directory called '_site' 
 containing index.html as well as the rest of the compiled files.
 In docs/api.md: Here you can API docs for Spark and its submodules.
 should be something like: Here you can read API docs for Spark and its 
 submodules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks

2015-06-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603220#comment-14603220
 ] 

Thomas Graves commented on SPARK-1476:
--

We have a lot of JIRAs about the 2G limit.  I'm going to dup this to the 
umbrella JIRA https://issues.apache.org/jira/browse/SPARK-6235

If someone thinks something is missing from that, let's add another item there.

 2GB limit in spark for blocks
 -

 Key: SPARK-1476
 URL: https://issues.apache.org/jira/browse/SPARK-1476
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
 Environment: all
Reporter: Mridul Muralidharan
Priority: Critical
 Attachments: 2g_fix_proposal.pdf


 The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
 the size of a block to 2GB.
 This has implications not just for managed blocks in use, but also for shuffle 
 blocks (memory-mapped blocks are limited to 2GB, even though the API allows 
 for a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
 This is a severe limitation when Spark is used on non-trivial 
 datasets.
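As a small editorial illustration of the limit, and of one possible direction (splitting 
one logical block across several physical buffers), here is a Scala sketch; 
ChunkedBlockSketch is a made-up name and API, not one of Spark's classes or the attached 
proposal:

{code}
import java.nio.ByteBuffer

// ByteBuffer capacities are Ints, so a single buffer can never address more than
// Int.MaxValue (~2GB) bytes. A logical block larger than that has to be backed by
// several physical buffers behind one interface, for example:
class ChunkedBlockSketch(chunks: Array[ByteBuffer]) {
  def size: Long = chunks.map(_.capacity().toLong).sum // a Long, so it may exceed 2GB

  def get(globalPos: Long): Byte = {
    var pos = globalPos
    var i = 0
    while (pos >= chunks(i).capacity()) { pos -= chunks(i).capacity(); i += 1 }
    chunks(i).get(pos.toInt) // absolute get within the selected chunk
  }
}
{code}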



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled

2015-06-26 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603269#comment-14603269
 ] 

Hari Shreedharan commented on SPARK-8405:
-

Actually I think this is a config issue on your YARN cluster. See: 
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/HBGzj_NG9_s and 
http://stackoverflow.com/questions/24076192/yarn-jobhistory-error-failed-redirect-for-container-140026075-3309-01-0

 Show executor logs on Web UI when Yarn log aggregation is enabled
 -

 Key: SPARK-8405
 URL: https://issues.apache.org/jira/browse/SPARK-8405
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
 Attachments: SparkLogError.png


 When running a Spark application in YARN mode with YARN log aggregation 
 enabled, users are not able to view executor logs on the history server Web 
 UI. The only way to view the logs is through the YARN command 
 yarn logs -applicationId appId.
 A screenshot of the error is attached. When you click an executor’s log link 
 on the Spark history server, you’ll see the error if YARN log aggregation is 
 enabled. The log URL redirects the user to the node manager’s UI. This works if 
 the logs are located on that node, but since log aggregation is enabled, the 
 local logs are deleted once aggregation is completed. 
 The logs should be available through the web UIs just like for other Hadoop 
 components such as MapReduce. For security reasons, end users may not be able to 
 log into the nodes and run the yarn logs -applicationId command, whereas the web 
 UIs can be exposed through the firewall if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN

2015-06-26 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603263#comment-14603263
 ] 

Juliet Hougland edited comment on SPARK-8646 at 6/26/15 5:35 PM:
-

Results from pi-test are uploaded in the attachment pi-test.log. Still a missing 
module error; this time it is pandas.algo.


was (Author: juliet):
Results from pu-test.log

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set.log


 Running PySpark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This is not a binary-compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8667) Improve Spark UI behavior at scale

2015-06-26 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-8667:
--

 Summary: Improve Spark UI behavior at scale
 Key: SPARK-8667
 URL: https://issues.apache.org/jira/browse/SPARK-8667
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Assignee: Shixiong Zhu


This is a parent ticket and we can create child tickets when solving specific 
issues. The main problem I would like to solve is the fact that the Spark UI 
has issues at very large scale.

The worst issue is a stage page with more than a few thousand 
tasks. In this case:
1. The page itself is very slow to load and becomes unresponsive with a huge 
number of tasks.
2. The Scala XML output can become so large that it crashes the driver program 
due to OOM for a page with a huge number of tasks.

I am not sure whether (1) is caused by JavaScript slowness or just the raw 
amount of data sent over the wire. If it is the latter, it might be possible to 
add compression to the HTTP payload to help improve load time (see the sketch 
below).

It would be nice to reproduce and investigate these issues further and create 
specific sub-tasks to improve them.
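If the problem turns out to be raw payload size, a rough Scala sketch of the compression 
idea follows. This is not Spark's UI code, and the GzipHandler package path varies across 
Jetty versions, so treat the import (and the whole approach) as an assumption to be 
verified:

{code}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.GzipHandler // assumed Jetty 7/8-era package path

// Wrap whatever handler the web UI has already installed so its responses are
// gzip-compressed on the wire before reaching the browser.
object GzipUiSketch {
  def wrap(server: Server): Unit = {
    val gzip = new GzipHandler()
    gzip.setHandler(server.getHandler)
    server.setHandler(gzip)
  }
}
{code}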




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8667) Improve Spark UI behavior at scale

2015-06-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8667:
---
Component/s: Web UI

 Improve Spark UI behavior at scale
 --

 Key: SPARK-8667
 URL: https://issues.apache.org/jira/browse/SPARK-8667
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Patrick Wendell
Assignee: Shixiong Zhu

 This is a parent ticket and we can create child tickets when solving specific 
 issues. The main problem I would like to solve is the fact that the Spark UI 
 has issues at very large scale.
 The worst issue is a stage page with more than a few thousand 
 tasks. In this case:
 1. The page itself is very slow to load and becomes unresponsive with a huge 
 number of tasks.
 2. The Scala XML output can become so large that it crashes the driver 
 program due to OOM for a page with a huge number of tasks.
 I am not sure whether (1) is caused by JavaScript slowness or just the raw 
 amount of data sent over the wire. If it is the latter, it might be possible 
 to add compression to the HTTP payload to help improve load time.
 It would be nice to reproduce and investigate these issues further and create 
 specific sub-tasks to improve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8647) Potential issues with the constant hashCode

2015-06-26 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603302#comment-14603302
 ] 

Alok Singh edited comment on SPARK-8647 at 6/26/15 5:53 PM:


Hi Xiangrui,

1. Same instances
=
In that case, why not use a Scala object to get a singleton? 
Is it because MatrixUDT is used in PySpark, which might work better with a class 
than an object? Also, in Java we would have an extra $ at the end for the 
object.

But if the goal is to have the same instance, wouldn't it be nicer to have 
hashCode be 

override def hashCode(): Int = 
org.apache.spark.mllib.linalg.MatrixUDT.hashCode()

What are your thoughts?

2. Performance
==
I think in the MatrixUDT case this will not be a problem, as there won't be many 
classes similar to MatrixUDT with a constant hashCode that is also 1994.
I was referring to 
http://java-performance.info/hashcode-method-performance-tuning/
However, if we use the solution of the Same Instances section above, we may not 
have this issue.


Summary
===
For practical purposes it won't be a performance issue, but I think it would 
be nicer from an aesthetic perspective to use the Same Instances approach, i.e. 
[org.apache.spark.mllib.linalg.hashCode()], if we can't use a Scala object.


Please suggest whether I should just change the code docs to explain the reason, 
or change the code as in point 1 above.

thanks
Alok


was (Author: aloknsingh):
Hi Xiangrui,

1.Same instances
=
   In that case, why not use the scala object to have singleton. 
Is it since MatrixUDT is used in the pyspark which might work better with class 
type than object type. Also in java we will have extra $ in the end for the 
object?

But if the goal is to have the same instance, isn't it would be nice to have 
hashCode to be 

override def hashCode():Int  = 
org.apache.spark.mllib.linalg.MatrixUDT.hashCode()

what are your thoughts?

2.Performance
==
I think in MatrixUDT case this will not be the pb, as there won't be many 
classes similar to  MatrixUDT with constant hashCode which is also 1994.
I was refering to 
http://java-performance.info/hashcode-method-performance-tuning/
However,  if we use the solution of  Same Instance section above, we may not 
have this issue.


Summary
===
for practical purpose it won't be the performance issue, but I think,  it would 
be nicer from aesthetic perspective to use the same instance section, if we 
can't use the scala object.


Please suggest, should i change just the code docs explaining the reason  or 
as per the 1. above.

thanks
Alok

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just a matter of code docs.
 The issue is with the MatrixUDT class, and matters if we decide to put instances 
 of MatrixUDT into a hash-based collection.
 The hashCode function returns a constant. Even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix class or the vector 
 class.
 If there is a reason for this code, we should document it properly 
 in the code so that others reading it understand.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
  override def equals(o: Any): Boolean = {
    o match {
      case v: MatrixUDT => true
      case _ => false
    }
  }
  override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.
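For illustration only, a minimal Scala sketch of the alternative discussed in the comment 
above: a stable, type-derived constant instead of the hard-coded 1994. MatrixUDTSketch is 
a stand-in name, not the real MatrixUDT:

{code}
// Sketch: equals still means "same type", and the hash code is derived from the class
// itself rather than a magic number, so it stays consistent with equals.
class MatrixUDTSketch {
  override def equals(o: Any): Boolean = o.isInstanceOf[MatrixUDTSketch]
  override def hashCode(): Int = classOf[MatrixUDTSketch].getName.hashCode
}
{code}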



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8647) Potential issues with the constant hashCode

2015-06-26 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603302#comment-14603302
 ] 

Alok Singh edited comment on SPARK-8647 at 6/26/15 5:53 PM:


Hi Xiangrui,

1. Same instances
=
In that case, why not use a Scala object to get a singleton? 
Is it because MatrixUDT is used in PySpark, which might work better with a class 
than an object? Also, in Java we would have an extra $ at the end for the 
object.

But if the goal is to have the same instance, wouldn't it be nicer to have 
hashCode be 

override def hashCode(): Int = 
org.apache.spark.mllib.linalg.MatrixUDT.hashCode()

What are your thoughts?

2. Performance
==
I think in the MatrixUDT case this will not be a problem, as there won't be many 
classes similar to MatrixUDT with a constant hashCode that is also 1994.
I was referring to 
http://java-performance.info/hashcode-method-performance-tuning/
However, if we use the solution of the Same Instances section above, we may not 
have this issue.


Summary
===
For practical purposes it won't be a performance issue, but I think it would 
be nicer from an aesthetic perspective to use the Same Instances approach, i.e. 
[org.apache.spark.mllib.linalg.MatrixUDT.hashCode()], if we can't use a 
Scala object.


Please suggest whether I should just change the code docs to explain the reason, 
or change the code as in point 1 above.

thanks
Alok


was (Author: aloknsingh):
Hi Xiangrui,

1.Same instances
=
   In that case, why not use the scala object to have singleton. 
Is it since MatrixUDT is used in the pyspark which might work better with class 
type than object type. Also in java we will have extra $ in the end for the 
object?

But if the goal is to have the same instance, isn't it would be nice to have 
hashCode to be 

override def hashCode():Int  = 
org.apache.spark.mllib.linalg.MatrixUDT.hashCode()

what are your thoughts?

2.Performance
==
I think in MatrixUDT case this will not be the pb, as there won't be many 
classes similar to  MatrixUDT with constant hashCode which is also 1994.
I was refering to 
http://java-performance.info/hashcode-method-performance-tuning/
However,  if we use the solution of  Same Instance section above, we may not 
have this issue.


Summary
===
for practical purpose it won't be the performance issue, but I think,  it would 
be nicer from aesthetic perspective to use the same instance section i.e 
[org.apache.spark.mllib.linalg.hashCode()], if we can't use the scala object.


Please suggest, should i change just the code docs explaining the reason  or 
as per the 1. above.

thanks
Alok

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just a matter of code docs.
 The issue is with the MatrixUDT class, and matters if we decide to put instances 
 of MatrixUDT into a hash-based collection.
 The hashCode function returns a constant. Even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix class or the vector 
 class.
 If there is a reason for this code, we should document it properly 
 in the code so that others reading it understand.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
  override def equals(o: Any): Boolean = {
    o match {
      case v: MatrixUDT => true
      case _ => false
    }
  }
  override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


