[jira] [Commented] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
[ https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602464#comment-14602464 ] Apache Spark commented on SPARK-8652: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7032 PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests - Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
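The fix pattern described in SPARK-8652 is simply to capture the result of doctest.testmod() and exit non-zero when any doctest fails, so the test runner script can detect the failure. A minimal sketch (the module layout is illustrative, not one of the actual Spark files): {code}
import doctest
import sys

if __name__ == "__main__":
    # doctest.testmod() returns a TestResults(failed, attempted) tuple; if the
    # `failed` count is ignored, the process exits 0 even when doctests fail.
    (failure_count, test_count) = doctest.testmod()
    if failure_count:
        sys.exit(-1)
{code}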
[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602468#comment-14602468 ] Tao Li commented on SPARK-8332: --- I found that BigDecimalDeserializer extends StdDeserializer from the jackson-databind project. In jackson-databind 2.3 and above, StdDeserializer has the method handledType(), but in jackson-databind 2.2 and below it does not. In my environment, hadoop 2.3.0-cdh5.0.0, the jackson-databind on the classpath is /usr/lib/hadoop-mapreduce/.//jackson-databind-2.2.3.jar, so it does not have the method handledType() and throws NoSuchMethodError. NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer -- Key: SPARK-8332 URL: https://issues.apache.org/jira/browse/SPARK-8332 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: spark 1.4 hadoop 2.3.0-cdh5.0.0 Reporter: Tao Li Priority: Critical Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson I compiled the new Spark 1.4.0 version, but when I run a simple WordCount demo it throws a NoSuchMethodError {code} java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer {code} I found out that the default fasterxml.jackson.version is 2.4.4. Is there a problem or a conflict with the Jackson version? Or does some project Maven dependency possibly pull in the wrong version of Jackson? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
Josh Rosen created SPARK-8652: - Summary: PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
[ https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8652: --- Assignee: Apache Spark (was: Josh Rosen) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests - Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
[ https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8652: --- Assignee: Josh Rosen (was: Apache Spark) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests - Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602477#comment-14602477 ] Olivier Girardot commented on SPARK-8332: - Ok, can you print the command line you're using to submit your job? My conclusion was that it's difficult to create a classpath compatible between Hive/Hadoop from CDH5.x and Spark if you want to use Hive tables... I'd like to see your classpath in order to better understand what is going on. NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer -- Key: SPARK-8332 URL: https://issues.apache.org/jira/browse/SPARK-8332 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: spark 1.4 hadoop 2.3.0-cdh5.0.0 Reporter: Tao Li Priority: Critical Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson I compiled the new Spark 1.4.0 version, but when I run a simple WordCount demo it throws a NoSuchMethodError {code} java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer {code} I found out that the default fasterxml.jackson.version is 2.4.4. Is there a problem or a conflict with the Jackson version? Or does some project Maven dependency possibly pull in the wrong version of Jackson? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8648) Documented command not working
[ https://issues.apache.org/jira/browse/SPARK-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8648. -- Resolution: Not A Problem I think it's easier if you put your text inline rather than in an RTF attachment, and generally you would use a PR to express the diff. Your JIRA title needs to be better as you have no detail at all in the JIRA itself. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first before opening a JIRA. However, this text is not present in {{master}} so is already fixed. You'll want to look at the latest doc source first. Documented command not working --- Key: SPARK-8648 URL: https://issues.apache.org/jira/browse/SPARK-8648 Project: Spark Issue Type: Documentation Components: Spark Core Environment: Mac Reporter: Sudhakar Thota Priority: Trivial Attachments: SPARK-8648-1.rtf Original Estimate: 1h Remaining Estimate: 1h -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
[ https://issues.apache.org/jira/browse/SPARK-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602485#comment-14602485 ] Josh Rosen commented on SPARK-8623: --- There's another report of this issue at https://github.com/apache/spark/pull/6679#issuecomment-115546773 Some queries in spark-sql lead to NullPointerException when using Yarn -- Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.<init>(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. Also, the query executes fine (with yarn) in spark-shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8620) cleanup CodeGenContext
[ https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8620. --- Resolution: Fixed Fix Version/s: 1.5.0 cleanup CodeGenContext -- Key: SPARK-8620 URL: https://issues.apache.org/jira/browse/SPARK-8620 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8653) Add constraint for Children expression for data type
Cheng Hao created SPARK-8653: Summary: Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, we have traits in Expression such as `ExpectsInputTypes` and also `checkInputDataTypes`, but they cannot convert the children expressions automatically; instead we have to write new rules in `HiveTypeCoercion`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8405: --- Assignee: Apache Spark Show executor logs on Web UI when Yarn log aggregation is enabled - Key: SPARK-8405 URL: https://issues.apache.org/jira/browse/SPARK-8405 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Apache Spark Attachments: SparkLogError.png When running a Spark application in Yarn mode with Yarn log aggregation enabled, the customer is not able to view executor logs on the history server Web UI. The only way for the customer to view the logs is through the Yarn command yarn logs -applicationId appId. A screenshot of the error is attached. When you click an executor’s log link on the Spark history server, you’ll see the error if Yarn log aggregation is enabled. The log URL redirects the user to the node manager’s UI. This works if the logs are located on that node, but since log aggregation is enabled, the local logs are deleted once log aggregation is completed. The logs should be available through the web UIs just like for other Hadoop components such as MapReduce. For security reasons, end users may not be able to log into the nodes and run the yarn logs -applicationId command. The web UIs can be viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602496#comment-14602496 ] Apache Spark commented on SPARK-8405: - User 'carsonwang' has created a pull request for this issue: https://github.com/apache/spark/pull/7033 Show executor logs on Web UI when Yarn log aggregation is enabled - Key: SPARK-8405 URL: https://issues.apache.org/jira/browse/SPARK-8405 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Attachments: SparkLogError.png When running a Spark application in Yarn mode with Yarn log aggregation enabled, the customer is not able to view executor logs on the history server Web UI. The only way for the customer to view the logs is through the Yarn command yarn logs -applicationId appId. A screenshot of the error is attached. When you click an executor’s log link on the Spark history server, you’ll see the error if Yarn log aggregation is enabled. The log URL redirects the user to the node manager’s UI. This works if the logs are located on that node, but since log aggregation is enabled, the local logs are deleted once log aggregation is completed. The logs should be available through the web UIs just like for other Hadoop components such as MapReduce. For security reasons, end users may not be able to log into the nodes and run the yarn logs -applicationId command. The web UIs can be viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8405: --- Assignee: (was: Apache Spark) Show executor logs on Web UI when Yarn log aggregation is enabled - Key: SPARK-8405 URL: https://issues.apache.org/jira/browse/SPARK-8405 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Attachments: SparkLogError.png When running a Spark application in Yarn mode with Yarn log aggregation enabled, the customer is not able to view executor logs on the history server Web UI. The only way for the customer to view the logs is through the Yarn command yarn logs -applicationId appId. A screenshot of the error is attached. When you click an executor’s log link on the Spark history server, you’ll see the error if Yarn log aggregation is enabled. The log URL redirects the user to the node manager’s UI. This works if the logs are located on that node, but since log aggregation is enabled, the local logs are deleted once log aggregation is completed. The logs should be available through the web UIs just like for other Hadoop components such as MapReduce. For security reasons, end users may not be able to log into the nodes and run the yarn logs -applicationId command. The web UIs can be viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602498#comment-14602498 ] Yanbo Liang commented on SPARK-8600: [~mengxr] Can you assign this to me? Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8245) string function: format_number
[ https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8245: --- Assignee: Cheng Hao (was: Apache Spark) string function: format_number -- Key: SPARK-8245 URL: https://issues.apache.org/jira/browse/SPARK-8245 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao format_number(number x, int d): string Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (As of Hive 0.10.0; bug with float types fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
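A quick illustration of the semantics described above, as a hypothetical query once format_number is available in Spark SQL (sqlContext below is assumed to be an existing SQLContext; the expected values in the comments follow the Hive behaviour quoted in the description): {code}
# Hypothetical usage; format_number is the function being added in this sub-task.
row = sqlContext.sql(
    "SELECT format_number(12345.6789, 2) AS two_dp, "
    "       format_number(12345.6789, 0) AS zero_dp").first()
# Expected per the Hive semantics above:
#   row.two_dp  == '12,345.68'  (rounded to 2 decimal places, '#,###,###.##' style)
#   row.zero_dp == '12,346'     (D = 0: no decimal point or fractional part)
{code}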
[jira] [Commented] (SPARK-8245) string function: format_number
[ https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602503#comment-14602503 ] Apache Spark commented on SPARK-8245: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7034 string function: format_number -- Key: SPARK-8245 URL: https://issues.apache.org/jira/browse/SPARK-8245 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao format_number(number x, int d): string Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (As of Hive 0.10.0; bug with float types fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8653: --- Assignee: Apache Spark Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Apache Spark Currently, we have trait in Expression like `ExpectsInputTypes` and also the `checkInputDataTypes`, but can not convert the children expressions automatically, except we write the new rules in the `HiveTypeCoercion`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602502#comment-14602502 ] Apache Spark commented on SPARK-8653: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7034 Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, we have trait in Expression like `ExpectsInputTypes` and also the `checkInputDataTypes`, but can not convert the children expressions automatically, except we write the new rules in the `HiveTypeCoercion`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8245) string function: format_number
[ https://issues.apache.org/jira/browse/SPARK-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8245: --- Assignee: Apache Spark (was: Cheng Hao) string function: format_number -- Key: SPARK-8245 URL: https://issues.apache.org/jira/browse/SPARK-8245 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark format_number(number x, int d): string Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (As of Hive 0.10.0; bug with float types fixed in Hive 0.14.0, decimal type support added in Hive 0.14.0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8653: --- Assignee: (was: Apache Spark) Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, we have trait in Expression like `ExpectsInputTypes` and also the `checkInputDataTypes`, but can not convert the children expressions automatically, except we write the new rules in the `HiveTypeCoercion`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602520#comment-14602520 ] Santiago M. Mola commented on SPARK-8636: - [~animeshbaranawal] Yes, I think so. CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter The CaseKeyWhen implementation in Spark uses the following equals implementation: {code} private def equalNullSafe(l: Any, r: Any) = { if (l == null && r == null) { true } else if (l == null || r == null) { false } else { l == r } } {code} This is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL: {code} SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END FROM DUAL; {code} And the result will be 'NULL DOES NOT MATCH'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
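To make the intended CASE KEY WHEN semantics concrete, here is a small illustrative sketch (plain Python, not Spark's implementation) of key matching in which a NULL key or a NULL WHEN value never matches: {code}
def case_key_matches(key, when_value):
    # SQL three-valued logic: NULL is neither equal nor unequal to anything,
    # so CASE <key> WHEN <value> must never take the branch if either side is NULL.
    if key is None or when_value is None:
        return False
    return key == when_value

assert case_key_matches(None, None) is False   # CASE NULL WHEN NULL ... never matches
assert case_key_matches(1, 1) is True
{code}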
[jira] [Commented] (SPARK-8644) SparkException thrown due to Executor exceptions should include caller site in stack trace
[ https://issues.apache.org/jira/browse/SPARK-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602531#comment-14602531 ] Sean Owen commented on SPARK-8644: -- This seems related to SPARK-8625 which asks to return the whole exception. Would that subsume this? SparkException thrown due to Executor exceptions should include caller site in stack trace -- Key: SPARK-8644 URL: https://issues.apache.org/jira/browse/SPARK-8644 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Currently when a job fails due to executor (or other) issues, the exception thrown by Spark has a stack trace which stops at the DAGScheduler EventLoop, which makes it hard to trace back to the user code which submitted the job. It should try to include the user submission stack trace. Example exception today: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh! at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1637) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1285) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1276) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1275) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1275) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:749) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) {code} Here is the part I want to include: {code} at org.apache.spark.rdd.RDD.count(RDD.scala:1095) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply$mcJ$sp(DAGSchedulerSuite.scala:851) at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851) at org.scalatest.Assertions$class.intercept(Assertions.scala:997) at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply$mcV$sp(DAGSchedulerSuite.scala:850) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22)
[jira] [Commented] (SPARK-8608) After initializing a DataFrame with random columns and a seed, df.show should return same value
[ https://issues.apache.org/jira/browse/SPARK-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602543#comment-14602543 ] Akhil Thatipamula commented on SPARK-8608: -- [~brkyvz] more description would be helpful. After initializing a DataFrame with random columns and a seed, df.show should return same value --- Key: SPARK-8608 URL: https://issues.apache.org/jira/browse/SPARK-8608 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0, 1.4.1 Reporter: Burak Yavuz Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6945) Provide SQL tab in the Spark UI
[ https://issues.apache.org/jira/browse/SPARK-6945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602545#comment-14602545 ] an lin zeng commented on SPARK-6945: Hi, what content will be shown on this SQL tab? Could you give more information about this? Provide SQL tab in the Spark UI --- Key: SPARK-6945 URL: https://issues.apache.org/jira/browse/SPARK-6945 Project: Spark Issue Type: Sub-task Components: SQL, Web UI Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-8522: -- Assignee: DB Tsai (was: holdenk) Disable feature scaling in Linear and Logistic Regression - Key: SPARK-8522 URL: https://issues.apache.org/jira/browse/SPARK-8522 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.5.0 All compressed sensing applications, and some of the regression use-cases, will get better results by turning feature scaling off. However, if we implement this naively by training on the dataset without doing any standardization, the rate of convergence will not be good. This can be implemented by still standardizing the training dataset but penalizing each component differently, to get effectively the same objective function but a better-conditioned numerical problem. As a result, the columns with high variances will be penalized less, and vice versa. Without this, since all the features are standardized, they will all be penalized the same. In R, there is an option for this. `standardize` Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family=gaussian. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
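For readers wondering how this looks at the API level: the switch later surfaces as a standardization parameter on the spark.ml estimators. A minimal PySpark sketch, assuming that parameter name and an existing DataFrame training_df with label/features columns (both assumptions, not details from this issue): {code}
from pyspark.ml.classification import LogisticRegression

# Assumes the feature-scaling switch is exposed as `standardization`;
# coefficients are still returned on the original scale, mirroring glmnet's
# `standardize` flag quoted in the description.
lr = LogisticRegression(maxIter=100, regParam=0.1, standardization=False)
model = lr.fit(training_df)  # training_df is a placeholder DataFrame
{code}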
[jira] [Assigned] (SPARK-8226) math function: shiftrightunsigned
[ https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8226: --- Assignee: zhichao-li (was: Apache Spark) math function: shiftrightunsigned - Key: SPARK-8226 URL: https://issues.apache.org/jira/browse/SPARK-8226 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
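To clarify the semantics being added, a short illustration (plain Python, not Spark code) of a logical right shift on a 32-bit value, where the sign bit is not propagated: {code}
def shift_right_unsigned_32(a, n):
    # Treat `a` as an unsigned 32-bit value and shift zeros in from the left;
    # this is the behaviour shiftrightunsigned gives for int inputs
    # (the bigint variant would mask with 0xFFFFFFFFFFFFFFFF instead).
    return (a & 0xFFFFFFFF) >> n

assert shift_right_unsigned_32(8, 1) == 4
assert shift_right_unsigned_32(-8, 1) == 2147483644  # not -4, unlike an arithmetic shift
{code}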
[jira] [Commented] (SPARK-8226) math function: shiftrightunsigned
[ https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602537#comment-14602537 ] Apache Spark commented on SPARK-8226: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/7035 math function: shiftrightunsigned - Key: SPARK-8226 URL: https://issues.apache.org/jira/browse/SPARK-8226 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8226) math function: shiftrightunsigned
[ https://issues.apache.org/jira/browse/SPARK-8226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8226: --- Assignee: Apache Spark (was: zhichao-li) math function: shiftrightunsigned - Key: SPARK-8226 URL: https://issues.apache.org/jira/browse/SPARK-8226 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8654) Analysis exception when using NULL IN (...): invalid cast
Santiago M. Mola created SPARK-8654: --- Summary: Analysis exception when using NULL IN (...): invalid cast Key: SPARK-8654 URL: https://issues.apache.org/jira/browse/SPARK-8654 Project: Spark Issue Type: Bug Components: SQL Reporter: Santiago M. Mola Priority: Minor The following query throws an analysis exception: {code} SELECT * FROM t WHERE NULL NOT IN (1, 2, 3); {code} The exception is: {code} org.apache.spark.sql.AnalysisException: invalid cast from int to null; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52) {code} Here is a test that can be added to AnalysisSuite to check the issue: {code} test("SPARK-8654 regression test") { val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), "a")() :: Nil, LocalRelation()) caseInsensitiveAnalyze(plan) } {code} Note that this kind of query is a corner case, but it is still valid SQL. An expression such as NULL IN (...) or NULL NOT IN (...) always gives NULL as a result, even if the list contains NULL. So it is safe to translate these expressions to Literal(null) during analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8613) Add a param for disabling of feature scaling, default to true
[ https://issues.apache.org/jira/browse/SPARK-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-8613. Resolution: Fixed Assignee: holdenk Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Issue resolved by pull request 7024 https://github.com/apache/spark/pull/7024 Add a param for disabling of feature scaling, default to true - Key: SPARK-8613 URL: https://issues.apache.org/jira/browse/SPARK-8613 Project: Spark Issue Type: Sub-task Components: ML Reporter: holdenk Assignee: holdenk Fix For: 1.5.0 Add a param to disable feature scaling. Do this distinct from disabling scaling in any particular alg incase someone wants to work on logistic while work in linear is in progress. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-8522: --- Comment: was deleted (was: Issue resolved by pull request 7024 [https://github.com/apache/spark/pull/7024]) Disable feature scaling in Linear and Logistic Regression - Key: SPARK-8522 URL: https://issues.apache.org/jira/browse/SPARK-8522 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk Fix For: 1.5.0 All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. `standardize` Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family=gaussian. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8601) Disable feature scaling in Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-8601: --- Assignee: holdenk Disable feature scaling in Linear Regression Key: SPARK-8601 URL: https://issues.apache.org/jira/browse/SPARK-8601 Project: Spark Issue Type: Sub-task Components: ML Reporter: holdenk Assignee: holdenk See parent task for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602585#comment-14602585 ] Sean Owen commented on SPARK-8646: -- You're saying it doesn't work at all on YARN? I'd hope there are some unit tests for this, but I am not sure if they cover this case. Do we know more about the likely issue here -- something isn't packaging pyspark, or not unpacking it? CC [~lianhuiwang] PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-8522. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7024 [https://github.com/apache/spark/pull/7024] Disable feature scaling in Linear and Logistic Regression - Key: SPARK-8522 URL: https://issues.apache.org/jira/browse/SPARK-8522 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk Fix For: 1.5.0 All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. `standardize` Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family=gaussian. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-8522) Disable feature scaling in Linear and Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reopened SPARK-8522: Disable feature scaling in Linear and Logistic Regression - Key: SPARK-8522 URL: https://issues.apache.org/jira/browse/SPARK-8522 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk Fix For: 1.5.0 All compressed sensing applications, and some of the regression use-cases will have better result by turning the feature scaling off. However, if we implement this naively by training the dataset without doing any standardization, the rate of convergency will not be good. This can be implemented by still standardizing the training dataset but we penalize each component differently to get effectively the same objective function but a better numerical problem. As a result, for those columns with high variances, they will be penalized less, and vice versa. Without this, since all the features are standardized, so they will be penalized the same. In R, there is an option for this. `standardize` Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family=gaussian. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8383. -- Resolution: Cannot Reproduce Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed - Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png, YARN_SparkJobCompleted.PNG Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker updated SPARK-8666: -- Description: I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a = (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? was: I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a = (a._1, 1L)).distinct.persist() newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? checkpointing does not take advantage of persisted/cached RDDs -- Key: SPARK-8666 URL: https://issues.apache.org/jira/browse/SPARK-8666 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a = (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7756) Ensure Spark runs clean on IBM Java implementation
[ https://issues.apache.org/jira/browse/SPARK-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602899#comment-14602899 ] Sean Owen commented on SPARK-7756: -- I agree, https://github.com/apache/spark/pull/6740 should have been separate. The other two PRs are logically related. Ensure Spark runs clean on IBM Java implementation -- Key: SPARK-7756 URL: https://issues.apache.org/jira/browse/SPARK-7756 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Tim Ellison Assignee: Tim Ellison Priority: Minor Fix For: 1.4.0 Spark should run successfully on the IBM Java implementation. This issue is to gather any minor issues seen running the tests and examples that are attributable to differences in Java vendor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.
Premchandra Preetham Kukillaya created SPARK-8659: - Summary: SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQL Thrift Service Hive's feature SQL based authorization is not working whereas SQL based Authorization works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. The problem is user X can do select on table belonging to user Y. I am using Hive .13.1 and Spark 1.3.1 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite
Xiangrui Meng created SPARK-8661: Summary: Update comments that contain R statements in ml.LinearRegressionSuite Key: SPARK-8661 URL: https://issues.apache.org/jira/browse/SPARK-8661 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Similar to SPARK-8660, but for ml.LinearRegressionSuite: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2
Chris Freeman created SPARK-8662: Summary: [SparkR] SparkSQL tests fail in R 3.2 Key: SPARK-8662 URL: https://issues.apache.org/jira/browse/SPARK-8662 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Chris Freeman Fix For: 1.4.0 SparkR tests for equality using `all.equal` on environments fail in R 3.2. This is due to a change in how equality between environments is handled in the new version of R. This should most likely not be a huge problem, we'll just have to rewrite some of the tests to be more fine-grained instead of testing equality on entire environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8652) PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests
[ https://issues.apache.org/jira/browse/SPARK-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8652. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7032 [https://github.com/apache/spark/pull/7032] PySpark tests sometimes forget to check return status of doctest.testmod(), masking failing tests - Key: SPARK-8652 URL: https://issues.apache.org/jira/browse/SPARK-8652 Project: Spark Issue Type: Bug Components: PySpark, Tests Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.5.0 Several PySpark files call {{doctest.testmod()}} in order to run doctests, but forget to check its return status. As a result, failures will not be automatically detected by our test runner script, creating the potential for bugs to slip through. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8655) DataFrameReader#option supports more than String as value
[ https://issues.apache.org/jira/browse/SPARK-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Nitschinger updated SPARK-8655: --- Description: I'm working on a custom data source, porting it from 1.3 to 1.4. On 1.3 I could easily extend the SparkSQL imports and get access to it, which meant I could use custom options right away. One of those is I pass a Filter down to my Relation for tighter schema inference against a schemaless database. So I would have something like: n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String = null) Since I want to move my API behind the DataFrameReader, the SQLContext is not available anymore, only through the RelationProvider, which I've implemented and it works nicely. The only problem I have now is that while I can pass in custom options, they are all String typed. So I have no way to pass down my optional Filter anymore (since parameters is a Map[String, String]). Would it be possible to extend the options so that more than just Strings can be passed in? Right now I probably need to work around that by documenting how people can pass in a string which I turn into a Filter, but that's somewhat hacky. Note that built-in impls like JSON or JDBC have no issues, because since they can access the SQLContext (private) without issues, they don't need to go through the decoupling of the RelationProvider and can do any custom arguments they want on their methods. was: I'm working on a custom data source, porting it from 1.3 to 1.4. On 1.3 I could easily extend the SparkSQL imports and get access to it, which meant I could use custom options right away. One of those is I pass a Filter down to my Relation for tighter schema inference against a schemaless database. So I would have something like: n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String = null) Since I want to move my API behind the DataFrameReader, the SQLContext is not available anymore, only through the RelationProvider, which I've implemented and it works nicely. The only problem I have now is that while I can pass in custom options, they are all String typed. So I have no way to pass down my optional Filter anymore (since parameters is a Map[String, String]). Would it be possible to extend the options so that more than just Strings can be passed in? Right now I probably need to work around that by documenting how people can pass in a string which I turn into a Filter, but that's somewhat hacky. DataFrameReader#option supports more than String as value - Key: SPARK-8655 URL: https://issues.apache.org/jira/browse/SPARK-8655 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Michael Nitschinger I'm working on a custom data source, porting it from 1.3 to 1.4. On 1.3 I could easily extend the SparkSQL imports and get access to it, which meant I could use custom options right away. One of those is I pass a Filter down to my Relation for tighter schema inference against a schemaless database. So I would have something like: n1ql(filter: Filter = null, userSchema: StructType = null, bucketName: String = null) Since I want to move my API behind the DataFrameReader, the SQLContext is not available anymore, only through the RelationProvider, which I've implemented and it works nicely. The only problem I have now is that while I can pass in custom options, they are all String typed. So I have no way to pass down my optional Filter anymore (since parameters is a Map[String, String]). 
Would it be possible to extend the options so that more than just Strings can be passed in? Right now I probably need to work around that by documenting how people can pass in a string which I turn into a Filter, but that's somewhat hacky. Note that built-in implementations like JSON or JDBC have no such problem: because they can access the (private) SQLContext directly, they don't need to go through the decoupling of the RelationProvider and can accept arbitrary custom arguments on their methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
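As an illustration of the string-encoding workaround described above, here is a minimal, hypothetical sketch: the option name "schemaFilter" and its "attr=value" encoding are invented, and only the Map[String, String] shape of createRelation comes from the data sources API.
{code}
// Hypothetical sketch of the workaround: since createRelation only receives
// Map[String, String], a typed value such as a Filter has to be encoded as a
// string by the caller and decoded here. "schemaFilter" and the "attr=value"
// encoding are made up for illustration.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, RelationProvider}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val filter: Option[Filter] = parameters.get("schemaFilter").map { spec =>
      val Array(attr, value) = spec.split("=", 2)
      EqualTo(attr, value)
    }
    // ... build the custom BaseRelation (e.g. the N1QL relation) using `filter` ...
    ???
  }
}
{code}
A caller would then write something like sqlContext.read.format("com.example.datasource").option("schemaFilter", "type=airline").load(), which is exactly the stringly-typed round trip the reporter calls hacky.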
[jira] [Assigned] (SPARK-8656) Spark Standalone master json API's worker number is not match web UI number
[ https://issues.apache.org/jira/browse/SPARK-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8656: --- Assignee: (was: Apache Spark) Spark Standalone master json API's worker number is not match web UI number --- Key: SPARK-8656 URL: https://issues.apache.org/jira/browse/SPARK-8656 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: thegiive Priority: Minor The Spark standalone master web UI shows the number of alive workers, the total and used cores of alive workers, and the total and used memory of alive workers. But the JSON API page http://MASTERURL:8088/json shows the worker count, cores and memory of all workers, so the web UI data is not in sync with the JSON API. The proper fix is to make the web UI and the JSON API report the same numbers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8656) Spark Standalone master json API's worker number is not match web UI number
[ https://issues.apache.org/jira/browse/SPARK-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8656: --- Assignee: Apache Spark Spark Standalone master json API's worker number is not match web UI number --- Key: SPARK-8656 URL: https://issues.apache.org/jira/browse/SPARK-8656 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: thegiive Assignee: Apache Spark Priority: Minor The Spark standalone master web UI shows the number of alive workers, the total and used cores of alive workers, and the total and used memory of alive workers. But the JSON API page http://MASTERURL:8088/json shows the worker count, cores and memory of all workers, so the web UI data is not in sync with the JSON API. The proper fix is to make the web UI and the JSON API report the same numbers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
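For illustration, a self-contained sketch (not the actual patch) of the alive-worker filtering that the web UI applies and that the /json endpoint reportedly omits; WorkerSummary is a toy stand-in for the master's internal worker records.
{code}
// Toy stand-in for the master's worker records; the real classes live in
// org.apache.spark.deploy.master and are not public API.
case class WorkerSummary(id: String, cores: Int, memoryMb: Long, state: String)

// Web-UI-style numbers: only workers in state ALIVE are counted.
def uiCounts(workers: Seq[WorkerSummary]): (Int, Int, Long) = {
  val alive = workers.filter(_.state == "ALIVE")
  (alive.size, alive.map(_.cores).sum, alive.map(_.memoryMb).sum)
}

// JSON-API-style numbers as reported: every registered worker, including dead ones.
def jsonCounts(workers: Seq[WorkerSummary]): (Int, Int, Long) =
  (workers.size, workers.map(_.cores).sum, workers.map(_.memoryMb).sum)
{code}
Applying the same alive-only filter in both places is the kind of synchronization the reporter asks for.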
[jira] [Commented] (SPARK-8615) sql programming guide recommends deprecated code
[ https://issues.apache.org/jira/browse/SPARK-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602695#comment-14602695 ] Apache Spark commented on SPARK-8615: - User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/7039 sql programming guide recommends deprecated code Key: SPARK-8615 URL: https://issues.apache.org/jira/browse/SPARK-8615 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.4.0 Reporter: Gergely Svigruha Priority: Minor The Spark 1.4 SQL programming guide has example code on how to use JDBC tables: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases sqlContext.load("jdbc", Map(...)) However this code compiles with a deprecation warning, which recommends this instead: sqlContext.read.format("jdbc").options(Map(...)).load() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
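For reference, the two forms from the report side by side, with placeholder JDBC options (the url and dbtable values are illustrative only):
{code}
import org.apache.spark.sql.SQLContext

def loadJdbcTable(sqlContext: SQLContext) = {
  val opts = Map(
    "url" -> "jdbc:postgresql:dbserver",   // placeholder connection string
    "dbtable" -> "schema.tablename")       // placeholder table name

  // Deprecated in 1.4 and the source of the compiler warning:
  // val df = sqlContext.load("jdbc", opts)

  // Form recommended by the warning and by the updated guide:
  sqlContext.read.format("jdbc").options(opts).load()
}
{code}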
[jira] [Updated] (SPARK-8657) Fail to upload conf archive to viewfs
[ https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Li updated SPARK-8657: -- Description: When I run in spark-1.4 yarn-client mode, I throws the following Exception when trying to upload conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661 .zip - viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had oop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.init(SparkContext.scala:497) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.init(console:9) at $line3.$read$$iwC.init(console:18) at $line3.$read.init(console:20) at $line3.$read$.init(console:24) at $line3.$read$.clinit(console) at $line3.$eval$.init(console:7) at $line3.$eval$.clinit(console) at $line3.$eval.$print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) The bug is easy to fix, we should pass the correct file system object to addResource. The similar issure is: https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very soon. 
The code in Client.scala is need to fix: was: When I run in spark-1.4 yarn-client mode, I throws the following Exception when trying to upload conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661 .zip - viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__had oop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at
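A hedged sketch of the kind of change described above, resolving the FileSystem from the destination path instead of reusing the default hdfs:// one; this is not the reporter's actual patch.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// For a viewfs://nsX/... destination this returns a viewfs FileSystem, so the
// later checkPath call no longer fails with "Wrong FS: hdfs://...".
def fsForDestination(dst: Path, hadoopConf: Configuration): FileSystem =
  dst.getFileSystem(hadoopConf)

// e.g. pass fsForDestination(destPath, conf) into
// ClientDistributedCacheManager.addResource instead of the default FileSystem.get(conf).
{code}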
[jira] [Created] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType
Antonio Jesus Navarro created SPARK-8658: Summary: AttributeReference equals method only compare name, exprId and dataType Key: SPARK-8658 URL: https://issues.apache.org/jira/browse/SPARK-8658 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.3.0 Reporter: Antonio Jesus Navarro The AttributeReference equals method only treats objects as different when they differ in name, expression id or dataType. With this behavior, when I do a transformExpressionsDown and try to transform the qualifiers inside AttributeReferences, the objects are not replaced, because the transformer considers them equal. I propose that the equals method compare these variables: name, dataType, nullable, metadata, exprId, qualifiers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
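A runnable toy illustration of the proposed behaviour, using a stand-in case class rather than Catalyst's actual AttributeReference; the field names mirror the ticket's list.
{code}
// Stand-in for AttributeReference; with qualifiers part of the (case-class)
// equality, two references that differ only in qualifiers compare as unequal.
case class AttrRef(
    name: String,
    dataType: String,
    nullable: Boolean,
    exprId: Long,
    qualifiers: Seq[String])

object QualifierEqualityDemo extends App {
  val a = AttrRef("col", "int", nullable = true, exprId = 1L, qualifiers = Seq("t1"))
  val b = a.copy(qualifiers = Seq("t2"))

  // Under the proposed equality, a transformExpressionsDown that rewrites
  // qualifiers would see a changed node and actually replace it.
  assert(a != b)
  println(s"a == b? ${a == b}")
}
{code}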
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602819#comment-14602819 ] Kousuke Saruta commented on SPARK-5768: --- [~srowen] Oh, I see. Thanks for letting me know. If 1.4.1 is released without another RC, I'll modify the fix version. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Assignee: Rekha Joshi Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
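A back-of-the-envelope sketch of why roughly 20 GB is displayed for a 40 GB executor, assuming the pre-1.6 legacy storage-memory settings (spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9): the UI column reports storage memory rather than the full heap, and JVM overhead shaves off a little more.
{code}
object StorageMemoryEstimate extends App {
  val executorMemoryGb = 40.0
  val memoryFraction   = 0.6   // spark.storage.memoryFraction default (legacy)
  val safetyFraction   = 0.9   // spark.storage.safetyFraction default (legacy)

  // Rough upper bound on the "memory" column in the Executors tab for this setup.
  val approxStorageGb = executorMemoryGb * memoryFraction * safetyFraction
  println(f"approx storage memory shown in the UI: $approxStorageGb%.1f GB")
}
{code}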
[jira] [Resolved] (SPARK-8302) Support heterogeneous cluster nodes on YARN
[ https://issues.apache.org/jira/browse/SPARK-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-8302. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6752 [https://github.com/apache/spark/pull/6752] Support heterogeneous cluster nodes on YARN --- Key: SPARK-8302 URL: https://issues.apache.org/jira/browse/SPARK-8302 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Fix For: 1.5.0 Some of our customers install Hadoop on different paths across the cluster. When running a Spark app, this leads to a few complications because of how we try to reuse the rest of Hadoop. Since all configuration for a Spark-on-YARN application is local, the code does not have enough information about how to run things on the rest of the cluster in such cases. To illustrate: let's say that a node's configuration says that {{SPARK_DIST_CLASSPATH=/disk1/hadoop/lib/*}}. If I launch a Spark app from that machine, but there's a machine on the cluster where Hadoop is actually installed in {{/disk2/hadoop/lib}}, then any container launched on that node will fail. The problem does not exist (or is much less pronounced) on standalone and mesos since they require a local Spark installation and configuration. It would be nice if we could easily support this use case on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()
[ https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603094#comment-14603094 ] Shivaram Venkataraman commented on SPARK-8409: -- If you open those links it says at the top 'This post has NOT been accepted by the mailing list yet' -- I don't use nabble, so I can't comment on why that is happening. If you can't get mailing lists to work, please post to StackOverflow with the tag apache-spark, sparkr . The JIRA is not something we use for supporting users in the Spark project. In windows cant able to read .csv or .json files using read.df() - Key: SPARK-8409 URL: https://issues.apache.org/jira/browse/SPARK-8409 Project: Spark Issue Type: Bug Components: SparkR, Windows Affects Versions: 1.4.0 Environment: sparkR API Reporter: Arun Priority: Critical Hi, In SparkR shell, I invoke: mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json, header=false) I have tried various filetypes (csv, txt), all fail. in sparkR of spark 1.4 for eg.) df_1- read.df(sqlContext, E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv, source = csv) RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603134#comment-14603134 ] Marcelo Vanzin commented on SPARK-8372: --- bq. The log path name may also end with an attempt id I'm not saying the single line patch I posted is the answer, I was just pointing out the current patch in master caused a regression. History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Carson Wang Priority: Minor Fix For: 1.4.1, 1.5.0 Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application like App ID.inprogress. This app info will never disappear even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603175#comment-14603175 ] Glenn Strycker commented on SPARK-8666: --- I added a stackoverflow question to parallel this ticket: http://stackoverflow.com/questions/31078350/spark-rdd-checkpoint-on-persisted-cached-rdds-are-performing-the-dag-twice One idea I had is that maybe I have to materialize twice? {noformat} // this will create the RDD and cache, when materialized val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) print(newRDD.count()) // will this now checkpoint FROM THE EXISTING CACHE IN MEMORY? newRDD.checkpoint print(newRDD.count()) {noformat} checkpointing does not take advantage of persisted/cached RDDs -- Key: SPARK-8666 URL: https://issues.apache.org/jira/browse/SPARK-8666 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
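A hedged, self-contained variant of the snippet above, following the usual guidance of persisting before checkpoint() and materializing once, so the separate checkpoint-writing job can read the cached partitions instead of re-running the lineage; the checkpoint directory is a placeholder.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointFromCacheSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]"))
  sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

  val prevRDD = sc.parallelize(1 to 1000).map(i => (i % 10, i))

  val newRDD = prevRDD.map(a => (a._1, 1L)).distinct()
    .persist(StorageLevel.MEMORY_AND_DISK_SER)

  newRDD.checkpoint()       // mark for checkpointing before the first action
  println(newRDD.count())   // materializes the cache; the checkpoint job runs after this
  println(newRDD.count())   // second action reads cached/checkpointed data only

  sc.stop()
}
{code}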
[jira] [Commented] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603189#comment-14603189 ] Glenn Strycker commented on SPARK-8666: --- Looks like this ticket is a duplicate of https://issues.apache.org/jira/browse/SPARK-8582 checkpointing does not take advantage of persisted/cached RDDs -- Key: SPARK-8666 URL: https://issues.apache.org/jira/browse/SPARK-8666 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist(StorageLevel.MEMORY_AND_DISK_SER) newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8666) checkpointing does not take advantage of persisted/cached RDDs
Glenn Strycker created SPARK-8666: - Summary: checkpointing does not take advantage of persisted/cached RDDs Key: SPARK-8666 URL: https://issues.apache.org/jira/browse/SPARK-8666 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker I have been noticing that when checkpointing RDDs, all operations are occurring TWICE. For example, when I run the following code and watch the stages... {noformat} val newRDD = prevRDD.map(a => (a._1, 1L)).distinct.persist() newRDD.checkpoint print(newRDD.count()) {noformat} I see distinct and count operations appearing TWICE, and shuffle disk writes and reads (from the distinct) occurring TWICE. My newRDD is persisted to memory, why can't the checkpoint simply save those partitions to disk when the first operations have completed? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603166#comment-14603166 ] Zhan Zhang commented on SPARK-2883: --- [~philclaridge] Please refer to the test case in the trunk on how to use it. saveAsOrcFile/orcFile is removed from upstream. Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: New Feature Components: Input/Output, SQL Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Critical Fix For: 1.4.0 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in spark, fix issues if exists and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
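A hedged sketch of the DataFrame-based replacement for the removed saveAsOrcFile/orcFile helpers, assuming the orc data source that ships with the Hive module (a HiveContext is required); the JSON input path is just an example dataset.
{code}
import org.apache.spark.sql.hive.HiveContext

def roundTripOrc(hiveContext: HiveContext, orcPath: String): Unit = {
  val df = hiveContext.read.json("examples/src/main/resources/people.json") // any DataFrame
  df.write.format("orc").save(orcPath)                      // replaces saveAsOrcFile
  val loaded = hiveContext.read.format("orc").load(orcPath) // replaces orcFile
  loaded.show()
}
{code}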
[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Premchandra Preetham Kukillaya updated SPARK-8659: -- Description: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true was: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. 
--- Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf
[jira] [Updated] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8660: - Remaining Estimate: 20m Original Estimate: 20m Update comments that contain R statements in ml.logisticRegressionSuite --- Key: SPARK-8660 URL: https://issues.apache.org/jira/browse/SPARK-8660 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Trivial Labels: starter Original Estimate: 20m Remaining Estimate: 20m We put R statements as comments in unit test. However, there are two issues: 1. JavaDoc style /** ... */ is used instead of normal multiline comment /* ... */. 2. We put a leading * on each line. It is hard to copy paste the commands to/from R and verify the result. For example, in https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504 {code} /** * Using the following R code to load the data and train the model using glmnet package. * * library(glmnet) * data - read.csv(path, header=FALSE) * label = factor(data$V1) * features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) * weights = coef(glmnet(features,label, family=binomial, alpha = 1.0, lambda = 6.0)) * weights * 5 x 1 sparse Matrix of class dgCMatrix * s0 * (Intercept) -0.2480643 * data.V2 0.000 * data.V3 . * data.V4 . * data.V5 . */ {code} should change to {code} /* Using the following R code to load the data and train the model using glmnet package. library(glmnet) data - read.csv(path, header=FALSE) label = factor(data$V1) features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) weights = coef(glmnet(features,label, family=binomial, alpha = 1.0, lambda = 6.0)) weights 5 x 1 sparse Matrix of class dgCMatrix s0 (Intercept) -0.2480643 data.V2 0.000 data.V3 . data.V4 . data.V5 . */ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8660) Update comments that contain R statements in ml.logisticRegressionSuite
Xiangrui Meng created SPARK-8660: Summary: Update comments that contain R statements in ml.logisticRegressionSuite Key: SPARK-8660 URL: https://issues.apache.org/jira/browse/SPARK-8660 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Trivial We put R statements as comments in unit test. However, there are two issues: 1. JavaDoc style /** ... */ is used instead of normal multiline comment /* ... */. 2. We put a leading * on each line. It is hard to copy paste the commands to/from R and verify the result. For example, in https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L504 {code} /** * Using the following R code to load the data and train the model using glmnet package. * * library(glmnet) * data - read.csv(path, header=FALSE) * label = factor(data$V1) * features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) * weights = coef(glmnet(features,label, family=binomial, alpha = 1.0, lambda = 6.0)) * weights * 5 x 1 sparse Matrix of class dgCMatrix * s0 * (Intercept) -0.2480643 * data.V2 0.000 * data.V3 . * data.V4 . * data.V5 . */ {code} should change to {code} /* Using the following R code to load the data and train the model using glmnet package. library(glmnet) data - read.csv(path, header=FALSE) label = factor(data$V1) features = as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5)) weights = coef(glmnet(features,label, family=binomial, alpha = 1.0, lambda = 6.0)) weights 5 x 1 sparse Matrix of class dgCMatrix s0 (Intercept) -0.2480643 data.V2 0.000 data.V3 . data.V4 . data.V5 . */ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Premchandra Preetham Kukillaya updated SPARK-8659: -- Description: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined and its a data security risk. I am using Hive .13.1 and Spark 1.3.1 and here is the list arguments passed to Spark SQL Thrift Server. ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive was: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. --- Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined and its a data security risk. I am using Hive .13.1 and Spark 1.3.1 and here is the list arguments passed to Spark SQL Thrift Server. 
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8661) Update comments that contain R statements in ml.LinearRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-8661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8661: - Remaining Estimate: 20m Original Estimate: 20m Update comments that contain R statements in ml.LinearRegressionSuite - Key: SPARK-8661 URL: https://issues.apache.org/jira/browse/SPARK-8661 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Labels: starter Original Estimate: 20m Remaining Estimate: 20m Similar to SPARK-8660, but for ml.LinearRegressionSuite: https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8559) Support association rule generation in FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8559: - Shepherd: Xiangrui Meng Support association rule generation in FPGrowth --- Key: SPARK-8559 URL: https://issues.apache.org/jira/browse/SPARK-8559 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guangwen Liu Assignee: Feynman Liang It will be more useful and practical to include the association rule generation part for real applications, even though it is not hard for a user to derive association rules from the frequent itemsets with frequencies that FP-Growth outputs. However, how to generate association rules efficiently is not widely reported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
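A toy, driver-side sketch of the rule-generation step requested here, over plain Scala collections rather than MLlib's RDD of FreqItemset: confidence(antecedent => consequent) = freq(antecedent ∪ consequent) / freq(antecedent); the itemsets, frequencies and 0.6 threshold are made up.
{code}
object AssociationRuleSketch extends App {
  // Frequent itemsets with their absolute frequencies (toy values).
  val freqItemsets: Map[Set[String], Long] =
    Map(Set("a") -> 8L, Set("b") -> 6L, Set("a", "b") -> 5L)

  val minConfidence = 0.6

  // Single-consequent rules: for each itemset of size > 1, peel off one item
  // as the consequent and use the rest as the antecedent.
  val rules = for {
    (itemset, freq) <- freqItemsets.toSeq
    if itemset.size > 1
    consequent <- itemset
    antecedent = itemset - consequent
    antecedentFreq <- freqItemsets.get(antecedent)
    confidence = freq.toDouble / antecedentFreq
    if confidence >= minConfidence
  } yield (antecedent, consequent, confidence)

  rules.foreach { case (ante, cons, conf) =>
    println(s"${ante.mkString(",")} => $cons (confidence $conf)")
  }
}
{code}
A distributed version would do the same peel-and-divide step over the frequent-itemset RDD, joining each candidate antecedent back against the itemset frequencies.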
[jira] [Commented] (SPARK-8664) Add PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603043#comment-14603043 ] Yanbo Liang commented on SPARK-8664: [~mengxr] I am already working on it, please assign it to me. Add PCA transformer --- Key: SPARK-8664 URL: https://issues.apache.org/jira/browse/SPARK-8664 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add PCA transformer for ML pipeline -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()
[ https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603053#comment-14603053 ] Arun commented on SPARK-8409: - http://apache-spark-user-list.1001560.n3.nabble.com/How-to-row-bind-two-data-frames-in-SparkR-td23502.html This is the link I posted In windows cant able to read .csv or .json files using read.df() - Key: SPARK-8409 URL: https://issues.apache.org/jira/browse/SPARK-8409 Project: Spark Issue Type: Bug Components: SparkR, Windows Affects Versions: 1.4.0 Environment: sparkR API Reporter: Arun Priority: Critical Hi, In SparkR shell, I invoke: mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json, header=false) I have tried various filetypes (csv, txt), all fail. in sparkR of spark 1.4 for eg.) df_1- read.df(sqlContext, E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv, source = csv) RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
[jira] [Commented] (SPARK-8410) Hive VersionsSuite RuntimeException
[ https://issues.apache.org/jira/browse/SPARK-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602993#comment-14602993 ] Josiah Samuel Sathiadass commented on SPARK-8410: - Captured some logs from 2 servers ( server 1 - with the issue, server 2 - w/o the issue ) The below text is collected from file :: ~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml Server 1: ( looks for the respective jars in the local-m2-cache) entries from a machine where the problem is module organisation=org.codehaus.groovy name=groovy-all revision name=2.1.6 status=release pubdate=20150618023837 resolver=local-m2-cache artresolver=local-m2-cache homepage=http://groovy.codehaus.org/; downloaded=false searched=false default=false conf=compile, master(*), runtime, compile(*), runtime(*), master position=52 license name=The Apache Software License, Version 2.0 url=http://www.apache.org/licenses/LICENSE-2.0.txt/ metadata-artifact status=no details= size=5591 time=0 location=/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml searched=false original-local-location=/home/joe/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom origin-is-local=true origin-location=file:/home/joe/.m2/repository/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom/ caller organisation=org.apache.hive name=hive-exec conf=default, compile, runtime, master rev=2.1.6 rev-constraint-default=2.1.6 rev-constraint-dynamic=2.1.6 callerrev=0.13.1/ artifacts artifact name=groovy-all type=jar ext=jar status=failed details=missing artifact size=0 time=0/ /artifacts /revision /module entries from a machine where the problem is Server 2: ( looks for the respective jars in the central) entries from a machine where it works module organisation=org.codehaus.groovy name=groovy-all revision name=2.1.6 status=release pubdate=20130709121712 resolver=central artresolver=central homepage=http://groovy.codehaus.org/; downloaded=false searched=false default=false conf=compile, master(*), runtime, compile(*), runtime(*), master position=52 license name=The Apache Software License, Version 2.0 url=http://www.apache.org/licenses/LICENSE-2.0.txt/ metadata-artifact status=no details= size=5591 time=0 location=/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml searched=false original-local-location=/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/ivy-2.1.6.xml.original origin-is-local=false origin-location=https://repo1.maven.org/maven2/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.pom/ caller organisation=org.apache.hive name=hive-exec conf=default, compile, runtime, master rev=2.1.6 rev-constraint-default=2.1.6 rev-constraint-dynamic=2.1.6 callerrev=0.13.1/ artifacts artifact name=groovy-all type=jar ext=jar status=no details= size=6377448 time=0 location=/home/joe/.ivy2/cache/org.codehaus.groovy/groovy-all/jars/groovy-all-2.1.6.jar origin-location is-local=false location=https://repo1.maven.org/maven2/org/codehaus/groovy/groovy-all/2.1.6/groovy-all-2.1.6.jar/ /artifact /artifacts /revision /module entries from a machine where it works Need some help from the community to identify from where ivy is picking up the settings to populate this file ~/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml so that I can narrow down the problem. Thanks, Joe. 
Hive VersionsSuite RuntimeException --- Key: SPARK-8410 URL: https://issues.apache.org/jira/browse/SPARK-8410 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Environment: IBM Power system - P7 running Ubuntu 14.04LE Reporter: Josiah Samuel Sathiadass Assignee: Burak Yavuz Priority: Minor While testing Spark Project Hive, there are RuntimeExceptions as follows, VersionsSuite: - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: org.codehaus.groovy#groovy-all;2.1.6!groovy-all.jar, download failed: asm#asm;3.2!asm.jar] at
[jira] [Comment Edited] (SPARK-8663) Dirver will be hang if there is a job submit during SparkContex stop Interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602987#comment-14602987 ] yuemeng edited comment on SPARK-8663 at 6/26/15 2:53 PM: - The driver log like: 15/06/25 23:16:16 INFO DAGScheduler: Executor lost: 1 (epoch 1) 15/06/25 23:16:16 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/06/25 23:16:16 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(1, 9.96.1.223, 23577) 15/06/25 23:16:16 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 15/06/25 23:16:45 ERROR ContextCleaner: Error cleaning broadcast 3512 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:199) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:150) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:150) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) 15/06/25 23:16:45 INFO DAGScheduler: Stopping DAGScheduler 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Shutting down all executors 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 15/06/25 23:16:45 INFO DAGScheduler: Job 3555 failed: count at console:18, took 29.811052 s 15/06/25 23:16:45 INFO DAGScheduler: Job 3539 failed: count at console:18, took 30.089501 s 15/06/25 23:16:45 INFO DAGScheduler: Job 3553 failed: count at console:18, took 29.842839 s 15/06/25 23:16:45 WARN BlockManagerMaster: Failed to remove broadcast 3512 with removeFromMaster = true - Ask timed out on [Actor[akka.tcp://sparkExecutor@DS-222:23604/user/BlockManagerActor1#1981879442]] after [3 ms]} calcFunc start calcFunc start 15/06/25 23:16:45 INFO DAGScheduler: Job 3554 failed: count at console:18, took 29.827635 s 15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18 15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: 
Stopped 15/06/25 23:16:45 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkYarnAM@DS-222:23129]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: DS-222/9.96.1.222:23129 15/06/25 23:16:46 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 15/06/25 23:16:46 INFO MemoryStore: MemoryStore cleared 15/06/25 23:16:46 INFO BlockManager: BlockManager stopped 15/06/25 23:16:46 INFO BlockManagerMaster: BlockManagerMaster stopped 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/06/25 23:16:46 INFO SparkContext: Successfully stopped SparkContext And the driver Thread dump like: ForkJoinPool-3-worker-3 daemon prio=10 tid=0x00991000 nid=0x3dab waiting on condition [0x7fc9507dd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to
[jira] [Assigned] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2
[ https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8662: --- Assignee: Apache Spark [SparkR] SparkSQL tests fail in R 3.2 - Key: SPARK-8662 URL: https://issues.apache.org/jira/browse/SPARK-8662 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Chris Freeman Assignee: Apache Spark Fix For: 1.4.0 SparkR tests for equality using `all.equal` on environments fail in R 3.2. This is due to a change in how equality between environments is handled in the new version of R. This should most likely not be a huge problem, we'll just have to rewrite some of the tests to be more fine-grained instead of testing equality on entire environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8409) In windows cant able to read .csv or .json files using read.df()
[ https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603030#comment-14603030 ] Shivaram Venkataraman commented on SPARK-8409: -- I don't see your email in the Spark user mailing list. I think one needs to subscribe to the list first to be able to post. You can send an email to user-subscr...@spark.apache.org to subscribe (See http://www.apache.org/foundation/mailinglists.html for more details). In windows cant able to read .csv or .json files using read.df() - Key: SPARK-8409 URL: https://issues.apache.org/jira/browse/SPARK-8409 Project: Spark Issue Type: Bug Components: SparkR, Windows Affects Versions: 1.4.0 Environment: sparkR API Reporter: Arun Priority: Critical Hi, In SparkR shell, I invoke: mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json, header=false) I have tried various filetypes (csv, txt), all fail. in sparkR of spark 1.4 for eg.) df_1- read.df(sqlContext, E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv, source = csv) RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not
[jira] [Resolved] (SPARK-8409) In Windows, unable to read .csv or .json files using read.df()
[ https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8409. -- Resolution: Not A Problem In windows cant able to read .csv or .json files using read.df() - Key: SPARK-8409 URL: https://issues.apache.org/jira/browse/SPARK-8409 Project: Spark Issue Type: Bug Components: SparkR, Windows Affects Versions: 1.4.0 Environment: sparkR API Reporter: Arun Priority: Critical Hi, In SparkR shell, I invoke: mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json, header=false) I have tried various filetypes (csv, txt), all fail. in sparkR of spark 1.4 for eg.) df_1- read.df(sqlContext, E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv, source = csv) RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at
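For reference, the root cause in the stack trace above is that the supplied path is resolved against the cluster's default filesystem (HDFS here), so the load fails with "Input path does not exist: hdfs://...". A minimal sketch of the equivalent load in Scala, assuming the Spark 1.4 DataFrameReader API; the paths below are illustrative, not taken from the report:
{code:scala}
// Minimal sketch (Spark 1.4 DataFrameReader); paths are illustrative only.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("read-df-paths"))
val sqlContext = new SQLContext(sc)

// A bare path is resolved against the default filesystem (HDFS on this cluster),
// which is why the load above fails with "Input path does not exist: hdfs://...".
val fromHdfs = sqlContext.read.format("json").load("/user/someuser/usaf.json")

// To read a file that only exists on the driver's local disk, use an explicit file:// URI
// (in non-local deployments the file must also be reachable from the executors).
val fromLocal = sqlContext.read.format("json").load("file:///C:/data/usaf.json")
{code}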
[jira] [Commented] (SPARK-8521) Feature Transformers in 1.5
[ https://issues.apache.org/jira/browse/SPARK-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603036#comment-14603036 ] Yanbo Liang commented on SPARK-8521: Yes, I agree. I will open a jira and work on it. Feature Transformers in 1.5 --- Key: SPARK-8521 URL: https://issues.apache.org/jira/browse/SPARK-8521 Project: Spark Issue Type: Umbrella Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is a list of feature transformers we plan to add in Spark 1.5. Feel free to propose useful transformers that are not on the list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8665) Update ALS documentation to include performance tips
[ https://issues.apache.org/jira/browse/SPARK-8665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8665: - Remaining Estimate: 1h Original Estimate: 1h Update ALS documentation to include performance tips Key: SPARK-8665 URL: https://issues.apache.org/jira/browse/SPARK-8665 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Original Estimate: 1h Remaining Estimate: 1h With the new ALS implementation, users still need to deal with computation/communication trade-offs. It would be nice to document this clearly based on the issues on the mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8665) Update ALS documentation to include performance tips
Xiangrui Meng created SPARK-8665: Summary: Update ALS documentation to include performance tips Key: SPARK-8665 URL: https://issues.apache.org/jira/browse/SPARK-8665 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng With the new ALS implementation, users still need to deal with computation/communication trade-offs. It would be nice to document this clearly based on the issues on the mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
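For context, the computation/communication trade-offs mentioned above are controlled by a handful of ALS parameters. A minimal sketch of the knobs such documentation would likely cover, assuming the spark.ml ALS API; all values and the checkpoint path are illustrative:
{code:scala}
// Illustrative sketch of ALS computation/communication knobs (spark.ml API); values are examples only.
import org.apache.spark.ml.recommendation.ALS

sc.setCheckpointDir("/tmp/als-checkpoints")  // assumed path; checkpointing truncates long lineages

val als = new ALS()
  .setRank(10)                 // larger rank means more computation and larger factor shuffles
  .setMaxIter(10)
  .setRegParam(0.1)
  .setNumUserBlocks(20)        // more blocks give smaller tasks but more shuffle traffic
  .setNumItemBlocks(20)
  .setCheckpointInterval(5)    // bound lineage growth across iterations

// val model = als.fit(ratingsDF)  // ratingsDF: DataFrame with user, item, and rating columns
{code}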
[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive 0.13 does not work when pointing a JDBC application to the Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Premchandra Preetham Kukillaya updated SPARK-8659: -- Description: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive was: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. --- Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. 
The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
[jira] [Created] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
yuemeng created SPARK-8663: -- Summary: Dirver will be hang if there is a job submit during SparkContex stop Interval Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.1.1, 1.0.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64 Reporter: yuemeng Fix For: 1.2.2, 1.1.1, 1.0.0 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
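The reproduction code in the description appears to have lost its quotes and comparison operators in the tracker. A cleaned-up version of the same sketch (same names and sleeps as in the report, assuming the Spark 1.x RDD API and the delayed stop() described in step 1):
{code:scala}
// Cleaned-up reproduction sketch from the description; step 1 in the report
// (artificially delaying DAGScheduler/SparkContext stop()) is required to hit the race.
import org.apache.spark.{SparkConf, SparkContext}

object DriverThreadTest {
  def main(args: Array[String]): Unit = {
    val sconf = new SparkConf().setAppName("TestJobWaitor")
    val sc = new SparkContext(sconf)
    Thread.sleep(5000)

    val t = new Thread {
      override def run(): Unit = {
        while (true) {
          try {
            val rdd = sc.parallelize(1 to 1000)
            var i = 0
            println("calcfunc start")
            while (i < 10) { i += 1; rdd.count() }
            println("calcfunc end")
          } catch {
            case e: Exception => e.printStackTrace()
          }
        }
      }
    }
    t.start()

    val t2 = new Thread {
      override def run(): Unit = {
        Thread.sleep(2000)
        println("stop sc thread")
        sc.stop()
        println("sc already stoped")
      }
    }
    t2.start()
  }
}
{code}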
[jira] [Updated] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuemeng updated SPARK-8663: --- Affects Version/s: (was: 1.2.0) (was: 1.1.1) (was: 1.0.0) 1.2.2 1.3.0 Dirver will be hang if there is a job submit during SparkContex stop Interval - Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2, 1.3.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) Reporter: yuemeng Fix For: 1.0.0, 1.1.1, 1.2.2 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603016#comment-14603016 ] Robin East commented on SPARK-3650: --- What is the status of this issue? A user on the mailing list just ran into this issue. It looks like PR-2495 should fix the issue. Is there a version that is being targeted for the fix? Triangle Count handles reverse edges incorrectly Key: SPARK-3650 URL: https://issues.apache.org/jira/browse/SPARK-3650 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0, 1.2.0 Reporter: Joseph E. Gonzalez Priority: Critical The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However, the TriangleCount algorithm does not verify that this condition holds, and indeed even the unit tests exploit this functionality:
{code:scala}
val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
val rawEdges = sc.parallelize(triangles, 2)
val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
val triangleCount = graph.triangleCount()
val verts = triangleCount.vertices
verts.collect().foreach { case (vid, count) =>
  if (vid == 0) {
    assert(count === 4) // <-- Should be 2
  } else {
    assert(count === 2) // <-- Should be 1
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
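Until the algorithm verifies this precondition itself, callers can canonicalize edges before calling triangleCount(). A minimal workaround sketch, assuming the GraphX 1.x API and the rawEdges RDD from the test above:
{code:scala}
// Illustrative workaround: enforce the documented precondition (srcId < dstId, no duplicates)
// before calling triangleCount(); rawEdges is an RDD[(Long, Long)] as in the test above.
import org.apache.spark.graphx.{Graph, PartitionStrategy}

val canonicalEdges = rawEdges
  .map { case (a, b) => if (a < b) (a, b) else (b, a) }
  .filter { case (a, b) => a != b }   // drop self-loops
  .distinct()                         // drop reverse/duplicate edges

val canonicalGraph = Graph.fromEdgeTuples(canonicalEdges, defaultValue = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val counts = canonicalGraph.triangleCount().vertices
{code}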
[jira] [Updated] (SPARK-8664) Add PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8664: - Assignee: Yanbo Liang Add PCA transformer --- Key: SPARK-8664 URL: https://issues.apache.org/jira/browse/SPARK-8664 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Add PCA transformer for ML pipeline -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuemeng updated SPARK-8663: --- Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) (was: SUSE Linux Enterprise Server 11 SP3 (x86_64) Dirver will be hang if there is a job submit during SparkContex stop Interval - Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.1, 1.2.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) Reporter: yuemeng Fix For: 1.0.0, 1.1.1, 1.2.2 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2
[ https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602992#comment-14602992 ] Apache Spark commented on SPARK-8662: - User 'cafreeman' has created a pull request for this issue: https://github.com/apache/spark/pull/7045 [SparkR] SparkSQL tests fail in R 3.2 - Key: SPARK-8662 URL: https://issues.apache.org/jira/browse/SPARK-8662 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Chris Freeman Fix For: 1.4.0 SparkR tests for equality using `all.equal` on environments fail in R 3.2. This is due to a change in how equality between environments is handled in the new version of R. This should most likely not be a huge problem, we'll just have to rewrite some of the tests to be more fine-grained instead of testing equality on entire environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2
[ https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8662: --- Assignee: (was: Apache Spark) [SparkR] SparkSQL tests fail in R 3.2 - Key: SPARK-8662 URL: https://issues.apache.org/jira/browse/SPARK-8662 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Chris Freeman Fix For: 1.4.0 SparkR tests for equality using `all.equal` on environments fail in R 3.2. This is due to a change in how equality between environments is handled in the new version of R. This should most likely not be a huge problem, we'll just have to rewrite some of the tests to be more fine-grained instead of testing equality on entire environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603001#comment-14603001 ] yuemeng commented on SPARK-8663: I think the reason becasue: 1)eventProcessActor ! JobSubmitted( jobId, rdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties) waiter } //eventProcessActor had dead, and this meassage sent to deadmailbox.so it will be lost waiter, 2) def awaitResult(): JobResult = synchronized { while (!_jobFinished) { this.wait() } return jobResult } //this will enter loop stituation Dirver will be hang if there is a job submit during SparkContex stop Interval - Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.1, 1.2.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) Reporter: yuemeng Fix For: 1.0.0, 1.1.1, 1.2.2 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
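Per the comment above, the JobSubmitted message is delivered to the stopped event actor's dead letters, so JobWaiter.awaitResult() waits forever. Until that is fixed in Spark, application code that submits jobs from background threads can coordinate with the thread that calls sc.stop(). A minimal application-level sketch (the flag below is hypothetical, not a Spark API):
{code:scala}
// Application-level guard (not a Spark API): stop submitting new jobs once shutdown
// has been requested, so no job can race with sc.stop() and wait forever on a lost message.
import java.util.concurrent.atomic.AtomicBoolean

val shuttingDown = new AtomicBoolean(false)

val worker = new Thread {
  override def run(): Unit = {
    while (!shuttingDown.get()) {
      val rdd = sc.parallelize(1 to 1000)
      rdd.count()
    }
  }
}
worker.start()

// In the thread that tears down the context:
shuttingDown.set(true)   // ask the worker to stop submitting first
worker.join()            // wait for in-flight jobs to finish
sc.stop()
{code}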
[jira] [Updated] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuemeng updated SPARK-8663: --- Fix Version/s: (was: 1.1.1) (was: 1.0.0) 1.3.0 Dirver will be hang if there is a job submit during SparkContex stop Interval - Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2, 1.3.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) Reporter: yuemeng Fix For: 1.2.2, 1.3.0 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8664) Add PCA transformer
Yanbo Liang created SPARK-8664: -- Summary: Add PCA transformer Key: SPARK-8664 URL: https://issues.apache.org/jira/browse/SPARK-8664 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Add PCA transformer for ML pipeline -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive 0.13 does not work when pointing a JDBC application to the Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Premchandra Preetham Kukillaya updated SPARK-8659: -- Description: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true was: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. 
--- Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working. It ignores the security settings passed through the command line. The arguments for command line is given below for reference The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 and here is the arguments passed to Spark ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf
[jira] [Updated] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuemeng updated SPARK-8663: --- Target Version/s: (was: 1.0.0, 1.1.1, 1.2.2) Dirver will be hang if there is a job submit during SparkContex stop Interval - Key: SPARK-8663 URL: https://issues.apache.org/jira/browse/SPARK-8663 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.2, 1.3.0 Environment: SUSE Linux Enterprise Server 11 SP3 (x86_64) Reporter: yuemeng Fix For: 1.2.2, 1.3.0 Driver process will be hang if a job had submit during sc.stop Interval.This interval mean from start stop SparkContext to finish . The probability of this situation is very small,but If present, will cause driver process never exit. Reproduce step: 1)modify source code to make SparkContext stop() method sleep 2s in my situation,i make DAGScheduler stop method sleep 2s 2)submit an application ,code like: object DriverThreadTest { def main(args: Array[String]) { val sconf = new SparkConf().setAppName(TestJobWaitor) val sc= new SparkContext(sconf) Thread.sleep(5000) val t = new Thread { override def run() { while (true) { try { val rdd = sc.parallelize( 1 to 1000) var i = 0 println(calcfunc start) while ( i 10){ i+=1 rdd.count } println(calcfunc end) }catch{ case e: Exception = e.printStackTrace() } } } } t.start() val t2 = new Thread { override def run() { Thread.sleep(2000) println(stop sc thread) sc.stop() println(sc already stoped) } } t2.start() } } driver will be never exit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8664) Add PCA transformer
[ https://issues.apache.org/jira/browse/SPARK-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8664: - Remaining Estimate: 24h Original Estimate: 24h Add PCA transformer --- Key: SPARK-8664 URL: https://issues.apache.org/jira/browse/SPARK-8664 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.5.0 Reporter: Yanbo Liang Assignee: Yanbo Liang Original Estimate: 24h Remaining Estimate: 24h Add PCA transformer for ML pipeline -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8659) SQL Standard Based Hive Authorisation of Hive 0.13 does not work when pointing a JDBC application to the Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Premchandra Preetham Kukillaya updated SPARK-8659: -- Description: It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true was: It seems like while pointing JDBC/ODBC Driver to Spark SQL Thrift Service Hive's feature SQL based authorization is not working whereas SQL based Authorization works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. The problem is user X can do select on table belonging to user Y. I am using Hive .13.1 and Spark 1.3.1 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true --hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver --hiveconf javax.jdo.option.ConnectionUserName=hive --hiveconf javax.jdo.option.ConnectionPassword=hive --hiveconf hive.metastore.warehouse.dir=/user/hive/warehouse --hiveconf hive.metastore.connect.retries=5 --hiveconf datanucleus.fixedDatastore=true SQL Standard Based Hive Authorisation of Hive.13 does not work while pointing JDBC Application to Spark Thrift Server. 
--- Key: SPARK-8659 URL: https://issues.apache.org/jira/browse/SPARK-8659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Linux Reporter: Premchandra Preetham Kukillaya It seems like while pointing JDBC/ODBC Driver to Spark SQLThrift Service ,the Hive's security feature SQL based authorisation is not working whereas SQL based Authorisation works when i am pointing the JDBC Driver to ThriftCLIService provided by HiveServer2. But we need to use Spark SQL Thrift Service as we require to use Spark SQL with Tableau The problem is user X can do select on table belonging to user Y, though permission for table is explicitly defined I am using Hive .13.1 and Spark 1.3.1 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --hiveconf hostname.compute.amazonaws.com --hiveconf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator --hiveconf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory --hiveconf hive.server2.enable.doAs=false --hiveconf hive.security.authorization.enabled=true --hiveconf mapred.reduce.tasks=-1 --hiveconf mapred.max.split.size=25600 --hiveconf hive.downloaded.resources.dir=/mnt/var/lib/hive/downloaded_resources --hiveconf
[jira] [Commented] (SPARK-8663) Driver will hang if a job is submitted during the SparkContext stop interval
[ https://issues.apache.org/jira/browse/SPARK-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602987#comment-14602987 ] yuemeng commented on SPARK-8663: the driver log like: 15/06/25 23:16:16 INFO DAGScheduler: Executor lost: 1 (epoch 1) 15/06/25 23:16:16 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/06/25 23:16:16 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(1, 9.96.1.223, 23577) 15/06/25 23:16:16 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor 15/06/25 23:16:45 ERROR ContextCleaner: Error cleaning broadcast 3512 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227) at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45) at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66) at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:199) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:150) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:150) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) 15/06/25 23:16:45 INFO DAGScheduler: Stopping DAGScheduler 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Shutting down all executors 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Asking each executor to shut down 15/06/25 23:16:45 INFO DAGScheduler: Job 3555 failed: count at console:18, took 29.811052 s 15/06/25 23:16:45 INFO DAGScheduler: Job 3539 failed: count at console:18, took 30.089501 s 15/06/25 23:16:45 INFO DAGScheduler: Job 3553 failed: count at console:18, took 29.842839 s 15/06/25 23:16:45 WARN BlockManagerMaster: Failed to remove broadcast 3512 with removeFromMaster = true - Ask timed out on [Actor[akka.tcp://sparkExecutor@DS-222:23604/user/BlockManagerActor1#1981879442]] after [3 ms]} calcFunc start calcFunc start 15/06/25 23:16:45 INFO DAGScheduler: Job 3554 failed: count at console:18, took 29.827635 s 15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18 15/06/25 23:16:45 INFO SparkContext: Starting job: count at console:18 15/06/25 23:16:45 INFO YarnClientSchedulerBackend: Stopped 15/06/25 23:16:45 WARN 
Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkYarnAM@DS-222:23129]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: DS-222/9.96.1.222:23129 15/06/25 23:16:46 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 15/06/25 23:16:46 INFO MemoryStore: MemoryStore cleared 15/06/25 23:16:46 INFO BlockManager: BlockManager stopped 15/06/25 23:16:46 INFO BlockManagerMaster: BlockManagerMaster stopped 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/06/25 23:16:46 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/06/25 23:16:46 INFO SparkContext: Successfully stopped SparkContext and the driver Thread dump like: ForkJoinPool-3-worker-3 daemon prio=10 tid=0x00991000 nid=0x3dab waiting on condition [0x7fc9507dd000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0xfe9ea670 (a
[jira] [Commented] (SPARK-8662) [SparkR] SparkSQL tests fail in R 3.2
[ https://issues.apache.org/jira/browse/SPARK-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602989#comment-14602989 ] Chris Freeman commented on SPARK-8662: -- PR here: https://github.com/apache/spark/pull/7045 [SparkR] SparkSQL tests fail in R 3.2 - Key: SPARK-8662 URL: https://issues.apache.org/jira/browse/SPARK-8662 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Chris Freeman Fix For: 1.4.0 SparkR tests for equality using `all.equal` on environments fail in R 3.2. This is due to a change in how equality between environments is handled in the new version of R. This should most likely not be a huge problem, we'll just have to rewrite some of the tests to be more fine-grained instead of testing equality on entire environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8409) In Windows, unable to read .csv or .json files using read.df()
[ https://issues.apache.org/jira/browse/SPARK-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603055#comment-14603055 ] Arun commented on SPARK-8409: - http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-td23489.html Another link I posted In windows cant able to read .csv or .json files using read.df() - Key: SPARK-8409 URL: https://issues.apache.org/jira/browse/SPARK-8409 Project: Spark Issue Type: Bug Components: SparkR, Windows Affects Versions: 1.4.0 Environment: sparkR API Reporter: Arun Priority: Critical Hi, In SparkR shell, I invoke: mydf-read.df(sqlContext, /home/esten/ami/usaf.json, source=json, header=false) I have tried various filetypes (csv, txt), all fail. in sparkR of spark 1.4 for eg.) df_1- read.df(sqlContext, E:/setup/spark-1.4.0-bin-hadoop2.6/spark-1.4.0-bin-hadoop2.6/examples/src/main/resources/nycflights13.csv, source = csv) RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE: 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB) 15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556 15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB) 15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB) 15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2 15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json at
[jira] [Resolved] (SPARK-4609) Job cannot finish if there is one bad slave in the cluster
[ https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4609. -- Resolution: Duplicate Job can not finish if there is one bad slave in clusters Key: SPARK-4609 URL: https://issues.apache.org/jira/browse/SPARK-4609 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu If there is one bad machine in the cluster, the executor will keep die (such as out of space in the disk), some task may be scheduled to this machines multiple times, then the job will failed after several failures of one task. {code} 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes) 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 lost) 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes) 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 lost) 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes) 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 lost) 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes) 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost) org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} The task should not be scheduled to a machines for more than one times. Also, if one machine failed with executor lost, it should be put in black list for some time, then try again. cc [~kayousterhout] [~matei] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8424) Add blacklist mechanism for task scheduler and Yarn container allocation
[ https://issues.apache.org/jira/browse/SPARK-8424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603995#comment-14603995 ] Sean Owen commented on SPARK-8424: -- [~jerryshao] I don't think this umbrella JIRA is useful as its description is exactly the union of its two children. Let's close it and leave the children. Add blacklist mechanism for task scheduler and Yarn container allocation Key: SPARK-8424 URL: https://issues.apache.org/jira/browse/SPARK-8424 Project: Spark Issue Type: New Feature Components: Scheduler, YARN Affects Versions: 1.4.0 Reporter: Saisai Shao Previously MapReduce has a blacklist and graylist to exclude some constantly failed TaskTrackers/nodes, it is important for a large cluster to alleviate the problem of increasing chance of hardware and software failure. Unfortunately current version of Spark lacks such mechanism to blacklist some constantly failed executors/nodes. The only blacklist mechanism in Spark is to avoid relaunching the task on the same executor when this task is previously failed on this executor within specified time. So here propose a new feature to add blacklist mechanism for Spark, this proposal is divided into two sub-tasks: 1. Add a heuristic blacklist algorithm to track the status of executors by the status of finished tasks, and enable blacklist mechanism in tasking scheduling. 2. Enable blacklist mechanism in YARN container allocation (avoid allocating containers on the blacklist hosts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
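For illustration only (this is not Spark's implementation), the mechanism the proposal describes amounts to a time-based blacklist: record failures per host and skip hosts whose failure count exceeds a threshold until an expiry time passes. A minimal sketch under those assumptions:
{code:scala}
// Illustrative sketch only: a time-based host blacklist of the kind the proposal
// describes for task scheduling and YARN container allocation.
import scala.collection.mutable

class HostBlacklist(maxFailures: Int, blacklistMillis: Long) {
  private val failures = mutable.Map.empty[String, Int]
  private val blacklistedUntil = mutable.Map.empty[String, Long]

  def recordFailure(host: String, now: Long = System.currentTimeMillis()): Unit = {
    val n = failures.getOrElse(host, 0) + 1
    failures(host) = n
    if (n >= maxFailures) blacklistedUntil(host) = now + blacklistMillis
  }

  def isBlacklisted(host: String, now: Long = System.currentTimeMillis()): Boolean =
    blacklistedUntil.get(host).exists { until =>
      if (now < until) true
      else { blacklistedUntil.remove(host); failures.remove(host); false }  // expire and retry the host
    }
}
{code}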
[jira] [Resolved] (SPARK-8639) Instructions for executing jekyll in docs/README.md could be slightly more clear, typo in docs/api.md
[ https://issues.apache.org/jira/browse/SPARK-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8639. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Issue resolved by pull request 7046 [https://github.com/apache/spark/pull/7046] Instructions for executing jekyll in docs/README.md could be slightly more clear, typo in docs/api.md - Key: SPARK-8639 URL: https://issues.apache.org/jira/browse/SPARK-8639 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Rosstin Murphy Priority: Trivial Fix For: 1.4.1, 1.5.0 In docs/README.md, the text states around line 31 Execute 'jekyll' from the 'docs/' directory. Compiling the site with Jekyll will create a directory called '_site' containing index.html as well as the rest of the compiled files. It might be more clear if we said Execute 'jekyll build' from the 'docs/' directory to compile the site. Compiling the site with Jekyll will create a directory called '_site' containing index.html as well as the rest of the compiled files. In docs/api.md: Here you can API docs for Spark and its submodules. should be something like: Here you can read API docs for Spark and its submodules. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603220#comment-14603220 ] Thomas Graves commented on SPARK-1476: -- we have a lot of JIRAs about the 2G limit. I'm going to dup this to the umbrella JIRA https://issues.apache.org/jira/browse/SPARK-6235 If someone thinks something is missing from that, let's add another item there. 2GB limit in spark for blocks - Key: SPARK-1476 URL: https://issues.apache.org/jira/browse/SPARK-1476 Project: Spark Issue Type: Improvement Components: Spark Core Environment: all Reporter: Mridul Muralidharan Priority: Critical Attachments: 2g_fix_proposal.pdf The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB. This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API allows for long), ser-deser via byte-array-backed output streams (SPARK-1391), etc. This is a severe limitation for use of Spark on non-trivial datasets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
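For context, the 2GB ceiling follows from java.nio.ByteBuffer being indexed by Int. A minimal sketch of the limitation and of the usual chunking workaround (the chunk size below is an arbitrary example):
{code:scala}
// The Int-indexed ByteBuffer API caps any single buffer at Int.MaxValue bytes (~2GB).
import java.nio.ByteBuffer

// ByteBuffer.allocate takes an Int, so this is the largest single buffer that can even be expressed:
// ByteBuffer.allocate(Int.MaxValue)

// Sketch of the usual workaround: represent a large block as a sequence of smaller chunks.
def chunked(totalSize: Long, chunkSize: Int = 64 * 1024 * 1024): Seq[ByteBuffer] = {
  require(chunkSize > 0)
  val numChunks = ((totalSize + chunkSize - 1) / chunkSize).toInt
  (0 until numChunks).map { i =>
    val size = math.min(chunkSize.toLong, totalSize - i.toLong * chunkSize).toInt
    ByteBuffer.allocate(size)
  }
}
{code}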
[jira] [Commented] (SPARK-8405) Show executor logs on Web UI when Yarn log aggregation is enabled
[ https://issues.apache.org/jira/browse/SPARK-8405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603269#comment-14603269 ] Hari Shreedharan commented on SPARK-8405: - Actually I think this is a config issue on your YARN cluster. See: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/HBGzj_NG9_s and http://stackoverflow.com/questions/24076192/yarn-jobhistory-error-failed-redirect-for-container-140026075-3309-01-0 Show executor logs on Web UI when Yarn log aggregation is enabled - Key: SPARK-8405 URL: https://issues.apache.org/jira/browse/SPARK-8405 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Attachments: SparkLogError.png When running a Spark application in YARN mode with YARN log aggregation enabled, the customer is not able to view executor logs on the history server Web UI. The only way for the customer to view the logs is through the YARN command yarn logs -applicationId appId. A screenshot of the error is attached; you will see it when you click an executor's log link on the Spark history server while YARN log aggregation is enabled. The log URL redirects the user to the node manager's UI, which works if the logs are still located on that node. But since log aggregation is enabled, the local logs are deleted once aggregation completes. The logs should be available through the web UIs, just like for other Hadoop components such as MapReduce. For security reasons, end users may not be able to log into the nodes and run the yarn logs -applicationId command. The web UIs can be made viewable and exposed through the firewall if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603263#comment-14603263 ] Juliet Hougland edited comment on SPARK-8646 at 6/26/15 5:35 PM: - Results from pi-test are uploaded in the attachment pi-test.log. There is still a missing-module error; this time it is pandas.algo. was (Author: juliet): Results from pu-test.log PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set.log Running PySpark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869] This is not a backwards-compatible change to Spark: scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8667) Improve Spark UI behavior at scale
Patrick Wendell created SPARK-8667: -- Summary: Improve Spark UI behavior at scale Key: SPARK-8667 URL: https://issues.apache.org/jira/browse/SPARK-8667 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Assignee: Shixiong Zhu This is a parent ticket; we can create child tickets when solving specific issues. The main problem I would like to solve is that the Spark UI has issues at very large scale. The worst case is a stage page with more than a few thousand tasks. In this case: 1. The page itself is very slow to load and becomes unresponsive with a huge number of tasks. 2. The Scala XML output can become so large that it crashes the driver program with an OOM for a page with a huge number of tasks. I am not sure whether (1) is caused by JavaScript slowness or just the raw amount of data sent over the wire. If it is the latter, it might be possible to add compression to the HTTP payload to help improve load time. It would be nice to reproduce and investigate these issues further and create specific sub-tasks to improve them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
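To make issue (2) concrete, here is a hedged Scala sketch, not the actual Spark UI code, of how paginating the task table bounds the size of the Scala XML tree the server builds per request: only one page of rows is ever turned into XML nodes, no matter how many tasks the stage has. TaskRow, renderTaskTable, page and pageSize are illustrative names.
{code}
// Illustrative sketch only: Spark's real UI classes and render path differ.
case class TaskRow(taskId: Long, status: String, durationMs: Long)

def renderTaskTable(tasks: Seq[TaskRow], page: Int, pageSize: Int = 100): scala.xml.Elem = {
  // Materialize XML only for the requested slice instead of all (possibly 100k+) tasks,
  // so the in-memory node tree and the HTTP response stay bounded.
  val visible = tasks.slice(page * pageSize, (page + 1) * pageSize)
  <table>
    { visible.map { t =>
        <tr><td>{ t.taskId }</td><td>{ t.status }</td><td>{ t.durationMs } ms</td></tr>
      } }
  </table>
}

// Usage: renderTaskTable(allTasks, page = 0) renders at most 100 rows.
{code}
Gzip-compressing the HTTP payload, as suggested for issue (1), would be complementary: it shrinks what goes over the wire but does not reduce the XML node tree built on the server.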
[jira] [Updated] (SPARK-8667) Improve Spark UI behavior at scale
[ https://issues.apache.org/jira/browse/SPARK-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8667: --- Component/s: Web UI Improve Spark UI behavior at scale -- Key: SPARK-8667 URL: https://issues.apache.org/jira/browse/SPARK-8667 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Patrick Wendell Assignee: Shixiong Zhu This is a parent ticket; we can create child tickets when solving specific issues. The main problem I would like to solve is that the Spark UI has issues at very large scale. The worst case is a stage page with more than a few thousand tasks. In this case: 1. The page itself is very slow to load and becomes unresponsive with a huge number of tasks. 2. The Scala XML output can become so large that it crashes the driver program with an OOM for a page with a huge number of tasks. I am not sure whether (1) is caused by JavaScript slowness or just the raw amount of data sent over the wire. If it is the latter, it might be possible to add compression to the HTTP payload to help improve load time. It would be nice to reproduce and investigate these issues further and create specific sub-tasks to improve them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603302#comment-14603302 ] Alok Singh edited comment on SPARK-8647 at 6/26/15 5:53 PM: Hi Xiangrui, 1. Same instances: In that case, why not use a Scala object to get a singleton? Is it because MatrixUDT is used in PySpark, which might work better with a class type than an object type? Also, in Java we would have an extra $ at the end for the object. But if the goal is to have the same instance, wouldn't it be nicer to define hashCode as override def hashCode(): Int = org.apache.spark.mllib.linalg.MatrixUDT.hashCode()? What are your thoughts? 2. Performance: I think in the MatrixUDT case this won't be a problem, as there won't be many classes similar to MatrixUDT whose constant hashCode is also 1994. I was referring to http://java-performance.info/hashcode-method-performance-tuning/ However, if we use the solution from the Same Instances section above, we may not have this issue. Summary: for practical purposes this won't be a performance issue, but I think it would be nicer, from an aesthetic perspective, to use the same-instance approach, i.e. [org.apache.spark.mllib.linalg.hashCode()], if we can't use a Scala object. Please suggest whether I should just change the code docs to explain the reason, or make the change described in 1. above. thanks Alok was (Author: aloknsingh): Hi Xiangrui, 1. Same instances: In that case, why not use a Scala object to get a singleton? Is it because MatrixUDT is used in PySpark, which might work better with a class type than an object type? Also, in Java we would have an extra $ at the end for the object. But if the goal is to have the same instance, wouldn't it be nicer to define hashCode as override def hashCode(): Int = org.apache.spark.mllib.linalg.MatrixUDT.hashCode()? What are your thoughts? 2. Performance: I think in the MatrixUDT case this won't be a problem, as there won't be many classes similar to MatrixUDT whose constant hashCode is also 1994. I was referring to http://java-performance.info/hashcode-method-performance-tuning/ However, if we use the solution from the Same Instances section above, we may not have this issue. Summary: for practical purposes this won't be a performance issue, but I think it would be nicer, from an aesthetic perspective, to use the same-instance approach, if we can't use a Scala object. Please suggest whether I should just change the code docs to explain the reason, or make the change described in 1. above. thanks Alok Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just a matter of code docs. The issue concerns the MatrixUDT class, if we decide to put instances of MatrixUDT into a hash-based collection. The hashCode function returns a constant, and although the equals method is consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason for this code, we should document it properly in the code so that others reading it understand. 
regards, Alok Details: a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala, lines 188-197, i.e. override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603302#comment-14603302 ] Alok Singh edited comment on SPARK-8647 at 6/26/15 5:53 PM: Hi Xiangrui, 1. Same instances: In that case, why not use a Scala object to get a singleton? Is it because MatrixUDT is used in PySpark, which might work better with a class type than an object type? Also, in Java we would have an extra $ at the end for the object. But if the goal is to have the same instance, wouldn't it be nicer to define hashCode as override def hashCode(): Int = org.apache.spark.mllib.linalg.MatrixUDT.hashCode()? What are your thoughts? 2. Performance: I think in the MatrixUDT case this won't be a problem, as there won't be many classes similar to MatrixUDT whose constant hashCode is also 1994. I was referring to http://java-performance.info/hashcode-method-performance-tuning/ However, if we use the solution from the Same Instances section above, we may not have this issue. Summary: for practical purposes this won't be a performance issue, but I think it would be nicer, from an aesthetic perspective, to use the same-instance approach, i.e. [org.apache.spark.mllib.linalg.MatrixUDT.hashCode()], if we can't use a Scala object. Please suggest whether I should just change the code docs to explain the reason, or make the change described in 1. above. thanks Alok was (Author: aloknsingh): Hi Xiangrui, 1. Same instances: In that case, why not use a Scala object to get a singleton? Is it because MatrixUDT is used in PySpark, which might work better with a class type than an object type? Also, in Java we would have an extra $ at the end for the object. But if the goal is to have the same instance, wouldn't it be nicer to define hashCode as override def hashCode(): Int = org.apache.spark.mllib.linalg.MatrixUDT.hashCode()? What are your thoughts? 2. Performance: I think in the MatrixUDT case this won't be a problem, as there won't be many classes similar to MatrixUDT whose constant hashCode is also 1994. I was referring to http://java-performance.info/hashcode-method-performance-tuning/ However, if we use the solution from the Same Instances section above, we may not have this issue. Summary: for practical purposes this won't be a performance issue, but I think it would be nicer, from an aesthetic perspective, to use the same-instance approach, i.e. [org.apache.spark.mllib.linalg.hashCode()], if we can't use a Scala object. Please suggest whether I should just change the code docs to explain the reason, or make the change described in 1. above. thanks Alok Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just a matter of code docs. The issue concerns the MatrixUDT class, if we decide to put instances of MatrixUDT into a hash-based collection. The hashCode function returns a constant, and although the equals method is consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason for this code, we should document it properly in the code so that others reading it understand. 
regards, Alok Details: a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala, lines 188-197, i.e. override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
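Since the discussion above centers on what a non-constant-literal, equals-consistent hashCode could look like, here is a hedged Scala sketch of the alternative being argued for. The class below is a stripped-down stand-in (it does not extend Spark's UserDefinedType), and deriving the hash from the class name is one illustrative option, not necessarily the change MLlib adopted.
{code}
// Stand-in for the MLlib class; illustrative only.
class MatrixUDT {
  // equals: any two MatrixUDT instances are considered equal (instance-of check).
  override def equals(o: Any): Boolean = o match {
    case _: MatrixUDT => true
    case _            => false
  }

  // Still identical across instances (so it stays consistent with equals),
  // but derived from the class name instead of the magic literal 1994.
  override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
}
{code}
Any value that is the same for all instances preserves the equals/hashCode contract; the aesthetic argument in the comment is only about where that value comes from.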