[jira] [Commented] (SPARK-8621) crosstab exception when one of the values is empty
[ https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607904#comment-14607904 ]

Apache Spark commented on SPARK-8621: User 'animeshbaranawal' has created a pull request for this issue: https://github.com/apache/spark/pull/7117

crosstab exception when one of the values is empty
Key: SPARK-8621 URL: https://issues.apache.org/jira/browse/SPARK-8621 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Critical

I think this happened because some value is empty.

{code}
scala> df1.stat.crosstab("role", "lang")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
  at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
  at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
  at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
  at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
{code}
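A minimal reproduction sketch for the 1.4 shell: the column names "role" and "lang" come from the report, but the data itself is made up. An empty string among the values becomes an empty result-column name, which DataFrameNaFunctions.fill then fails to resolve.

{code}
val df1 = sqlContext.createDataFrame(Seq(
  ("engineer", "scala"),
  ("", "java")          // the empty value that triggers the exception
)).toDF("role", "lang")
df1.stat.crosstab("role", "lang")  // throws the AnalysisException shown above
{code}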
[jira] [Commented] (SPARK-8552) Using incorrect database in multiple sessions
[ https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607910#comment-14607910 ]

Apache Spark commented on SPARK-8552: User 'navis' has created a pull request for this issue: https://github.com/apache/spark/pull/7118

Using incorrect database in multiple sessions
Key: SPARK-8552 URL: https://issues.apache.org/jira/browse/SPARK-8552 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Yi Tian Priority: Critical

To reproduce this problem:
1. Start the thrift server:
{quote}sbin/start-thriftserver.sh{quote}
2. In a first connection, execute "use test":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "use test"{quote}
3. In a second connection, execute "show tables":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "show tables"{quote}
4. The result lists the tables of the {{test}} database, even though the second session never switched to it.
[jira] [Assigned] (SPARK-8552) Using incorrect database in multiple sessions
[ https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8552: Assignee: (was: Apache Spark)

Using incorrect database in multiple sessions
Key: SPARK-8552 URL: https://issues.apache.org/jira/browse/SPARK-8552 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Yi Tian Priority: Critical

To reproduce this problem:
1. Start the thrift server:
{quote}sbin/start-thriftserver.sh{quote}
2. In a first connection, execute "use test":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "use test"{quote}
3. In a second connection, execute "show tables":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "show tables"{quote}
4. The result lists the tables of the {{test}} database, even though the second session never switched to it.
[jira] [Assigned] (SPARK-8552) Using incorrect database in multiple sessions
[ https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8552: Assignee: Apache Spark

Using incorrect database in multiple sessions
Key: SPARK-8552 URL: https://issues.apache.org/jira/browse/SPARK-8552 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.0 Reporter: Yi Tian Assignee: Apache Spark Priority: Critical

To reproduce this problem:
1. Start the thrift server:
{quote}sbin/start-thriftserver.sh{quote}
2. In a first connection, execute "use test":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "use test"{quote}
3. In a second connection, execute "show tables":
{quote}bin/beeline -u jdbc:hive2://localhost:10000/default -n any -p any -e "show tables"{quote}
4. The result lists the tables of the {{test}} database, even though the second session never switched to it.
[jira] [Created] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue
Keuntae Park created SPARK-8728:
Summary: Add configuration for limiting the maximum number of active stages in a fair scheduling queue
Key: SPARK-8728 URL: https://issues.apache.org/jira/browse/SPARK-8728 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Keuntae Park Priority: Minor

Currently, all TaskSetManagers in a fair queue are scheduled concurrently. This may harm the interactivity of every job when the number of queued jobs becomes large. I think it would be useful to add a configuration option similar to YARN's 'maxRunningApps'.
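To make the interactivity concern concrete, here is an illustrative sketch of the scenario using the standard 1.4 API (the pool name and job sizes are made up): many jobs land in one fair pool concurrently, so all of their TaskSetManagers compete for slots at once and no single job sees snappy latency. A per-pool cap on active stages, as proposed, would bound this.

{code}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

(1 to 50).foreach { _ =>
  Future {
    // submit each job into the same fair pool from its own thread
    sc.setLocalProperty("spark.scheduler.pool", "interactive")
    sc.parallelize(1 to 1000000, numSlices = 100).count()
  }
}
{code}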
[jira] [Commented] (SPARK-8041) Consistently pass SparkR library directory to SparkR application
[ https://issues.apache.org/jira/browse/SPARK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607877#comment-14607877 ]

Sun Rui commented on SPARK-8041: Sorry, this JIRA is obsolete, as we are addressing it in SPARK-6797. You can take a look at that issue.

Consistently pass SparkR library directory to SparkR application
Key: SPARK-8041 URL: https://issues.apache.org/jira/browse/SPARK-8041 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui

The SparkR package library directory path (RLibDir) is needed by SparkR applications for loading the SparkR package and locating R helper files inside the package. Currently, there are several places where the RLibDir needs to be specified. First of all, when you program a SparkR application, sparkR.init() allows you to pass an RLibDir parameter (by default, it is the same as the SparkR package's libname on the driver host). However, it does not seem reasonable to hard-code RLibDir in a program. Instead, it would be more flexible to pass RLibDir via the command line or an env variable. Additionally, for YARN cluster mode, RRunner depends on the SPARK_HOME env variable to get the RLibDir (assumed to be $SPARK_HOME/R/lib). So it would be better to define a consistent way to pass RLibDir to a SparkR application in all deployment modes. It could be a command line option for bin/sparkR or an env variable. It can be passed to a SparkR application, and we can remove the RLibDir parameter of sparkR.init(). In YARN cluster mode, it can be passed to the AM using the spark.yarn.appMasterEnv.[EnvironmentVariableName] configuration option.
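A hypothetical sketch of the YARN cluster mode suggestion above: the env var name SPARKR_RLIB_DIR and the path are illustrative, not an existing Spark setting, while spark.yarn.appMasterEnv.* itself is the real mechanism for setting AM environment variables.

{code}
import org.apache.spark.SparkConf

// forward the (hypothetical) SparkR library dir env var to the YARN AM
val conf = new SparkConf()
  .set("spark.yarn.appMasterEnv.SPARKR_RLIB_DIR", "/opt/spark/R/lib")
{code}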
[jira] [Assigned] (SPARK-8621) crosstab exception when one of the values is empty
[ https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8621: Assignee: (was: Apache Spark)

crosstab exception when one of the values is empty
Key: SPARK-8621 URL: https://issues.apache.org/jira/browse/SPARK-8621 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Critical

I think this happened because some value is empty.

{code}
scala> df1.stat.crosstab("role", "lang")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
  at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
  at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
  at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
  at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
{code}
[jira] [Assigned] (SPARK-8621) crosstab exception when one of the values is empty
[ https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8621: Assignee: Apache Spark

crosstab exception when one of the values is empty
Key: SPARK-8621 URL: https://issues.apache.org/jira/browse/SPARK-8621 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical

I think this happened because some value is empty.

{code}
scala> df1.stat.crosstab("role", "lang")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
  at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
  at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
  at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
  at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
  at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
  at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
  at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
{code}
[jira] [Updated] (SPARK-8717) Update mllib-data-types docs to include missing matrix Python examples
[ https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8717: Component/s: PySpark, Documentation

[~Rosstin] Please set components.

Update mllib-data-types docs to include missing matrix Python examples
Key: SPARK-8717 URL: https://issues.apache.org/jira/browse/SPARK-8717 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Reporter: Rosstin Murphy Priority: Minor

Currently, the documentation for MLlib Data Types (docs/mllib-data-types.md in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in the latest online docs) stops listing Python examples after "Labeled point". "Local vector" and "Labeled point" have Python examples; however, none of the matrix entries have Python examples. The matrix entries could be updated to include Python examples. I'm not 100% sure that all the matrices currently have implemented Python equivalents, but I'm pretty sure that at least the first one ("Local matrix") could have an entry:

from pyspark.mllib.linalg import DenseMatrix
dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
[jira] [Commented] (SPARK-8560) The Executors page will have negative if having resubmitted tasks
[ https://issues.apache.org/jira/browse/SPARK-8560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607889#comment-14607889 ]

Sean Owen commented on SPARK-8560: Please fix the title.

The Executors page will have negative if having resubmitted tasks
Key: SPARK-8560 URL: https://issues.apache.org/jira/browse/SPARK-8560 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: meiyoula Attachments: screenshot-1.png
[jira] [Assigned] (SPARK-8271) string function: soundex
[ https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8271: Assignee: Apache Spark (was: Cheng Hao)

string function: soundex
Key: SPARK-8271 URL: https://issues.apache.org/jira/browse/SPARK-8271 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark

soundex(string A): string
Returns soundex code of the string. For example, soundex('Miller') results in M460.
[jira] [Commented] (SPARK-8271) string function: soundex
[ https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607899#comment-14607899 ]

Apache Spark commented on SPARK-8271: User 'HuJiayin' has created a pull request for this issue: https://github.com/apache/spark/pull/7115

string function: soundex
Key: SPARK-8271 URL: https://issues.apache.org/jira/browse/SPARK-8271 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao

soundex(string A): string
Returns soundex code of the string. For example, soundex('Miller') results in M460.
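For reference, a minimal sketch of the classic soundex algorithm this sub-task asks for; it is not Spark's implementation (see the linked pull request), ignores the H/W edge cases of the full algorithm, and assumes a non-empty alphabetic input.

{code}
def soundex(s: String): String = {
  val code = Map('B' -> '1', 'F' -> '1', 'P' -> '1', 'V' -> '1',
    'C' -> '2', 'G' -> '2', 'J' -> '2', 'K' -> '2', 'Q' -> '2',
    'S' -> '2', 'X' -> '2', 'Z' -> '2', 'D' -> '3', 'T' -> '3',
    'L' -> '4', 'M' -> '5', 'N' -> '5', 'R' -> '6')
  val upper = s.toUpperCase
  val digits = upper.map(c => code.getOrElse(c, '0'))  // vowels and ignored letters become '0'
  // collapse adjacent duplicate codes, drop the first letter's own code,
  // remove the '0' placeholders, then pad/truncate to 4 characters
  val dedup = digits.foldLeft("")((acc, c) => if (acc.nonEmpty && acc.last == c) acc else acc + c)
  (upper.head + dedup.drop(1).filter(_ != '0')).padTo(4, '0').take(4)
}

soundex("Miller")  // "M460", matching the example above
{code}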
[jira] [Assigned] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue
[ https://issues.apache.org/jira/browse/SPARK-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8728: Assignee: Apache Spark

Add configuration for limiting the maximum number of active stages in a fair scheduling queue
Key: SPARK-8728 URL: https://issues.apache.org/jira/browse/SPARK-8728 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Keuntae Park Assignee: Apache Spark Priority: Minor

Currently, all TaskSetManagers in a fair queue are scheduled concurrently. This may harm the interactivity of every job when the number of queued jobs becomes large. I think it would be useful to add a configuration option similar to YARN's 'maxRunningApps'.
[jira] [Commented] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue
[ https://issues.apache.org/jira/browse/SPARK-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607985#comment-14607985 ]

Apache Spark commented on SPARK-8728: User 'sirpkt' has created a pull request for this issue: https://github.com/apache/spark/pull/7119

Add configuration for limiting the maximum number of active stages in a fair scheduling queue
Key: SPARK-8728 URL: https://issues.apache.org/jira/browse/SPARK-8728 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Keuntae Park Priority: Minor

Currently, all TaskSetManagers in a fair queue are scheduled concurrently. This may harm the interactivity of every job when the number of queued jobs becomes large. I think it would be useful to add a configuration option similar to YARN's 'maxRunningApps'.
[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction
[ https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607994#comment-14607994 ]

Sebastian Alfers commented on SPARK-7334: [~josephkb] any progress on this one?

Implement RandomProjection for Dimensionality Reduction
Key: SPARK-7334 URL: https://issues.apache.org/jira/browse/SPARK-7334 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sebastian Alfers Priority: Minor

Implement RandomProjection (RP) for dimensionality reduction. RP is a popular approach to reducing the amount of data while preserving a reasonable amount of information (pairwise distances) about your data [1][2].
[1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
[2] http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
I compared different implementations of that algorithm: https://github.com/sebastian-alfers/random-projection-python
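A sketch of the random-projection idea using existing 1.4 MLlib primitives (this is not the proposed API; dimensions and data are made up): project d-dimensional rows onto k dimensions with a random Gaussian matrix, which approximately preserves pairwise distances per the Johnson-Lindenstrauss lemma.

{code}
import scala.util.Random
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val d = 100
val k = 10
val rows = sc.parallelize(Seq.fill(1000)(Vectors.dense(Array.fill(d)(Random.nextDouble()))))
val mat = new RowMatrix(rows)
// d x k projection matrix with entries drawn from N(0, 1/k)
val proj = Matrices.dense(d, k, Array.fill(d * k)(Random.nextGaussian() / math.sqrt(k)))
val reduced = mat.multiply(proj)  // a RowMatrix with k columns
{code}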
[jira] [Commented] (SPARK-7402) JSON serialization of params
[ https://issues.apache.org/jira/browse/SPARK-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607890#comment-14607890 ]

Sean Owen commented on SPARK-7402: Is this still critical and targeted for 1.4.1 now that the RC is in progress?

JSON serialization of params
Key: SPARK-7402 URL: https://issues.apache.org/jira/browse/SPARK-7402 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical

Add JSON support to Param in order to persist parameters with transformers, estimators, and models.
[jira] [Assigned] (SPARK-2505) Weighted Regularizer
[ https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-2505: Assignee: Apache Spark

Weighted Regularizer
Key: SPARK-2505 URL: https://issues.apache.org/jira/browse/SPARK-2505 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai Assignee: Apache Spark

The current implementation of regularization in the linear models uses `Updater`, and this design has a couple of issues:
1) It penalizes all the weights, including the intercept. In a typical machine learning training process, people don't penalize the intercept.
2) The `Updater` contains the adaptive step size logic for gradient descent, and we would like to clean this up by separating the regularization logic out of the updater into a regularizer, so that in the LBFGS optimizer we don't need the trick for getting the loss and gradient of the objective function.
In this work, a weighted regularizer will be implemented, and users can exclude the intercept or any weight from regularization by setting that term's penalty weight to zero. Since the regularizer will return a tuple of loss and gradient, the adaptive step size logic and the soft thresholding for L1 in Updater will be moved to the SGD optimizer.
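A conceptual sketch of the proposed weighted L2 regularizer (illustrative only, not Spark's API): a per-coefficient penalty weight lets callers exclude the intercept, or any other term, by assigning it weight 0, and the function returns the (loss, gradient) tuple the description mentions.

{code}
def weightedL2(coefficients: Array[Double],
               penaltyWeights: Array[Double],
               regParam: Double): (Double, Array[Double]) = {
  var loss = 0.0
  val gradient = new Array[Double](coefficients.length)
  for (i <- coefficients.indices) {
    // each term's contribution is scaled by its own penalty weight
    loss += 0.5 * regParam * penaltyWeights(i) * coefficients(i) * coefficients(i)
    gradient(i) = regParam * penaltyWeights(i) * coefficients(i)
  }
  (loss, gradient)
}

// e.g. penaltyWeights = Array(1.0, 1.0, 0.0) leaves the last (intercept) term unpenalized
{code}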
[jira] [Assigned] (SPARK-2505) Weighted Regularizer
[ https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-2505: Assignee: (was: Apache Spark)

Weighted Regularizer
Key: SPARK-2505 URL: https://issues.apache.org/jira/browse/SPARK-2505 Project: Spark Issue Type: New Feature Components: MLlib Reporter: DB Tsai

The current implementation of regularization in the linear models uses `Updater`, and this design has a couple of issues:
1) It penalizes all the weights, including the intercept. In a typical machine learning training process, people don't penalize the intercept.
2) The `Updater` contains the adaptive step size logic for gradient descent, and we would like to clean this up by separating the regularization logic out of the updater into a regularizer, so that in the LBFGS optimizer we don't need the trick for getting the loss and gradient of the objective function.
In this work, a weighted regularizer will be implemented, and users can exclude the intercept or any weight from regularization by setting that term's penalty weight to zero. Since the regularizer will return a tuple of loss and gradient, the adaptive step size logic and the soft thresholding for L1 in Updater will be moved to the SGD optimizer.
[jira] [Closed] (SPARK-8041) Consistently pass SparkR library directory to SparkR application
[ https://issues.apache.org/jira/browse/SPARK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Rui closed SPARK-8041: Resolution: Duplicate

This issue is covered by SPARK-6797.

Consistently pass SparkR library directory to SparkR application
Key: SPARK-8041 URL: https://issues.apache.org/jira/browse/SPARK-8041 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui

The SparkR package library directory path (RLibDir) is needed by SparkR applications for loading the SparkR package and locating R helper files inside the package. Currently, there are several places where the RLibDir needs to be specified. First of all, when you program a SparkR application, sparkR.init() allows you to pass an RLibDir parameter (by default, it is the same as the SparkR package's libname on the driver host). However, it does not seem reasonable to hard-code RLibDir in a program. Instead, it would be more flexible to pass RLibDir via the command line or an env variable. Additionally, for YARN cluster mode, RRunner depends on the SPARK_HOME env variable to get the RLibDir (assumed to be $SPARK_HOME/R/lib). So it would be better to define a consistent way to pass RLibDir to a SparkR application in all deployment modes. It could be a command line option for bin/sparkR or an env variable. It can be passed to a SparkR application, and we can remove the RLibDir parameter of sparkR.init(). In YARN cluster mode, it can be passed to the AM using the spark.yarn.appMasterEnv.[EnvironmentVariableName] configuration option.
[jira] [Updated] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-8699: Target Version/s: (was: 1.4.0)

[~kamlesh.kumar] Please first read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before filing a JIRA. Don't set Target Version; in any event, 1.4.0 is already released and is the version you say it affects.

Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
Key: SPARK-8699 URL: https://issues.apache.org/jira/browse/SPARK-8699 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows 7, 64 bit Reporter: Kamlesh Kumar Priority: Critical Labels: test

I can successfully run showDF and head on an rrdd data frame in R, but select commands throw an unexpected error. The R console output after running a select command on the rrdd data object is the following:
Command: head(select(df, df$eruptions))
Output: Error in head(select(df, df$eruptions)) : error in evaluating the argument 'x' in selecting a method for function 'head': Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "DataFrame"
[jira] [Assigned] (SPARK-8271) string function: soundex
[ https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8271: Assignee: Cheng Hao (was: Apache Spark)

string function: soundex
Key: SPARK-8271 URL: https://issues.apache.org/jira/browse/SPARK-8271 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao

soundex(string A): string
Returns soundex code of the string. For example, soundex('Miller') results in M460.
[jira] [Created] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
Tathagata Das created SPARK-8743:
Summary: Deregister Codahale metrics for streaming when StreamingContext is closed
Key: SPARK-8743 URL: https://issues.apache.org/jira/browse/SPARK-8743 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das

Currently, when the StreamingContext is closed, the registered metrics are not deregistered. If another streaming context is started, it throws a warning saying that the metrics are already registered. The solution is to deregister the metrics when the StreamingContext is stopped.
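A hedged sketch of the fix idea using Codahale's own API (the metric name and prefix below are illustrative, not Spark's actual metric keys): register gauges on start, and remove them by prefix on stop so a second context can reuse the same names without the warning.

{code}
import com.codahale.metrics.{Gauge, Metric, MetricFilter, MetricRegistry}

val registry = new MetricRegistry()
registry.register("streaming.lastReceivedBatch_records", new Gauge[Long] {
  override def getValue: Long = 0L
})

// on StreamingContext.stop(): deregister everything under the prefix
registry.removeMatching(new MetricFilter {
  override def matches(name: String, metric: Metric): Boolean =
    name.startsWith("streaming.")
})
{code}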
[jira] [Commented] (SPARK-8529) Set metadata for MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609497#comment-14609497 ]

Joseph K. Bradley commented on SPARK-8529: Here's an example of setting the metadata (but for a NominalAttribute): [https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L135] MinMaxScaler should actually use a NumericAttribute, setting its relevant fields.

Set metadata for MinMaxScaler
Key: SPARK-8529 URL: https://issues.apache.org/jira/browse/SPARK-8529 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor

Adding this as a reminder to complete the output metadata for the MinMaxScaler transformer.
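A sketch of the suggestion in the comment above (the name and bounds are illustrative): build a NumericAttribute carrying the scaler's output range and attach it as the output column's metadata, analogous to StringIndexer's NominalAttribute.

{code}
import org.apache.spark.ml.attribute.NumericAttribute

val attr = NumericAttribute.defaultAttr
  .withName("scaledFeatures")
  .withMin(0.0)
  .withMax(1.0)
val metadata = attr.toMetadata()
// then attach it when producing the output column, e.g.
// dataset.withColumn(outputCol, scaledCol.as(outputCol, metadata))
{code}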
[jira] [Updated] (SPARK-8366) When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
[ https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

meiyoula updated SPARK-8366:
Description: I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the total and pending tasks, because the stage's total task number is only set when the stage is submitted.
was: I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the pending tasks, because the stage's total task number is only set when the stage is submitted.

When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
Key: SPARK-8366 URL: https://issues.apache.org/jira/browse/SPARK-8366 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula

I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the total and pending tasks, because the stage's total task number is only set when the stage is submitted.
[jira] [Updated] (SPARK-8366) When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
[ https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

meiyoula updated SPARK-8366:
Description: I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the pending tasks, because the stage's total task number is only set when the stage is submitted.
was: I use the *dynamic executor allocation* function. Then when one executor is killed, all running tasks on it fail. When the new tasks are appended, the new executor won't be added.

When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
Key: SPARK-8366 URL: https://issues.apache.org/jira/browse/SPARK-8366 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula

I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the pending tasks, because the stage's total task number is only set when the stage is submitted.
[jira] [Updated] (SPARK-8366) When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
[ https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

meiyoula updated SPARK-8366:
Description: I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the *ExecutorAllocationManager* won't count these new tasks toward the total and pending tasks, because the stage's total task number is only set when the stage is submitted.
was: I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the `ExecutorAllocationManager` won't count these new tasks toward the total and pending tasks, because the stage's total task number is only set when the stage is submitted.

When a task fails and a new one is appended, the ExecutorAllocationManager can't sense the new tasks
Key: SPARK-8366 URL: https://issues.apache.org/jira/browse/SPARK-8366 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: meiyoula

I use the *dynamic executor allocation* function. When an executor is killed, all running tasks on it fail. Until maxTaskFailures is reached, each failed task re-runs with a new task id. But the *ExecutorAllocationManager* won't count these new tasks toward the total and pending tasks, because the stage's total task number is only set when the stage is submitted.
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609437#comment-14609437 ]

Neelesh Srinivas Salian commented on SPARK-8743: I would like to work on this JIRA. Could you please assign this to me? Thank you.

Deregister Codahale metrics for streaming when StreamingContext is closed
Key: SPARK-8743 URL: https://issues.apache.org/jira/browse/SPARK-8743 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Labels: starter

Currently, when the StreamingContext is closed, the registered metrics are not deregistered. If another streaming context is started, it throws a warning saying that the metrics are already registered. The solution is to deregister the metrics when the StreamingContext is stopped.
[jira] [Resolved] (SPARK-8727) Add missing python api
[ https://issues.apache.org/jira/browse/SPARK-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-8727: Resolution: Fixed, Fix Version/s: 1.5.0. Issue resolved by pull request 7114 [https://github.com/apache/spark/pull/7114]

Add missing python api
Key: SPARK-8727 URL: https://issues.apache.org/jira/browse/SPARK-8727 Project: Spark Issue Type: Improvement Components: SQL Reporter: Tarek Auel Fix For: 1.5.0

Add the Python API that is missing for:
https://issues.apache.org/jira/browse/SPARK-8248
https://issues.apache.org/jira/browse/SPARK-8234
https://issues.apache.org/jira/browse/SPARK-8217
https://issues.apache.org/jira/browse/SPARK-8215
https://issues.apache.org/jira/browse/SPARK-8212
[jira] [Closed] (SPARK-6892) Recovery from checkpoint will also reuse the application id when writing eventLog in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das closed SPARK-6892: Resolution: Not A Problem

Recovery from checkpoint will also reuse the application id when writing eventLog in yarn-cluster mode
Key: SPARK-6892 URL: https://issues.apache.org/jira/browse/SPARK-6892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu Priority: Critical

When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that it reuses the previous application id (in my case application_1428664056212_0016) when writing the Spark event log. But my application id is now application_1428664056212_0017, so writing the event log fails with the following stack trace:
{code}
15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
  at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
  at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
  at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
  at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
This exception causes the job to fail.
[jira] [Commented] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609399#comment-14609399 ]

Tathagata Das commented on SPARK-8318: I think the starter label is not very easy to find, and most people search by JIRAs. In this way, we get the benefit of both, starter label as well finding a JIRA. Case in point, the subtasks got solved pretty fast.

Spark Streaming Starter JIRAs
Key: SPARK-8318 URL: https://issues.apache.org/jira/browse/SPARK-8318 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Priority: Minor Labels: starter

This is a master JIRA to collect together all starter tasks related to Spark Streaming. These are simple tasks that contributors can do to get familiar with the process of contributing.
[jira] [Commented] (SPARK-8313) Support Spark Packages containing R code with --packages
[ https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609403#comment-14609403 ]

Apache Spark commented on SPARK-8313: User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/7139

Support Spark Packages containing R code with --packages
Key: SPARK-8313 URL: https://issues.apache.org/jira/browse/SPARK-8313 Project: Spark Issue Type: New Feature Components: Spark Submit, SparkR Reporter: Burak Yavuz
[jira] [Assigned] (SPARK-8313) Support Spark Packages containing R code with --packages
[ https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8313: Assignee: (was: Apache Spark)

Support Spark Packages containing R code with --packages
Key: SPARK-8313 URL: https://issues.apache.org/jira/browse/SPARK-8313 Project: Spark Issue Type: New Feature Components: Spark Submit, SparkR Reporter: Burak Yavuz
[jira] [Assigned] (SPARK-8313) Support Spark Packages containing R code with --packages
[ https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8313: Assignee: Apache Spark

Support Spark Packages containing R code with --packages
Key: SPARK-8313 URL: https://issues.apache.org/jira/browse/SPARK-8313 Project: Spark Issue Type: New Feature Components: Spark Submit, SparkR Reporter: Burak Yavuz Assignee: Apache Spark
[jira] [Commented] (SPARK-6990) Add Java linting script
[ https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609509#comment-14609509 ]

Yu Ishikawa commented on SPARK-6990: Could you please assign this issue to me?

Add Java linting script
Key: SPARK-6990 URL: https://issues.apache.org/jira/browse/SPARK-6990 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Josh Rosen Priority: Minor Labels: starter

It would be nice to add a {{dev/lint-java}} script to enforce style rules for Spark's Java code.
[jira] [Commented] (SPARK-3444) Provide a way to easily change the log level in the Spark shell while running
[ https://issues.apache.org/jira/browse/SPARK-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609455#comment-14609455 ]

Apache Spark commented on SPARK-3444: User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7140

Provide a way to easily change the log level in the Spark shell while running
Key: SPARK-3444 URL: https://issues.apache.org/jira/browse/SPARK-3444 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: holdenk Assignee: Holden Karau Priority: Minor Fix For: 1.4.0

Right now it's difficult to change the log level while running. Our log messages can be quite verbose at the more detailed levels, and some users want to run at WARN until they encounter an issue and then increase the logging level to DEBUG without restarting the shell.
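This capability shipped as SparkContext.setLogLevel (consistent with the Fix Version of 1.4.0 above); usage from the shell:

{code}
sc.setLogLevel("WARN")   // keep output quiet during normal work
sc.setLogLevel("DEBUG")  // turn up verbosity while investigating an issue
{code}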
[jira] [Created] (SPARK-8742) Improve SparkR error messages for DataFrame API
Hossein Falaki created SPARK-8742:
Summary: Improve SparkR error messages for DataFrame API
Key: SPARK-8742 URL: https://issues.apache.org/jira/browse/SPARK-8742 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Hossein Falaki Priority: Blocker

Currently all DataFrame API errors result in the following generic error:
{code}
Error: returnStatus == 0 is not TRUE
{code}
This is because invokeJava in backend.R does not inspect error messages. For most use cases it is critical to return better error messages. Initially, we can return the stack trace from the JVM. In the future we can inspect the errors and translate them into human-readable error messages.
[jira] [Comment Edited] (SPARK-8318) Spark Streaming Starter JIRAs
[ https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609399#comment-14609399 ]

Tathagata Das edited comment on SPARK-8318 at 7/1/15 1:19 AM:
I think the starter label is not very easy to find, and most people search by JIRAs. In this way, we get the benefit of both, starter label as well as a easy to find JIRA Case in point, the subtasks got solved pretty fast.
was (Author: tdas): I think the starter label is not very easy to find, and most people search by JIRAs. In this way, we get the benefit of both, starter label as well finding a JIRA. Case in point, the subtasks got solved pretty fast.

Spark Streaming Starter JIRAs
Key: SPARK-8318 URL: https://issues.apache.org/jira/browse/SPARK-8318 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Priority: Minor Labels: starter

This is a master JIRA to collect together all starter tasks related to Spark Streaming. These are simple tasks that contributors can do to get familiar with the process of contributing.
[jira] [Created] (SPARK-8744) StringIndexerModel should have public constructor
Joseph K. Bradley created SPARK-8744:
Summary: StringIndexerModel should have public constructor
Key: SPARK-8744 URL: https://issues.apache.org/jira/browse/SPARK-8744 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial

It would be helpful to allow users to pass a pre-computed index to create an indexer.
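Hypothetical usage this change would enable (the uid and labels are made up, and the constructor signature is assumed from the 1.4 source, where it is private[ml]):

{code}
import org.apache.spark.ml.feature.StringIndexerModel

val labels = Array("a", "b", "c")  // a pre-computed index, ordered by desired label index
val model = new StringIndexerModel("myStringIndexer", labels)
{code}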
[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609477#comment-14609477 ]

Vinod KC commented on SPARK-8628: Can you please assign this to me?

Race condition in AbstractSparkSQLParser.parse
Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Priority: Critical Labels: regression Fix For: 1.5.0, 1.4.2

SPARK-5009 introduced the following code in AbstractSparkSQLParser:
{code}
def parse(input: String): LogicalPlan = {
  // Initialize the Keywords.
  lexical.initialize(reservedWords)
  phrase(start)(new lexical.Scanner(input)) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
{code}
The corresponding initialize method in SqlLexical is not thread-safe:
{code}
/* This is a work around to support the lazy setting */
def initialize(keywords: Seq[String]): Unit = {
  reserved.clear()
  reserved ++= keywords
}
{code}
I'm hitting this when parsing multiple SQL queries concurrently. When parsing of one query starts, it empties the reserved keyword list; then a race condition occurs and other queries fail to parse because they recognize keywords as identifiers.
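One possible shape of a fix, reusing the members from the snippet above (illustrative only, not necessarily the merged change): make keyword initialization and parsing atomic per parser instance, so a concurrent parse cannot observe a half-initialized keyword table.

{code}
def parse(input: String): LogicalPlan = synchronized {
  // initialize and parse under one lock, so no other thread can
  // clear the reserved-keyword set mid-parse
  lexical.initialize(reservedWords)
  phrase(start)(new lexical.Scanner(input)) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
{code}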
[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609481#comment-14609481 ]

Apache Spark commented on SPARK-6602: User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7141

Replace direct use of Akka with Spark RPC interface
Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu Priority: Critical Fix For: 1.5.0
[jira] [Resolved] (SPARK-8535) PySpark: Can't create DataFrame from Pandas dataframe with no explicit column name
[ https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-8535: Resolution: Fixed, Fix Version/s: 1.5.0. Issue resolved by pull request 7124 [https://github.com/apache/spark/pull/7124]

PySpark: Can't create DataFrame from Pandas dataframe with no explicit column name
Key: SPARK-8535 URL: https://issues.apache.org/jira/browse/SPARK-8535 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Christophe Bourguignat Fix For: 1.5.0

Trying to create a Spark DataFrame from a pandas dataframe with no explicit column name:

pandasDF = pd.DataFrame([[1, 2], [5, 6]])
sparkDF = sqlContext.createDataFrame(pandasDF)

This fails with:

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    344
    345         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 346         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    347         return DataFrame(df, self)
    348

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.
[jira] [Comment Edited] (SPARK-6990) Add Java linting script
[ https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609531#comment-14609531 ]

Yu Ishikawa edited comment on SPARK-6990 at 7/1/15 3:59 AM:
I think it would be nice to execute {{mvn checkstyle:checkstyle}} with the checkstyle maven plugin. What do you think about that? And do you have any good idea to realize the linter?
was (Author: yuu.ishik...@gmail.com): I think it would be nice to execute `mvn checkstyle: checkstyle` with the checkstyle maven plugin. What do you think about that? And do you have any good idea to realize the linter?

Add Java linting script
Key: SPARK-6990 URL: https://issues.apache.org/jira/browse/SPARK-6990 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Josh Rosen Priority: Minor Labels: starter

It would be nice to add a {{dev/lint-java}} script to enforce style rules for Spark's Java code.
[jira] [Commented] (SPARK-8742) Improve SparkR error messages for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-8742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609558#comment-14609558 ]

Shivaram Venkataraman commented on SPARK-8742: Thanks [~falaki] for creating this. This is a pretty important issue and I think there might be a bunch of things to improve here. I think the most important thing is to filter out the Netty stack trace that comes from the RBackend handler. Typically the Netty server throws an error when some other Java function call has failed, and the error is rarely in the Netty call itself. One way to do this might be to return a string message that encodes part of the actual exception when the return status is non-zero.

Improve SparkR error messages for DataFrame API
Key: SPARK-8742 URL: https://issues.apache.org/jira/browse/SPARK-8742 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Hossein Falaki Priority: Blocker

Currently all DataFrame API errors result in the following generic error:
{code}
Error: returnStatus == 0 is not TRUE
{code}
This is because invokeJava in backend.R does not inspect error messages. For most use cases it is critical to return better error messages. Initially, we can return the stack trace from the JVM. In the future we can inspect the errors and translate them into human-readable error messages.
[jira] [Commented] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609563#comment-14609563 ]

Apache Spark commented on SPARK-8746: User 'ckadner' has created a pull request for this issue: https://github.com/apache/spark/pull/7144

Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Priority: Trivial Labels: documentation, test Original Estimate: 1h Remaining Estimate: 1h

The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, and none of the linked mirror sites still has the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/
[jira] [Created] (SPARK-8748) Move castability test out from Cast case class into Cast object
Reynold Xin created SPARK-8748:
Summary: Move castability test out from Cast case class into Cast object
Key: SPARK-8748 URL: https://issues.apache.org/jira/browse/SPARK-8748 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin

So that we can use it as a static method in the analyzer.
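An illustrative sketch of the refactoring (the helper name and the drastically simplified rules are assumptions, not Spark's actual cast rules): with the predicate on the companion object, the analyzer can ask Cast.canCast(from, to) without constructing a Cast expression.

{code}
import org.apache.spark.sql.types.{DataType, NumericType, StringType}

object Cast {
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (fromType, toType) if fromType == toType => true
    case (_, StringType)                          => true
    case (StringType, _: NumericType)             => true
    case _                                        => false  // the real rules cover many more cases
  }
}
{code}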
[jira] [Comment Edited] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609569#comment-14609569 ] venu k tangirala edited comment on SPARK-6101 at 7/1/15 5:21 AM: - Hi Chris, does this include writing back to DynamoDB? Is someone working on this? Does this work in pyspark too? was (Author: venuktan): Hi Chris, does this include writing back to DynamoDB? Is someone working on this? Create a SparkSQL DataSource API implementation for DynamoDB Key: SPARK-6101 URL: https://issues.apache.org/jira/browse/SPARK-6101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Chris Fregly Assignee: Chris Fregly Fix For: 1.5.0 similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609579#comment-14609579 ] Davies Liu commented on SPARK-8653: --- [~rxin] With the new `ExpectsInputTypes`, we still need a way to tell how to do the conversion; it's ugly to do the type switch in eval() or codegen(). Maybe we could improve `AutoCastInputType` to have a method `acceptedTypes`, which returns a list of lists of data types, specifying which types can be cast into the expected types. By default, it will accept all types which can be cast to the expected types. Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, we have traits in Expression like `ExpectsInputTypes` and also `checkInputDataTypes`, but we cannot convert the children expressions automatically unless we write new rules in `HiveTypeCoercion`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
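For illustration, a hedged sketch of the `acceptedTypes` idea with made-up names (Catalyst's actual API differs): each expression declares, per child, which input types it can consume directly, and a coercion rule would wrap any other castable child in a Cast.
{code}
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType

// Hypothetical trait: expectedTypes drives evaluation; acceptedTypes lists,
// per child, the types usable without an explicit cast.
trait AutoCastInputType {
  def expectedTypes: Seq[DataType]
  def acceptedTypes: Seq[Seq[DataType]] = expectedTypes.map(Seq(_))
}

// Example: a square root that evaluates on doubles but also accepts ints,
// relying on the coercion rule to insert the int-to-double cast.
class Sqrt extends AutoCastInputType {
  def expectedTypes: Seq[DataType] = Seq(DoubleType)
  override def acceptedTypes: Seq[Seq[DataType]] = Seq(Seq(DoubleType, IntType))
}
{code}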
[jira] [Commented] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609580#comment-14609580 ] Apache Spark commented on SPARK-8647: - User 'aloknsingh' has created a pull request for this issue: https://github.com/apache/spark/pull/7146 Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just a code-documentation issue. The issue is with the MatrixUDT class: if we decide to put instances of MatrixUDT into a hash-based collection, the hashCode function returns a constant, even though the equals method is consistent with hashCode. I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason for this code, we should document it properly so that others reading it are fine. regards, Alok Details = a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala lines 188-197, i.e. {code} override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 {code} b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
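For illustration, a hedged sketch of one conventional alternative (this is a common pattern for UDT-style classes, not necessarily the patch in the PR above): keep the constant-per-class contract that equals requires, but derive the constant from the class name so it is self-documenting.
{code}
class MatrixUDT {
  // All MatrixUDT instances are interchangeable, so equality is by class.
  override def equals(o: Any): Boolean = o match {
    case _: MatrixUDT => true
    case _            => false
  }
  // Still constant across instances (consistent with equals), but the value
  // is derived rather than an unexplained literal like 1994.
  override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
}
{code}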
[jira] [Assigned] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8647: --- Assignee: (was: Apache Spark) Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just a code-documentation issue. The issue is with the MatrixUDT class: if we decide to put instances of MatrixUDT into a hash-based collection, the hashCode function returns a constant, even though the equals method is consistent with hashCode. I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason for this code, we should document it properly so that others reading it are fine. regards, Alok Details = a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala lines 188-197, i.e. {code} override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 {code} b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8647: --- Assignee: Apache Spark Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Assignee: Apache Spark Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just a code-documentation issue. The issue is with the MatrixUDT class: if we decide to put instances of MatrixUDT into a hash-based collection, the hashCode function returns a constant, even though the equals method is consistent with hashCode. I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason for this code, we should document it properly so that others reading it are fine. regards, Alok Details = a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala lines 188-197, i.e. {code} override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 {code} b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608169#comment-14608169 ] Antony Mayi commented on SPARK-8708: The real case is about 13M of users, few hundreds of products and about 500 partitions. The rdd returned by .predictAll() utilizes single partition as in my example (btw. why do you say I have one partition in my toy example? It is using 5 partitions, all of them utilized before it comes to ALS - to me it replicates the real issue I am facing). MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, e.g. if running .predictAll() in a loop for thousands of products, and it should be possible to do it somehow on the model automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8044) Avoid to use directMemory while put or get disk level block from file
[ https://issues.apache.org/jira/browse/SPARK-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8044. -- Resolution: Won't Fix Avoid to use directMemory while put or get disk level block from file - Key: SPARK-8044 URL: https://issues.apache.org/jira/browse/SPARK-8044 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: SuYan Priority: Critical 1. I found that if we use getChannel to put or get data, it will create a DirectBuffer anyway, which is not controllable. According to the OpenJDK source code, it creates a ThreadLocal direct-buffer pool and does not provide a guaranteed way to ensure the direct buffer is released; the buffer is cached in the pool.
{code}
sun.nio.ch.FileChannelImpl.java

public int write(ByteBuffer src) throws IOException {
    ensureOpen();
    if (!writable)
        throw new NonWritableChannelException();
    synchronized (positionLock) {
        int n = 0;
        int ti = -1;
        try {
            begin();
            if (!isOpen())
                return 0;
            ti = threads.add();
            if (appending)
                position(size());
            do {
                n = IOUtil.write(fd, src, -1, nd, positionLock);
            } while ((n == IOStatus.INTERRUPTED) && isOpen());
            return IOStatus.normalize(n);
        } finally {
            threads.remove(ti);
            end(n > 0);
            assert IOStatus.check(n);
        }
    }
}
{code}
{code}
IOUtil.java

static int write(FileDescriptor fd, ByteBuffer src, long position,
                 NativeDispatcher nd, Object lock)
    throws IOException
{
    if (src instanceof DirectBuffer)
        return writeFromNativeBuffer(fd, src, position, nd, lock);

    // Substitute a native buffer
    int pos = src.position();
    int lim = src.limit();
    assert (pos <= lim);
    int rem = (pos <= lim ? lim - pos : 0);
    ByteBuffer bb = null;
    try {
        bb = Util.getTemporaryDirectBuffer(rem);
        bb.put(src);
        bb.flip();
        // Do not update src until we see how many bytes were written
        src.position(pos);

        int n = writeFromNativeBuffer(fd, bb, position, nd, lock);
        if (n > 0) {
            // now update src
            src.position(pos + n);
        }
        return n;
    } finally {
        Util.releaseTemporaryDirectBuffer(bb);
    }
}
{code}
{code}
Util.java

static ByteBuffer getTemporaryDirectBuffer(int size) {
    ByteBuffer buf = null;
    // Grab a buffer if available
    for (int i = 0; i < TEMP_BUF_POOL_SIZE; i++) {
        SoftReference ref = (SoftReference)(bufferPool[i].get());
        if ((ref != null) && ((buf = (ByteBuffer)ref.get()) != null) &&
            (buf.capacity() >= size)) {
            buf.rewind();
            buf.limit(size);
            bufferPool[i].set(null);
            return buf;
        }
    }

    // Make a new one
    return ByteBuffer.allocateDirect(size);
}
{code}
{code}
private static final int TEMP_BUF_POOL_SIZE = 3;

// Per-thread soft cache of the last temporary direct buffer
private static ThreadLocal[] bufferPool;

static {
    bufferPool = new ThreadLocal[TEMP_BUF_POOL_SIZE];
    for (int i = 0; i < TEMP_BUF_POOL_SIZE; i++)
        bufferPool[i] = new ThreadLocal();
}

static void releaseTemporaryDirectBuffer(ByteBuffer buf) {
    if (buf == null)
        return;
    // Put it in an empty slot if such
{code}
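For illustration, a hedged sketch of the kind of heap-only write path the ticket argues for, using a hypothetical helper name rather than Spark's actual BlockManager code: writing through a plain buffered stream hands the JDK a byte array directly, so IOUtil never substitutes a thread-local temporary direct buffer the way FileChannel.write(ByteBuffer) does for heap buffers.
{code}
import java.io.{BufferedOutputStream, FileOutputStream}

// Hypothetical helper: put a disk-level block using stream I/O over a heap
// array instead of a channel, so no DirectBuffer is allocated on this path.
def writeBlockToFile(path: String, data: Array[Byte]): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(path), 64 * 1024)
  try out.write(data)
  finally out.close()
}
{code}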
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608214#comment-14608214 ] Antony Mayi commented on SPARK-8708: OK, a more detailed example showing there really are 5 partitions used in this case, but eventually .predictAll() pushes everything to just one. This is exactly what I am seeing in production - out of 500 partitions, a single one gets all the millions of predictions in it; all other partitions are empty. {code} from operator import itemgetter from pyspark.mllib.recommendation import ALS from pyspark import SparkConf sconf = SparkConf() sconf.get('spark.default.parallelism') u'5' r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.glom().map(len).collect() [1, 1, 1, 1, 1] users = ratings.map(itemgetter(0)).distinct() users.glom().map(len).collect() [0, 1, 1, 1, 0] model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, e.g. if running .predictAll() in a loop for thousands of products, and it should be possible to do it somehow on the model automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608169#comment-14608169 ] Antony Mayi edited comment on SPARK-8708 at 6/30/15 11:55 AM: -- The real case is about 13M of users, few hundreds of products and about 500 partitions. The rdd returned by .predictAll() utilizes single partition as in my example (btw. why do you say I have one partition in my toy example? It is using 5 partitions, all of them utilized before it comes to ALS - to me it replicates the real issue I am facing). was (Author: antonymayi): The real case is about 13M of users, few hundreds of products and about 500 partitions. The rdd returned by .predictAll() utilizes single partition as in my example (btw. why do you say I have one partition in my toy example? It is using 5 partitions, all of them utilized before it comes to ALS - to me it replicate the real issue I am facing). MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, e.g. if running .predictAll() in a loop for thousands of products, and it should be possible to do it somehow on the model automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only
[ https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608194#comment-14608194 ] Sean Owen commented on SPARK-8708: -- Does that actually make 5 partitions? I see that's what's requested, but are the items evenly distributed? The computation doesn't use 1 partition, so the question is why the result would have 1 partition. It might if you have a small number of products that all get into one partition for whatever reason, I think, since the final join is on product. I have one more idea on the PR ... MatrixFactorizationModel.predictAll() populates single partition only - Key: SPARK-8708 URL: https://issues.apache.org/jira/browse/SPARK-8708 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Antony Mayi When using mllib.recommendation.ALS, the RDD returned by .predictAll() has all values pushed into a single partition despite using quite high parallelism. This degrades performance of further processing (I can obviously run .partitionBy() to balance it, but that's still too costly, e.g. if running .predictAll() in a loop for thousands of products, and it should be possible to do it somehow on the model automatically). Below is an example on a tiny sample (same on a large dataset): {code:title=pyspark} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) r4 = (2, 2, 2.0) r5 = (3, 1, 1.0) ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) ratings.getNumPartitions() 5 users = ratings.map(itemgetter(0)).distinct() model = ALS.trainImplicit(ratings, 1, seed=10) predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
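To make the .partitionBy() mitigation from the description concrete, here is a hedged sketch in Scala against the MLlib API (pyspark's predictAll corresponds to predict on the Scala model; the helper name is made up): re-key the prediction RDD by user and rebalance it. This is exactly the extra shuffle the reporter would like the model to avoid.
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Hypothetical workaround helper: spread predictions across numPartitions
// by hashing on the user id instead of leaving them where the join put them.
def balancedPredictions(model: MatrixFactorizationModel,
                        userProducts: RDD[(Int, Int)],
                        numPartitions: Int): RDD[Rating] =
  model.predict(userProducts)
    .map(r => (r.user, r))                        // key by user
    .partitionBy(new HashPartitioner(numPartitions))
    .values
{code}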
[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles
[ https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608174#comment-14608174 ] Apache Spark commented on SPARK-8437: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7126 Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles - Key: SPARK-8437 URL: https://issues.apache.org/jira/browse/SPARK-8437 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.3.1, 1.4.0 Environment: Ubuntu 15.04 + local filesystem Amazon EMR + S3 + HDFS Reporter: Ewan Leith Assignee: Sean Owen Priority: Minor When calling wholeTextFiles or binaryFiles with a directory path with 10,000s of files in it, Spark hangs for a few minutes before processing the files. If you add a * to the end of the path, there is no delay. This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and on S3. To reproduce, create a directory with 50,000 files in it, then run: val a = sc.binaryFiles("file:/path/to/files/") a.count() val b = sc.binaryFiles("file:/path/to/files/*") b.count() and monitor the different startup times. For example, in the spark-shell these commands are pasted in together, so the delay at f.count() is from 10:11:08 to 10:13:29 to output "Total input paths to process : 4", then until 10:15:42 to begin processing files: scala> val f = sc.binaryFiles("file:/home/ewan/large/") 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB) 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with curMem=160616, maxMem=278019440 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB) 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40430 (size: 16.9 KB, free: 265.1 MB) 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21 f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ BinaryFileRDD[0] at binaryFiles at <console>:21 scala> f.count() 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0 15/06/18 10:15:42 INFO SparkContext: Starting job: count at <console>:24 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at <console>:24) with 4 output partitions (allowLocal=false) 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:24) 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List() Adding a * to the end of the path removes the delay: scala> val f = sc.binaryFiles("file:/home/ewan/large/*") 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with curMem=0, maxMem=278019440 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 156.9 KB, free 265.0 MB) 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with curMem=160616, maxMem=278019440 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 265.0 MB) 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42825 (size: 16.9 KB, free: 265.1 MB) 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at <console>:21 f: org.apache.spark.rdd.RDD[(String, org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* BinaryFileRDD[0] at binaryFiles at <console>:21 scala> f.count() 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 0 15/06/18 10:08:35 INFO SparkContext: Starting job: count at <console>:24 15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at <console>:24) with 4 output partitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608242#comment-14608242 ] Apache Spark commented on SPARK-8707: - User 'navis' has created a pull request for this issue: https://github.com/apache/spark/pull/7127 RDD#toDebugString fails if any cached RDD has invalid partitions Key: SPARK-8707 URL: https://issues.apache.org/jira/browse/SPARK-8707 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.4.1 Reporter: Aaron Davidson Labels: starter Repro: {code} sc.textFile("/ThisFileDoesNotExist").cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637) {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
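For illustration, a hedged sketch of the resilience being asked for, using a hypothetical helper rather than the actual patch in the PR above: any RDD whose partitions cannot be computed is reported with zero partitions instead of letting the exception escape from toDebugString.
{code}
import scala.util.Try
import org.apache.spark.rdd.RDD

// Hypothetical helper: swallow per-RDD failures when summarizing storage
// info, so one invalid cached RDD cannot break debugging of another RDD.
def safePartitionCount(rdd: RDD[_]): Int =
  Try(rdd.partitions.length).getOrElse(0)  // invalid RDD => report 0 partitions
{code}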
[jira] [Assigned] (SPARK-7820) Java8-tests suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7820: --- Assignee: (was: Apache Spark) Java8-tests suite compile error under SBT - Key: SPARK-7820 URL: https://issues.apache.org/jira/browse/SPARK-7820 Project: Spark Issue Type: Bug Components: Build, Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Priority: Critical Lots of compilation errors are shown when the Java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-tests}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}}, which exists in the streaming test jar. This is OK for the Maven build, since Maven generates the test jar, but it fails in the SBT test compile because SBT does not generate a test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7820) Java8-tests suite compile error under SBT
[ https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608063#comment-14608063 ] Apache Spark commented on SPARK-7820: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/7120 Java8-tests suite compile error under SBT - Key: SPARK-7820 URL: https://issues.apache.org/jira/browse/SPARK-7820 Project: Spark Issue Type: Bug Components: Build, Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Priority: Critical Lots of compilation errors are shown when the Java 8 test suite is enabled in SBT: {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Pjava8-tests}} {code} [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43: error: cannot find symbol [error] public class Java8APISuite extends LocalJavaStreamingContext implements Serializable { [error]^ [error] symbol: class LocalJavaStreamingContext [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57: error: cannot find symbol [error] JavaTestUtils.attachTestOutputStream(letterCount); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58: error: cannot find symbol [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2); [error] ^ [error] symbol: variable JavaTestUtils [error] location: class Java8APISuite [error] /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73: error: cannot find symbol [error] JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1); [error] ^ [error] symbol: variable ssc [error] location: class Java8APISuite {code} The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}}, which exists in the streaming test jar. This is OK for the Maven build, since Maven generates the test jar, but it fails in the SBT test compile because SBT does not generate a test jar by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8731) Beeline doesn't work with -e option when started in background
[ https://issues.apache.org/jira/browse/SPARK-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608079#comment-14608079 ] Wang Yiguang commented on SPARK-8731: - Here is more discussion about this issue: https://issues.apache.org/jira/browse/HIVE-6758 I looked into it a bit and will give more information later. Beeline doesn't work with -e option when started in background -- Key: SPARK-8731 URL: https://issues.apache.org/jira/browse/SPARK-8731 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Wang Yiguang Priority: Minor Beeline stops when running in the background like this: beeline -e "some query" & It doesn't work even with the -f switch. For example, this works: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" however this does not: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" & -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8731) Beeline doesn't work with -e option when started in background
[ https://issues.apache.org/jira/browse/SPARK-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608084#comment-14608084 ] Sean Owen commented on SPARK-8731: -- Is this Spark-specific or just about beeline? Beeline doesn't work with -e option when started in background -- Key: SPARK-8731 URL: https://issues.apache.org/jira/browse/SPARK-8731 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Wang Yiguang Priority: Minor Beeline stops when running in the background like this: beeline -e "some query" & It doesn't work even with the -f switch. For example, this works: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" however this does not: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" & -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8729) Spark app unable to instantiate the classes using the reflection
[ https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608087#comment-14608087 ] Sean Owen commented on SPARK-8729: -- [~kmurt...@gmail.com] This isn't really useful as you have no info about how you are deploying this. I'm going to close it unless you can provide something much more reproducible. Spark app unable to instantiate the classes using the reflection Key: SPARK-8729 URL: https://issues.apache.org/jira/browse/SPARK-8729 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Murthy Chelankuri Priority: Critical Spark 1.3.0 is unable to instantiate classes using reflection (using Class.forName). It says class not found even though the class is available in the list of jars. The following is the exception I am getting from the executors: {code} java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at kafka.utils.Utils$.createObject(Utils.scala:438) at kafka.producer.Producer.<init>(Producer.scala:61) {code} The application is working fine without any issues with the 1.2.0 version. I am planning to upgrade to 1.3.0 and found that it is not working. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8729) Spark app unable to instantiate the classes using the reflection
[ https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608111#comment-14608111 ] Sean Owen commented on SPARK-8729: -- It just sounds like one of your jars is not deployed in the right place. I don't think this code helps analyze that. You need to verify how you are shipping your app and that you package all necessary classes in your app and submit it through spark-submit. Spark app unable to instantiate the classes using the reflection Key: SPARK-8729 URL: https://issues.apache.org/jira/browse/SPARK-8729 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Murthy Chelankuri Priority: Critical Spark 1.3.0 is unable to instantiate classes using reflection (using Class.forName). It says class not found even though the class is available in the list of jars. The following is the exception I am getting from the executors: {code} java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at kafka.utils.Utils$.createObject(Utils.scala:438) at kafka.producer.Producer.<init>(Producer.scala:61) {code} The application is working fine without any issues with the 1.2.0 version. I am planning to upgrade to 1.3.0 and found that it is not working. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
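One frequent cause of this symptom (offered here as a hedged illustration, not a confirmed diagnosis of this report): the stack trace shows the class being resolved against the AppClassLoader, while jars shipped with spark-submit are visible only to the executor's context classloader. A sketch of the classloader-aware lookup:
{code}
// Hypothetical helper: resolve user classes against the thread context
// classloader, which sees jars added at runtime, rather than the caller's
// defining loader that plain Class.forName(name) uses.
def loadUserClass(name: String): Class[_] =
  Class.forName(name, true, Thread.currentThread().getContextClassLoader)
{code}
Third-party code that calls the one-argument Class.forName (as kafka.utils.Utils does here) cannot benefit from this, which is why packaging the classes into the application jar is the usual advice.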
[jira] [Created] (SPARK-8732) Compilation warning for existentials code
Tijo Thomas created SPARK-8732: -- Summary: Compilation warning for existentials code Key: SPARK-8732 URL: https://issues.apache.org/jira/browse/SPARK-8732 Project: Spark Issue Type: Improvement Components: Build Reporter: Tijo Thomas Priority: Trivial A compilation warning is raised for Scala code using existentials in: 1. RBackendHandler.scala 2. CatalystTypeConverters.scala. The fix is to add the missing import. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
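For illustration, a hedged sketch of the fix implied above: pull the language feature into scope in each affected file (an alternative is the -language:existentials compiler flag).
{code}
// Silences the -feature warning about inferred existential types.
import scala.language.existentials

// Example of code that otherwise warns: the least upper bound of two
// different Class[_] values is an inferred existential type.
val cls = if (scala.util.Random.nextBoolean()) classOf[Int] else classOf[String]
{code}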
[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608280#comment-14608280 ] Thomas Graves commented on SPARK-6735: -- A pull request was up, but I didn't have time to rework it to address some comments; someone else may want to take this over: https://github.com/apache/spark/pull/5449 Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which sets the maximum number of executor failures, after which the application fails. For long-running applications, a user may require that the application never be killed, or may want such a setting to be relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
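For illustration, a hedged sketch of the windowed policy the ticket proposes, with made-up names and semantics (the real PR may differ): only fail the application when more than maxFailures executor failures land inside the trailing window, and treat a non-positive window as "disabled".
{code}
import scala.collection.mutable

// Hypothetical tracker: returns true when the failure threshold is exceeded
// within the trailing window, which would trigger killing the application.
class WindowedFailureTracker(maxFailures: Int, windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(nowMs: Long): Boolean = {
    if (windowMs <= 0) return false              // disabled: never kill the app
    failureTimes.enqueue(nowMs)
    while (failureTimes.nonEmpty && failureTimes.head < nowMs - windowMs)
      failureTimes.dequeue()                     // drop failures outside window
    failureTimes.size > maxFailures
  }
}
{code}
Whether "disabled" should mean never killing the application or falling back to the absolute count is exactly the policy question the ticket leaves open.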
[jira] [Assigned] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8707: --- Assignee: (was: Apache Spark) RDD#toDebugString fails if any cached RDD has invalid partitions Key: SPARK-8707 URL: https://issues.apache.org/jira/browse/SPARK-8707 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.4.1 Reporter: Aaron Davidson Labels: starter Repro: {code} sc.textFile("/ThisFileDoesNotExist").cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637) {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8707: --- Assignee: Apache Spark RDD#toDebugString fails if any cached RDD has invalid partitions Key: SPARK-8707 URL: https://issues.apache.org/jira/browse/SPARK-8707 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0, 1.4.1 Reporter: Aaron Davidson Assignee: Apache Spark Labels: starter Repro: {code} sc.textFile("/ThisFileDoesNotExist").cache() sc.parallelize(0 until 100).toDebugString {code} Output: {code} java.io.IOException: Not a file: /ThisFileDoesNotExist at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455) at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573) at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607) at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637) {code} This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8729) Spark app unable to instantiate the classes using the reflection
Murthy Chelankuri created SPARK-8729: Summary: Spark app unable to instantiate the classes using the reflection Key: SPARK-8729 URL: https://issues.apache.org/jira/browse/SPARK-8729 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.3.0 Reporter: Murthy Chelankuri Priority: Critical Spark 1.3.0 is unable to instantiate classes using reflection (using Class.forName). It says class not found even though the class is available in the list of jars. The following is the exception I am getting from the executors: {code} java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at kafka.utils.Utils$.createObject(Utils.scala:438) at kafka.producer.Producer.<init>(Producer.scala:61) {code} The application is working fine without any issues with the 1.2.0 version. I am planning to upgrade to 1.3.0 and found that it is not working. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8731) Beeline doesn't work with -e option when started in background
Wang Yiguang created SPARK-8731: --- Summary: Beeline doesn't work with -e option when started in background Key: SPARK-8731 URL: https://issues.apache.org/jira/browse/SPARK-8731 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Wang Yiguang Priority: Minor Beeline stops when running in the background like this: beeline -e "some query" & It doesn't work even with the -f switch. For example, this works: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" however this does not: beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" & -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8730) Deser primitive class with Java serialization
[ https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8730: --- Assignee: Apache Spark Deser primitive class with Java serialization - Key: SPARK-8730 URL: https://issues.apache.org/jira/browse/SPARK-8730 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Eugen Cepoi Assignee: Apache Spark Priority: Critical Objects that contain a primitive Class as a property cannot be deserialized using the Java serde, because Class.forName does not work for primitives. Example of such an object: {code} class Foo extends Serializable { val intClass = classOf[Int] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8730) Deser primitive class with Java serialization
[ https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8730: --- Assignee: (was: Apache Spark) Deser primitive class with Java serialization - Key: SPARK-8730 URL: https://issues.apache.org/jira/browse/SPARK-8730 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Eugen Cepoi Priority: Critical Objects that contain a primitive Class as a property cannot be deserialized using the Java serde, because Class.forName does not work for primitives. Example of such an object: {code} class Foo extends Serializable { val intClass = classOf[Int] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8730) Deser primitive class with Java serialization
[ https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608076#comment-14608076 ] Apache Spark commented on SPARK-8730: - User 'EugenCepoi' has created a pull request for this issue: https://github.com/apache/spark/pull/7122 Deser primitive class with Java serialization - Key: SPARK-8730 URL: https://issues.apache.org/jira/browse/SPARK-8730 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Eugen Cepoi Priority: Critical Objects that contain a primitive Class as a property cannot be deserialized using the Java serde, because Class.forName does not work for primitives. Example of such an object: {code} class Foo extends Serializable { val intClass = classOf[Int] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
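For illustration, a hedged sketch of the usual shape of this fix (not necessarily the code in the PR above): Class.forName("int") throws, so an ObjectInputStream subclass resolves primitive type descriptors from an explicit table before falling back to the classloader.
{code}
import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

// Hypothetical stream: handles descriptors like "int" that name primitive
// classes, which the default resolveClass cannot load via Class.forName.
class PrimitiveAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  // In Scala, classOf[Int] etc. are the JVM primitive classes (Integer.TYPE).
  private val primitives = Map[String, Class[_]](
    "boolean" -> classOf[Boolean], "byte" -> classOf[Byte],
    "char" -> classOf[Char], "short" -> classOf[Short],
    "int" -> classOf[Int], "long" -> classOf[Long],
    "float" -> classOf[Float], "double" -> classOf[Double],
    "void" -> classOf[Unit])

  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    primitives.getOrElse(desc.getName,
      Class.forName(desc.getName, false, loader))
}
{code}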
[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large
[ https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608287#comment-14608287 ] Thomas Graves commented on SPARK-6951: -- This actually happens more than just at startup. If you have a large number of files, especially in-progress files, or even just large history files, it takes forever for the history server to pick up new completed ones and show them on the UI. History server slow startup if the event log directory is large --- Key: SPARK-6951 URL: https://issues.apache.org/jira/browse/SPARK-6951 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah I started my history server, then navigated to the web UI where I expected to be able to view some completed applications, but the webpage was not available. It turned out that the History Server was not finished parsing all of the event logs in the event log directory that I had specified. I had accumulated a lot of event logs from months of running Spark, so it would have taken a very long time for the History Server to crunch through them all. I purged the event log directory and started from scratch, and the UI loaded immediately. We should have a pagination strategy or parse the directory lazily to avoid needing to wait after starting the history server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8464) Consider separating aggregator and non-aggregator paths in ExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608292#comment-14608292 ] Ilya Ganelin commented on SPARK-8464: - Josh - I'd be happy to look into this, I'll submit a PR shortly. Consider separating aggregator and non-aggregator paths in ExternalSorter - Key: SPARK-8464 URL: https://issues.apache.org/jira/browse/SPARK-8464 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Josh Rosen ExternalSorter is still really complicated and hard to understand. We should investigate whether separating the aggregator and non-aggregator paths into separate files would make the code easier to understand without introducing significant duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609545#comment-14609545 ] Kamlesh Kumar commented on SPARK-8699: -- Thanks Shivaram, it works. Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0 --- Key: SPARK-8699 URL: https://issues.apache.org/jira/browse/SPARK-8699 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows 7, 64 bit Reporter: Kamlesh Kumar Priority: Critical Labels: test I can successfully run showDF and head on the rrdd data frame in R, but it throws an unexpected error for select commands. The R console output after running the select command on the rrdd data object is the following: command: head(select(df, df$eruptions)) output: Error in head(select(df, df$eruptions)) : error in evaluating the argument 'x' in selecting a method for function 'head': Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "DataFrame" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8699. -- Resolution: Not A Problem Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0 --- Key: SPARK-8699 URL: https://issues.apache.org/jira/browse/SPARK-8699 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows 7, 64 bit Reporter: Kamlesh Kumar Priority: Critical Labels: test I can successfully run showDF and head on the SparkR DataFrame in R, but select commands throw an unexpected error. R console output after running select on the DataFrame: command: head(select(df, df$eruptions)) output: Error in head(select(df, df$eruptions)) : error in evaluating the argument 'x' in selecting a method for function 'head': Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "DataFrame" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609545#comment-14609545 ] Kamlesh Kumar edited comment on SPARK-8699 at 7/1/15 4:39 AM: -- Thanks Shivaram, it works; some other package was masking the select function (e.g., qualifying the call as SparkR::select(df, df$eruptions) avoids the conflict). was (Author: kamlesh.kumar): Thanks Shivaram, it works. Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0 --- Key: SPARK-8699 URL: https://issues.apache.org/jira/browse/SPARK-8699 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows 7, 64 bit Reporter: Kamlesh Kumar Priority: Critical Labels: test I can successfully run showDF and head on the SparkR DataFrame in R, but select commands throw an unexpected error. R console output after running select on the DataFrame: command: head(select(df, df$eruptions)) output: Error in head(select(df, df$eruptions)) : error in evaluating the argument 'x' in selecting a method for function 'head': Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "DataFrame" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0
[ https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kamlesh Kumar updated SPARK-8699: - Priority: Trivial (was: Critical) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0 --- Key: SPARK-8699 URL: https://issues.apache.org/jira/browse/SPARK-8699 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows 7, 64 bit Reporter: Kamlesh Kumar Priority: Trivial Labels: test I can successfully run showDF and head on the SparkR DataFrame in R, but select commands throw an unexpected error. R console output after running select on the DataFrame: command: head(select(df, df$eruptions)) output: Error in head(select(df, df$eruptions)) : error in evaluating the argument 'x' in selecting a method for function 'head': Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "DataFrame" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8747) fix EqualNullSafe for binary type
Wenchen Fan created SPARK-8747: -- Summary: fix EqualNullSafe for binary type Key: SPARK-8747 URL: https://issues.apache.org/jira/browse/SPARK-8747 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
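The ticket carries no description, but EqualNullSafe is Catalyst's null-safe equality operator (SQL's {{<=>}}). A short Scala sketch of those semantics, illustrating why binary needs special treatment (Array[Byte] compares by reference with ==, so contents must be compared, e.g. with java.util.Arrays.equals):
{code}
import java.util.Arrays

object NullSafeEquality {
  // Null-safe equality: true when both sides are null, false when exactly
  // one is, value equality otherwise. For BinaryType, == on Array[Byte]
  // is reference equality, so the byte contents have to be compared.
  def nullSafeEquals(left: Any, right: Any): Boolean = (left, right) match {
    case (null, null)                     => true
    case (null, _) | (_, null)            => false
    case (l: Array[Byte], r: Array[Byte]) => Arrays.equals(l, r)
    case (l, r)                           => l == r
  }
}
{code}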
[jira] [Commented] (SPARK-8747) fix EqualNullSafe for binary type
[ https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609562#comment-14609562 ] Apache Spark commented on SPARK-8747: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7143 fix EqualNullSafe for binary type - Key: SPARK-8747 URL: https://issues.apache.org/jira/browse/SPARK-8747 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8748) Move castability test out from Cast case class into Cast object
[ https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8748: --- Assignee: Reynold Xin (was: Apache Spark) Move castability test out from Cast case class into Cast object --- Key: SPARK-8748 URL: https://issues.apache.org/jira/browse/SPARK-8748 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin So we can use it as a static method in the analyzer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
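A simplified sketch of the refactoring shape (toy types, not the real Cast implementation): with the castability test on the companion object, the analyzer can ask Cast.canCast(from, to) without constructing a Cast expression first.
{code}
// Toy type lattice standing in for Catalyst's DataType hierarchy.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

object Cast {
  // A pure function of the two types, callable statically from analysis rules.
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (f, t) if f == t      => true
    case (_, StringType)       => true  // anything can render as a string
    case (StringType, IntType) => true  // parseable, may still fail at runtime
    case _                     => false
  }
}
{code}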
[jira] [Commented] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609598#comment-14609598 ] Murtaza Kanchwala commented on SPARK-6101: -- Read is implemented in https://github.com/cfregly/spark-dynamodb, but save is not. I'd recommend using Amazon's DynamoDB Mapper for writes. Create a SparkSQL DataSource API implementation for DynamoDB Key: SPARK-6101 URL: https://issues.apache.org/jira/browse/SPARK-6101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Chris Fregly Assignee: Chris Fregly Fix For: 1.5.0 similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
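For context, such a connector follows Spark's external Data Source API, the same pattern spark-avro and spark-csv use. A minimal read-only sketch; the DynamoDB scan itself is stubbed out, and only the org.apache.spark.sql.sources interfaces are real:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point resolved when the data source is referenced by format name.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DynamoDBRelation(parameters("table"))(sqlContext)
}

class DynamoDBRelation(table: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // A real implementation would derive this from the table's attributes.
  override def schema: StructType =
    StructType(StructField("key", StringType) :: Nil)

  // Placeholder: a real buildScan would page through a DynamoDB scan.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}
{code}
Write support would additionally implement CreatableRelationProvider, which is the part the comment above notes is missing.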
[jira] [Commented] (SPARK-8535) PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
[ https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609552#comment-14609552 ] Yuri Saito commented on SPARK-8535: --- Could you change the assignee from unassigned to me? PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name --- Key: SPARK-8535 URL: https://issues.apache.org/jira/browse/SPARK-8535 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Reporter: Christophe Bourguignat Fix For: 1.5.0 Trying to create a Spark DataFrame from a pandas DataFrame with no explicit column names:
{code}
pandasDF = pd.DataFrame([[1, 2], [5, 6]])
sparkDF = sqlContext.createDataFrame(pandasDF)
{code}
{code}
----> 1 sparkDF = sqlContext.createDataFrame(pandasDF)

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    344
    345         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 346         df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    347         return DataFrame(df, self)
    348

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538             self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8746: --- Assignee: (was: Apache Spark) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Priority: Trivial Labels: documentation, test Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, but none of the linked mirror sites still hosts the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
[ https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8746: --- Assignee: Apache Spark Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) -- Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Assignee: Apache Spark Priority: Trivial Labels: documentation, test Original Estimate: 1h Remaining Estimate: 1h The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, but none of the linked mirror sites still hosts the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609569#comment-14609569 ] venu k tangirala commented on SPARK-6101: - Hi Chris, does this include writing back to DynamoDB? Is someone working on this? Create a SparkSQL DataSource API implementation for DynamoDB Key: SPARK-6101 URL: https://issues.apache.org/jira/browse/SPARK-6101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Chris Fregly Assignee: Chris Fregly Fix For: 1.5.0 similar to https://github.com/databricks/spark-avro and https://github.com/databricks/spark-csv -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8748) Move castability test out from Cast case class into Cast object
[ https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609576#comment-14609576 ] Apache Spark commented on SPARK-8748: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7145 Move castability test out from Cast case class into Cast object --- Key: SPARK-8748 URL: https://issues.apache.org/jira/browse/SPARK-8748 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin So we can use it as a static method in the analyzer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8748) Move castability test out from Cast case class into Cast object
[ https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8748: --- Assignee: Apache Spark (was: Reynold Xin) Move castability test out from Cast case class into Cast object --- Key: SPARK-8748 URL: https://issues.apache.org/jira/browse/SPARK-8748 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark So we can use it as a static method in the analyzer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8749) Remove HiveTypeCoercion trait
[ https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8749: --- Assignee: Reynold Xin (was: Apache Spark) Remove HiveTypeCoercion trait - Key: SPARK-8749 URL: https://issues.apache.org/jira/browse/SPARK-8749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is easier to test rules if they are in the companion object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8749) Remove HiveTypeCoercion trait
[ https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8749: --- Assignee: Apache Spark (was: Reynold Xin) Remove HiveTypeCoercion trait - Key: SPARK-8749 URL: https://issues.apache.org/jira/browse/SPARK-8749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark It is easier to test rules if they are in the companion object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8749) Remove HiveTypeCoercion trait
[ https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609594#comment-14609594 ] Apache Spark commented on SPARK-8749: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7147 Remove HiveTypeCoercion trait - Key: SPARK-8749 URL: https://issues.apache.org/jira/browse/SPARK-8749 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is easier to test rules if they are in the companion object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
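The testability argument is easiest to see side by side (hypothetical rule, heavily simplified): a rule nested in a trait is only reachable through an instance that mixes the trait in, while a rule on an object is directly addressable from a test.
{code}
// Before: the rule lives in a trait, so a test needs an instance.
trait HiveTypeCoercionTrait {
  object WidenTypes { def widen(n: Int): Long = n.toLong }
}

// A test must first conjure something that mixes the trait in:
object SomeAnalyzer extends HiveTypeCoercionTrait
// SomeAnalyzer.WidenTypes.widen(1)

// After: the rule lives on the object itself, so a test can simply call
// HiveTypeCoercion.WidenTypes.widen(1).
object HiveTypeCoercion {
  object WidenTypes { def widen(n: Int): Long = n.toLong }
}
{code}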
[jira] [Created] (SPARK-8750) Remove the closure in functions.callUdf
Reynold Xin created SPARK-8750: -- Summary: Remove the closure in functions.callUdf Key: SPARK-8750 URL: https://issues.apache.org/jira/browse/SPARK-8750 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin {code} [warn] /Users/yhuai/Projects/Spark/yin-spark-1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala:1829: Class org.apache.spark.sql.functions$$anonfun$callUDF$1 differs only in case from org.apache.spark.sql.functions$$anonfun$callUdf$1. Such classes will overwrite one another on case-insensitive filesystems. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
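A minimal reproduction of the warning's cause (hypothetical code, not the real functions.scala): on Scala 2.10/2.11 each closure compiles to an anonfun class named after its enclosing method, so two methods whose names differ only in case emit class files that collide on case-insensitive filesystems such as macOS's default.
{code}
object functions {
  // Compiles to functions$$anonfun$callUDF$1.class ...
  def callUDF(names: Seq[String]): Seq[String] =
    names.map(n => n.toUpperCase)

  // ... and this one to functions$$anonfun$callUdf$1.class, which differs
  // only in case: on a case-insensitive filesystem one overwrites the other.
  def callUdf(names: Seq[String]): Seq[String] =
    names.map(n => n.toUpperCase)
}
{code}
Hence the ticket: rewrite one of the two so it captures no closure (for example, with an explicit loop or a named private method), and no clashing anonfun class is generated.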
[jira] [Commented] (SPARK-6990) Add Java linting script
[ https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609531#comment-14609531 ] Yu Ishikawa commented on SPARK-6990: I think it would be nice to run `mvn checkstyle:checkstyle` with the Checkstyle Maven plugin. What do you think about that? Do you have any other ideas for implementing the linter? Add Java linting script --- Key: SPARK-6990 URL: https://issues.apache.org/jira/browse/SPARK-6990 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Josh Rosen Priority: Minor Labels: starter It would be nice to add a {{dev/lint-java}} script to enforce style rules for Spark's Java code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8745) Remove GenerateMutableProjection
Reynold Xin created SPARK-8745: -- Summary: Remove GenerateMutableProjection Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
Christian Kadner created SPARK-8746: --- Summary: Need to update download link for Hive 0.13.1 jars (HiveComparisonTest) Key: SPARK-8746 URL: https://issues.apache.org/jira/browse/SPARK-8746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Christian Kadner Priority: Trivial The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) describes how to generate golden answer files for new Hive comparison test cases. However, the download link for the Hive 0.13.1 jars points to https://hive.apache.org/downloads.html, but none of the linked mirror sites still hosts the 0.13.1 version. We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8747) fix EqualNullSafe for binary type
[ https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8747: --- Assignee: Apache Spark fix EqualNullSafe for binary type - Key: SPARK-8747 URL: https://issues.apache.org/jira/browse/SPARK-8747 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8747) fix EqualNullSafe for binary type
[ https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8747: --- Assignee: (was: Apache Spark) fix EqualNullSafe for binary type - Key: SPARK-8747 URL: https://issues.apache.org/jira/browse/SPARK-8747 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8745) Remove GenerateMutableProjection
[ https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609566#comment-14609566 ] Akhil Thatipamula commented on SPARK-8745: -- [~rxin] I will work on this. Remove GenerateMutableProjection Key: SPARK-8745 URL: https://issues.apache.org/jira/browse/SPARK-8745 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Based on discussion offline with [~marmbrus], we should remove GenerateMutableProjection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609583#comment-14609583 ] Reynold Xin commented on SPARK-8653: Implicit type casts should be up to the query engine itself, not each individual expression. So we really just need one rule in the TypeCoercion file to handle the implicit type casts, and each expression can simply specify the expected input types. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions https://msdn.microsoft.com/en-us/library/ms191530.aspx Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Reynold Xin Currently, we have traits on Expression such as `ExpectsInputTypes`, and also `checkInputDataTypes`, but they cannot convert the children expressions automatically; we have to write new rules in `HiveTypeCoercion` for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
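A simplified sketch of that design on a toy expression IR (hypothetical types, not Catalyst itself): expressions only declare what they expect, and one coercion rule inserts the casts.
{code}
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType

sealed trait Expr { def dataType: DataType }
case class Literal(value: Any, dataType: DataType) extends Expr
case class Cast(child: Expr, dataType: DataType) extends Expr

// The expression only *declares* its expected input types ...
case class Add(left: Expr, right: Expr) extends Expr {
  val dataType: DataType = DoubleType
  val expectedInputTypes: Seq[DataType] = Seq(DoubleType, DoubleType)
}

// ... and the single coercion rule wraps mismatched children in Cast.
object ImplicitCasts {
  def coerce(e: Expr): Expr = e match {
    case a @ Add(l, r) =>
      val Seq(lt, rt) = a.expectedInputTypes
      Add(
        if (l.dataType == lt) l else Cast(l, lt),
        if (r.dataType == rt) r else Cast(r, rt))
    case other => other
  }
}
{code}
For example, coerce(Add(Literal(1, IntType), Literal(2.0, DoubleType))) rewrites the left child to Cast(Literal(1, IntType), DoubleType) and leaves the right one alone.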
[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-8653: -- Assignee: Reynold Xin Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Assignee: Reynold Xin Currently, we have traits on Expression such as `ExpectsInputTypes`, and also `checkInputDataTypes`, but they cannot convert the children expressions automatically; we have to write new rules in `HiveTypeCoercion` for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org