[jira] [Resolved] (SPARK-8604) Parquet data source doesn't write summary file while doing appending
[ https://issues.apache.org/jira/browse/SPARK-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8604. --- Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Issue resolved by pull request 6998 [https://github.com/apache/spark/pull/6998] Parquet data source doesn't write summary file while doing appending Key: SPARK-8604 URL: https://issues.apache.org/jira/browse/SPARK-8604 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.4.1, 1.5.0 Currently, the Parquet and ORC data sources don't set their output format class, as we override the output committer in Spark SQL. However, SPARK-8678 ignores the user-defined output committer class while appending, to avoid potential issues brought by direct output committers (e.g. {{DirectParquetOutputCommitter}}). This makes both of these data sources fall back to the default output committer retrieved from {{TextOutputFormat}}, which is {{FileOutputCommitter}}. For ORC, it's totally fine since ORC itself just uses {{FileOutputCommitter}}. But for Parquet, {{ParquetOutputCommitter}} also writes the summary files while committing the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600825#comment-14600825 ] Sam Stoelinga commented on SPARK-8587: -- I also agree that this should have the same API across the different languages. There is already a computeCost function, but it doesn't return the index; the problem with predict is that it only returns the index and not the cost. Return cost and cluster index KMeansModel.predict - Key: SPARK-8587 URL: https://issues.apache.org/jira/browse/SPARK-8587 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sam Stoelinga Priority: Minor Looking at the PySpark implementation of KMeansModel.predict, https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 : Currently: it calculates the cost of the closest cluster and returns the index only. My expectation: an easy way to have the same function, or a new one, return the cost along with the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
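For illustration, here is a sketch (not an existing MLlib API) of computing both the closest cluster index and the corresponding squared-distance cost for a single point using the public 1.4 Scala APIs; a combined predict-with-cost method would essentially return this pair:
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: returns (index of closest center, squared distance to that center).
def predictWithCost(model: KMeansModel, point: Vector): (Int, Double) = {
  val costs = model.clusterCenters.map(center => Vectors.sqdist(center, point))
  val bestIndex = costs.indices.minBy(i => costs(i))
  (bestIndex, costs(bestIndex))
}
{code}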
[jira] [Created] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn
Bolke de Bruin created SPARK-8623: - Summary: Some queries in spark-sql lead to NullPointerException when using Yarn Key: SPARK-8623 URL: https://issues.apache.org/jira/browse/SPARK-8623 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: Hadoop 2.6, Kerberos Reporter: Bolke de Bruin The following query was executed using spark-sql --master yarn-client on 1.5.0-SNAPSHOT: select * from wcs.geolite_city limit 10; This led to the following error: 15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:693) at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442) at org.apache.hadoop.mapreduce.Job.<init>(Job.java:131) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83) at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89) at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:127) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124) at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) This does not happen in every case, i.e. some queries execute fine, and it is unclear why. Using just spark-sql, the query executes fine as well, and thus the issue seems to lie in the communication with Yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath
[ https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baswaraj updated SPARK-8622: Description: I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files specified via the spark-submit --jars option. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. was: I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files given via the spark-submit --jars option to be available. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath -- Key: SPARK-8622 URL: https://issues.apache.org/jira/browse/SPARK-8622 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1, 1.4.0 Reporter: Baswaraj I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files specified via the spark-submit --jars option. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath
[ https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baswaraj updated SPARK-8622: Description: I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files given via the spark-submit --jars option to be available. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. was: I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files given via the spark-submit --jars option to be available. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to add the following entry in the slaves' conf/spark-defaults.conf: spark.executor.extraClassPath . Please advise. Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath -- Key: SPARK-8622 URL: https://issues.apache.org/jira/browse/SPARK-8622 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1, 1.4.0 Reporter: Baswaraj I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files given via the spark-submit --jars option to be available. In Spark 1.3.0 the executor working directory is on the executor classpath. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver = NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sjoerd Mulder updated SPARK-8592: - Component/s: Scheduler CoarseGrainedExecutorBackend: Cannot register with driver = NPE Key: SPARK-8592 URL: https://issues.apache.org/jira/browse/SPARK-8592 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Scala 2.11, Java 8, Reporter: Sjoerd Mulder Priority: Minor I cannot reproduce this consistently but when submitting jobs just after another finished it will not come up: {code} 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler java.lang.NullPointerException at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) at java.lang.String.valueOf(String.java:2982) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver = NPE
[ https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sjoerd Mulder updated SPARK-8592: - Component/s: Spark Core CoarseGrainedExecutorBackend: Cannot register with driver = NPE Key: SPARK-8592 URL: https://issues.apache.org/jira/browse/SPARK-8592 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Environment: Ubuntu 14.04, Scala 2.11, Java 8, Reporter: Sjoerd Mulder Priority: Minor I cannot reproduce this consistently but when submitting jobs just after another finished it will not come up: {code} 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler java.lang.NullPointerException at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273) at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273) at org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313) at java.lang.String.valueOf(String.java:2982) at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127) at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8603) In Windows,Not able to create a Spark context from R studio
[ https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8603: --- Assignee: Apache Spark In Windows,Not able to create a Spark context from R studio Key: SPARK-8603 URL: https://issues.apache.org/jira/browse/SPARK-8603 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows, R studio Reporter: Prakash Ponshankaarchinnusamy Assignee: Apache Spark Fix For: 1.4.0 Original Estimate: 0.5m Remaining Estimate: 0.5m In Windows, creation of a Spark context fails using the code below from RStudio:
Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
Error: JVM is not ready after 10 seconds
Reason: Wrong file path computed in client.R. The file separator for Windows [\] is not respected by the file.path function by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8603) In Windows,Not able to create a Spark context from R studio
[ https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600830#comment-14600830 ] Apache Spark commented on SPARK-8603: - User 'prakashpc' has created a pull request for this issue: https://github.com/apache/spark/pull/7012 In Windows,Not able to create a Spark context from R studio Key: SPARK-8603 URL: https://issues.apache.org/jira/browse/SPARK-8603 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows, R studio Reporter: Prakash Ponshankaarchinnusamy Fix For: 1.4.0 Original Estimate: 0.5m Remaining Estimate: 0.5m In Windows, creation of a Spark context fails using the code below from RStudio:
Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
Error: JVM is not ready after 10 seconds
Reason: Wrong file path computed in client.R. The file separator for Windows [\] is not respected by the file.path function by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8603) In Windows,Not able to create a Spark context from R studio
[ https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8603: --- Assignee: (was: Apache Spark) In Windows,Not able to create a Spark context from R studio Key: SPARK-8603 URL: https://issues.apache.org/jira/browse/SPARK-8603 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Environment: Windows, R studio Reporter: Prakash Ponshankaarchinnusamy Fix For: 1.4.0 Original Estimate: 0.5m Remaining Estimate: 0.5m In Windows, creation of a Spark context fails using the code below from RStudio:
Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
Error: JVM is not ready after 10 seconds
Reason: Wrong file path computed in client.R. The file separator for Windows [\] is not respected by the file.path function by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7977) Disallow println
[ https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601827#comment-14601827 ] Jon Alter commented on SPARK-7977: -- Working on this. Disallow println Key: SPARK-7977 URL: https://issues.apache.org/jira/browse/SPARK-7977 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Reynold Xin Labels: starter Very often we see pull requests that added println from debugging, but the author forgot to remove it before code review. We can use the regex checker to disallow println. For legitimate use of println, we can then disable the rule where they are used. Add to scalastyle-config.xml file:
{code}
<check customId="println" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
  <parameters><parameter name="regex">^println$</parameter></parameters>
  <customMessage><![CDATA[Are you sure you want to println? If yes, wrap the code block with
// scalastyle:off println
println(...)
// scalastyle:on println]]></customMessage>
</check>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601900#comment-14601900 ] Apache Spark commented on SPARK-8567: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7027 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars -- Key: SPARK-8567 URL: https://issues.apache.org/jira/browse/SPARK-8567 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 1.4.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Labels: flaky-test Seems tests in HiveSparkSubmitSuite fail with timeout pretty frequently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7287) Flaky test: o.a.s.deploy.SparkSubmitSuite --packages
[ https://issues.apache.org/jira/browse/SPARK-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601899#comment-14601899 ] Apache Spark commented on SPARK-7287: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7027 Flaky test: o.a.s.deploy.SparkSubmitSuite --packages Key: SPARK-7287 URL: https://issues.apache.org/jira/browse/SPARK-7287 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Burak Yavuz Priority: Critical Labels: flaky-test Error message was not helpful (did not complete within 60 seconds or something). Observed only in master: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/2239/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/2238/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2163/ ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601923#comment-14601923 ] Shay Rojansky commented on SPARK-7736: -- The problem is simply with the YARN status for the application. If a Spark application throws an exception after having instantiated the SparkContext, the application obviously terminates but YARN lists the job as SUCCEEDED. This makes it hard for users to see what happened to their jobs in the YARN UI. Let me know if this is still unclear. Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8067) Add support for connecting to Hive 1.1
[ https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8067: --- Assignee: (was: Apache Spark) Add support for connecting to Hive 1.1 -- Key: SPARK-8067 URL: https://issues.apache.org/jira/browse/SPARK-8067 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)
[ https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8066: --- Assignee: Apache Spark Add support for connecting to Hive 1.0 (0.14.1) --- Key: SPARK-8066 URL: https://issues.apache.org/jira/browse/SPARK-8066 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8067) Add support for connecting to Hive 1.1
[ https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601789#comment-14601789 ] Apache Spark commented on SPARK-8067: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7026 Add support for connecting to Hive 1.1 -- Key: SPARK-8067 URL: https://issues.apache.org/jira/browse/SPARK-8067 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8067) Add support for connecting to Hive 1.1
[ https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8067: --- Assignee: Apache Spark Add support for connecting to Hive 1.1 -- Key: SPARK-8067 URL: https://issues.apache.org/jira/browse/SPARK-8067 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)
[ https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601788#comment-14601788 ] Apache Spark commented on SPARK-8066: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7026 Add support for connecting to Hive 1.0 (0.14.1) --- Key: SPARK-8066 URL: https://issues.apache.org/jira/browse/SPARK-8066 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)
[ https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8066: --- Assignee: (was: Apache Spark) Add support for connecting to Hive 1.0 (0.14.1) --- Key: SPARK-8066 URL: https://issues.apache.org/jira/browse/SPARK-8066 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601914#comment-14601914 ] Neelesh Srinivas Salian commented on SPARK-7736: Could you add more context to the issue? What is the return value / output expected on the applications? Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8643) local-cluster may not shutdown SparkContext gracefully
Yin Huai created SPARK-8643: --- Summary: local-cluster may not shutdown SparkContext gracefully Key: SPARK-8643 URL: https://issues.apache.org/jira/browse/SPARK-8643 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Yin Huai When I was debugging SPARK-8567, I found that when I was using local-cluster, at the end of an application, executors were first killed and then launched again. From the log (attached), it seems the master/driver side does not know it's in the shutdown process. So, it detected executor loss and then called the worker to launch new executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8643) local-cluster may not shutdown SparkContext gracefully
[ https://issues.apache.org/jira/browse/SPARK-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8643: Attachment: HiveSparkSubmitSuite (SPARK-8368).txt local-cluster may not shutdown SparkContext gracefully -- Key: SPARK-8643 URL: https://issues.apache.org/jira/browse/SPARK-8643 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Yin Huai Attachments: HiveSparkSubmitSuite (SPARK-8368).txt When I was debugging SPARK-8567, I found that when I was using local-cluster, at the end of an application, executors were first killed and then launched again. From the log (attached), it seems the master/driver side does not know it's in the shutdown process. So, it detected executor loss and then called the worker to launch new executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8651) Lasso with SGD not Converging properly
Albert Azout created SPARK-8651: --- Summary: Lasso with SGD not Converging properly Key: SPARK-8651 URL: https://issues.apache.org/jira/browse/SPARK-8651 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Albert Azout We are having issues getting Lasso with SGD to converge properly. The weights outputted are extremely large values. We have tried multiple miniBatchRatios and still see the same issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
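For reference, a sketch of the tuning knobs involved (hypothetical data, not a confirmed fix, and it assumes an active SparkContext {{sc}}): extremely large weights usually indicate that gradient descent is diverging, which is commonly addressed by scaling the features and/or using a much smaller step size, in addition to varying the mini-batch fraction as described above:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

// Hypothetical training data; replace with the real RDD[LabeledPoint].
val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, -1.0))))

val numIterations = 200
val stepSize = 0.01          // far smaller than the default of 1.0
val regParam = 0.1
val miniBatchFraction = 1.0  // full-batch gradients for a small data set

val model = LassoWithSGD.train(trainingData, numIterations, stepSize, regParam, miniBatchFraction)
println(model.weights)
{code}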
[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602405#comment-14602405 ] Carson Wang commented on SPARK-8372: [~vanzin] The log path name may also end with an attempt id, like application_xxx_xxx_1.inprogress. This happens when running the app in yarn cluster mode. If we still need to get the app id from the log path name, the attempt id needs to be removed as well if it exists. History server shows incorrect information for application not started -- Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Assignee: Carson Wang Priority: Minor Fix For: 1.4.1, 1.5.0 Attachments: IncorrectAppInfo.png The history server may show an incorrect App ID for an incomplete application like App ID.inprogress. This app info will never disappear even after the app is completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
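A sketch (not the actual history server code) of how the app id could be derived from an event-log file name while dropping both the .inprogress suffix and the optional trailing attempt id mentioned in the comment:
{code}
// Matches names like "application_1435000000000_0001", "application_1435000000000_0001_1"
// (yarn-cluster attempt id) and their ".inprogress" variants, capturing only the application id.
val AppLogName = """(application_\d+_\d+)(?:_\d+)?(?:\.inprogress)?""".r

def appIdFromLogName(name: String): Option[String] = name match {
  case AppLogName(appId) => Some(appId)
  case _ => None
}
{code}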
[jira] [Updated] (SPARK-8620) cleanup CodeGenContext
[ https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-8620: -- Assignee: Wenchen Fan cleanup CodeGenContext -- Key: SPARK-8620 URL: https://issues.apache.org/jira/browse/SPARK-8620 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8635. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7018 [https://github.com/apache/spark/pull/7018] improve performance of CatalystTypeConverters - Key: SPARK-8635 URL: https://issues.apache.org/jira/browse/SPARK-8635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602388#comment-14602388 ] Albert Azout commented on SPARK-1859: - Hi, this is still an open issue for us, FYI. Any new resolutions on this? Linear, Ridge and Lasso Regressions with SGD yield unexpected results - Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Labels: algorithm, machine_learning, regression Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok. Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])
# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}
So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :) The resulting model has nan coeffs: {{array([ nan])}}. Furthermore, if you comment out records line by line you will get:
* [-1.55897475e+296] coeff (the first record is commented),
* [-8.62115396e+104] coeff (the first two records are commented),
* etc.
It looks like the implemented regression algorithms diverge somehow. I get almost the same results on Ridge and Lasso. I've also tested these inputs in scikit-learn and it works as expected there. However, I'm still not sure whether it's a bug or SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
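Not an official resolution, but a common workaround sketch for the divergence described above, written in Scala against the newer MLlib APIs (1.1+) and assuming an active SparkContext {{sc}}: with unscaled features, gradient descent at the default step size can blow up, so standardizing the features (and/or lowering stepSize) usually produces sensible weights. The data mirrors the report, with the first column as the label:
{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val points = sc.parallelize(Seq(
  LabeledPoint(2400.0, Vectors.dense(1500.0)),
  LabeledPoint(240.0, Vectors.dense(150.0)),
  LabeledPoint(24.0, Vectors.dense(15.0)),
  LabeledPoint(2.4, Vectors.dense(1.5)),
  LabeledPoint(0.24, Vectors.dense(0.15))))

// Standardize features to zero mean and unit variance before running SGD.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(points.map(_.features))
val scaled = points.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

val model = LinearRegressionWithSGD.train(scaled, 100, 0.1)  // numIterations, stepSize
println(model.weights)
{code}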
[jira] [Commented] (SPARK-8588) Could not use concat with UDF in where clause
[ https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602401#comment-14602401 ] Wenchen Fan commented on SPARK-8588: cc [~marmbrus] this issue has already been fixed by https://github.com/apache/spark/pull/6145. Could not use concat with UDF in where clause - Key: SPARK-8588 URL: https://issues.apache.org/jira/browse/SPARK-8588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark standalone cluster(or local). Reporter: StanZhai Assignee: Wenchen Fan Priority: Critical After upgraded the cluster from spark 1.3.1 to 1.4.0(rc4), I encountered the following exception when use concat with UDF in where clause: {code} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) at org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299) at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136) at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at
[jira] [Resolved] (SPARK-8237) misc function: sha2
[ https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8237. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6934 [https://github.com/apache/spark/pull/6934] misc function: sha2 --- Key: SPARK-8237 URL: https://issues.apache.org/jira/browse/SPARK-8237 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Fix For: 1.5.0 sha2(string/binary, int): string Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be hashed. The second argument indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256). SHA-224 is supported starting from Java 8. If either argument is NULL or the hash length is not one of the permitted values, the return value is NULL. Example: sha2('ABC', 256) = 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
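Assuming the new expression is also exposed through the DataFrame function API (as the other misc functions are in 1.5), usage would look roughly like this sketch, where {{df}} and its {{name}} column are hypothetical:
{code}
import org.apache.spark.sql.functions.{col, sha2}

// Hypothetical DataFrame `df` with a string column "name".
val hashed = df.select(col("name"), sha2(col("name"), 256).as("name_sha256"))
hashed.show()
{code}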
[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602411#comment-14602411 ] Animesh Baranawal commented on SPARK-8636: -- So the condition should be: if (l == null || r == null) false else l == r CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter CaseKeyWhen implementation in Spark uses the following equals implementation:
{code}
private def equalNullSafe(l: Any, r: Any) = {
  if (l == null && r == null) {
    true
  } else if (l == null || r == null) {
    false
  } else {
    l == r
  }
}
{code}
Which is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL:
{code}
SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END FROM DUAL;
{code}
And the result will be 'NULL DOES NOT MATCH'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8636) CaseKeyWhen has incorrect NULL handling
[ https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602411#comment-14602411 ] Animesh Baranawal edited comment on SPARK-8636 at 6/26/15 5:13 AM: --- So the condition should be: if (l == null || r == null) false else l == r ? was (Author: animeshbaranawal): So the condition should be: if (l == null || r == null) false else l == r CaseKeyWhen has incorrect NULL handling --- Key: SPARK-8636 URL: https://issues.apache.org/jira/browse/SPARK-8636 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Santiago M. Mola Labels: starter CaseKeyWhen implementation in Spark uses the following equals implementation:
{code}
private def equalNullSafe(l: Any, r: Any) = {
  if (l == null && r == null) {
    true
  } else if (l == null || r == null) {
    false
  } else {
    l == r
  }
}
{code}
Which is not correct, since in SQL, NULL is never equal to NULL (actually, it is not unequal either). In this case, a NULL value in a CASE WHEN expression should never match. For example, you can execute this in MySQL:
{code}
SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END FROM DUAL;
{code}
And the result will be 'NULL DOES NOT MATCH'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
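A minimal sketch of the key comparison with the SQL semantics discussed above (NULL never matches anything, including NULL), which is what the proposed condition amounts to:
{code}
// NULL (represented as null here) must not match any key, not even NULL itself.
private def matchesKey(l: Any, r: Any): Boolean = {
  if (l == null || r == null) false else l == r
}
{code}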
[jira] [Commented] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602431#comment-14602431 ] Xiangrui Meng commented on SPARK-8647: -- All MatrixUDT instances are the same. So the hashCode should return a constant. `1994` is just a random number we picked. Feel free to send a PR to add documentation. However, this is not a bug, and I don't think it would cause performance issues. Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just missing code docs. The issue is w.r.t. the MatrixUDT class, if we decide to put instances of MatrixUDT into a hash-based collection. The hashCode function returns a constant, and even though the equals method is consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason why we have this code, we should document it properly in the code so that others reading it are fine. regards, Alok Details = a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala lines 188-197, i.e. override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8647) Potential issues with the constant hashCode
[ https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8647: - Issue Type: Improvement (was: Bug) Potential issues with the constant hashCode Key: SPARK-8647 URL: https://issues.apache.org/jira/browse/SPARK-8647 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Alok Singh Priority: Minor Labels: performance Hi, This may be a potential bug, a performance issue, or just missing code docs. The issue is w.r.t. the MatrixUDT class, if we decide to put instances of MatrixUDT into a hash-based collection. The hashCode function returns a constant, and even though the equals method is consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a constant) has been used. I was expecting it to be similar to the other matrix classes or the vector class. If there is a reason why we have this code, we should document it properly in the code so that others reading it are fine. regards, Alok Details = a) In reference to the file https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala lines 188-197, i.e. override def equals(o: Any): Boolean = { o match { case v: MatrixUDT => true case _ => false } } override def hashCode(): Int = 1994 b) the commit is https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436 on March 20. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
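A sketch of the in-code documentation the reporter asks for (the names and values come from the issue itself; this is not the actual committed change):
{code}
// All MatrixUDT instances are interchangeable, so equality only checks the type ...
override def equals(o: Any): Boolean = o match {
  case v: MatrixUDT => true
  case _ => false
}

// ... and any constant is a hashCode consistent with equals; 1994 is an arbitrarily chosen value.
override def hashCode(): Int = 1994
{code}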
[jira] [Assigned] (SPARK-8625) Propagate user exceptions in tasks back to driver
[ https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8625: --- Assignee: (was: Apache Spark) Propagate user exceptions in tasks back to driver - Key: SPARK-8625 URL: https://issues.apache.org/jira/browse/SPARK-8625 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Tom White Runtime exceptions that are thrown by user code in Spark are presented to the user as strings (message and stacktrace), rather than the exception object itself. If the exception stores information about the error in fields then these cannot be retrieved. Exceptions are Serializable, so it would be feasible to return the original object back to the driver as the cause field in SparkException. This would allow the client to retrieve information from the original exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
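A sketch of what the proposed improvement would enable on the driver side (MyDomainException and rdd, an RDD[Int], are hypothetical; this assumes the task's original exception is attached as the cause of the SparkException, which is the behavior being requested, not the current one):
{code}
import org.apache.spark.SparkException

class MyDomainException(val code: Int, msg: String) extends RuntimeException(msg)

try {
  rdd.map { x => if (x < 0) throw new MyDomainException(42, s"bad value $x") else x }.count()
} catch {
  case e: SparkException => e.getCause match {
    case d: MyDomainException => println(s"task failed with domain error code ${d.code}")
    case _ => throw e
  }
}
{code}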
[jira] [Commented] (SPARK-8626) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601004#comment-14601004 ] Subhod Lagade commented on SPARK-8626: -- [INFO] Compiling 1 source files to /home/appadmin/disneypoc/target/classes at 1435229668035 [ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double. [INFO] Unspecified value parameter product. [INFO] val predictions = model.predict(usersProducts) ALS model predict error --- Key: SPARK-8626 URL: https://issues.apache.org/jira/browse/SPARK-8626 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601023#comment-14601023 ] Subhod Lagade commented on SPARK-8627: -- Can you help me in resolving this? usersProducts is an RDD[(Int, Int)] and it is still giving me the error. ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    // Build the recommendation model using ALS
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}
/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
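For reference, a sketch of the RDD-based predict overload as used in the MLlib collaborative filtering example, building on the ratings and model values from the code above; the explicit RDD[(Int, Int)] annotation makes it clear which predict overload is intended:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

val userProductPairs: RDD[(Int, Int)] =
  ratings.map { case Rating(user, product, _) => (user, product) }

val predictions: RDD[Rating] = model.predict(userProductPairs)
val ratesAndPreds = predictions.map { case Rating(user, product, rate) => ((user, product), rate) }
{code}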
[jira] [Commented] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601024#comment-14601024 ] Subhod Lagade commented on SPARK-8627: -- Can you help me in resolving this? usersProducts is an RDD[(Int, Int)] and it is still giving me the error. ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    // Build the recommendation model using ALS
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}
/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subhod Lagade updated SPARK-8627: - Comment: was deleted (was: can you help me in resolving this ?? usersProducts is a RDD(int,int) it is still giving me error ) ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade /** * Created by subhod lagade on 25/06/15. */ import org.apache.spark.SparkConf import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming._; import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import java.io.BufferedReader; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintStream; import java.net.ServerSocket; import java.net.Socket; import java.util.Properties; import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating object SparkStreamKafka { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Simple Application); val sc = new SparkContext(conf); val data = sc.textFile(/home/appadmin/Disney/data.csv); val ratings = data.map(_.split(',') match { case Array(user, product, rate) = Rating(user.toInt, product.toInt, rate.toDouble) }); val rank = 3; val numIterations = 2; val model = ALS.train(ratings,rank,numIterations,0.01); val usersProducts = ratings.map{ case Rating(user, product, rate) = (user, product)} // Build the recommendation model using ALS usersProducts.foreach(println) val predictions = model.predict(usersProducts) } } /* ERROR Message [ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double. [INFO] Unspecified value parameter product. [INFO] val predictions = model.predict(usersProducts) */ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601082#comment-14601082 ] Kousuke Saruta commented on SPARK-5768: --- I can't change assignee field and I don't know why. I'll try to change again later. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8629) R code in SparkR
Arun created SPARK-8629: --- Summary: R code in SparkR Key: SPARK-8629 URL: https://issues.apache.org/jira/browse/SPARK-8629 Project: Spark Issue Type: Question Components: R Reporter: Arun Priority: Minor
Data set:
DC_City Dc_Code ItemNo Itemdescription date Month Year SalesQuantity
Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1
Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec 2012 1
Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1
Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3
Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/2013 2-Feb 2013 2
Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3
Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2
Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1
Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8
Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18
Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19
Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18
Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18
Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16
Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25
Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19
Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17
Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39
Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23
Code I used in R:
library(dplyr)   # filter() and select() below come from dplyr
data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
factors <- unique(data$ItemNo)
df.allitems <- data.frame()
for (i in 1:length(factors)) {
  data1 <- filter(data, ItemNo == factors[[i]])
  data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity)  # select particular columns
  data2$date <- as.Date(data2$date, format = "%m/%d/%y")  # format the date
  data3 <- data2[order(data2$date), ]                     # order by ascending date
  df.allitems <- rbind(data3, df.allitems)                # append by row bind
}
write.csv(df.allitems, "E:/all_items.csv")
---
I have done some SparkR code:
data1 <- read.csv("D:/Data_sale_quantity_mini.csv")   # read in R
df_1 <- createDataFrame(sqlContext, data1)            # convert the R data.frame to a Spark DataFrame
factors <- distinct(df_1)                             # remove duplicates
# for select I used:
df_2 <- select(distinctDF, "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity")  # select action
I don't know how to: 1) create an empty SparkR DataFrame; 2) use a for loop in SparkR; 3) change the date format; 4) find the length() of a Spark DataFrame; 5) use rbind in SparkR. Can you help me out in doing the above code in SparkR?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) date2$date - as.Date(date2$date, format = %m/%d/%y) data3 - data2[order(data2$date), ] df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) --- I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg.
[jira] [Assigned] (SPARK-8625) Propagate user exceptions in tasks back to driver
[ https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8625: --- Assignee: Apache Spark Propagate user exceptions in tasks back to driver - Key: SPARK-8625 URL: https://issues.apache.org/jira/browse/SPARK-8625 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Tom White Assignee: Apache Spark Runtime exceptions that are thrown by user code in Spark are presented to the user as strings (message and stacktrace), rather than the exception object itself. If the exception stores information about the error in fields then these cannot be retrieved. Exceptions are Serializable, so it would be feasible to return the original object back to the driver as the cause field in SparkException. This would allow the client to retrieve information from the original exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
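As a rough illustration of the improvement described above, here is a minimal sketch of what driver-side code could look like once the original exception travels back as the cause of the SparkException. MyDomainException and the failing record id are hypothetical, and the exact propagation mechanism is whatever the eventual patch implements.
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkException}

// Hypothetical user-defined exception; Throwable is already Serializable,
// so the extra field can travel back to the driver with the exception object.
class MyDomainException(val recordId: Int) extends RuntimeException(s"bad record $recordId")

object UserExceptionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("UserExceptionDemo"))
    try {
      sc.parallelize(1 to 10).map { i =>
        if (i == 7) throw new MyDomainException(i) // thrown inside a task on an executor
        i
      }.count()
    } catch {
      // With the proposed change, the original exception object (not just its string
      // form) would be reachable on the driver as the cause of the SparkException.
      case e: SparkException => e.getCause match {
        case d: MyDomainException => println(s"job failed on record ${d.recordId}")
        case _                    => throw e
      }
    } finally {
      sc.stop()
    }
  }
}
{code}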
[jira] [Reopened] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subhod Lagade reopened SPARK-8627: -- usersProducts is an RDD[(Int, Int)] and it is still giving me the error. There is some issue with model.predict. ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    // Build the recommendation model using ALS
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}

/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-5768. --- Resolution: Fixed Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601098#comment-14601098 ] Sean Owen commented on SPARK-5768: -- I set it, and added you to the Committers role, which should let you change Assignee. I think this is all correct but note that if (unlikely) 1.4.1 is released without another RC then this won't be fixed for 1.4.1. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Assignee: Rekha Joshi Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2- select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select particular columns date2$date - as.Date(date2$date, format = %m/%d/%y) # format the date data3 - data2[order(data2$date), ] # order by assending df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) --- I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) date2$date - as.Date(date2$date, format = %m/%d/%y) data3 - data2[order(data2$date), ] df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) --- I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg.
[jira] [Updated] (SPARK-8624) DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet
[ https://issues.apache.org/jira/browse/SPARK-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rex Xiong updated SPARK-8624: - Description: In 1.4.0, Parquet is read by DataFrameReader.parquet; when creating the ParquetRelation2 object, the parameters map is hard-coded as Map.empty[String, String], so ParquetRelation2.shouldMergeSchemas is always true (the default value). In previous versions, the spark.sql.hive.convertMetastoreParquet.mergeSchema config was respected. This bug degrades performance a lot for a folder with hundreds of Parquet files when we don't want a schema merge. was: In 1.4.0, Parquet is read by DataFrameReader.parquet; when creating the ParquetRelation2 object, Map.empty[String, String] is hard-coded as the parameters, so ParquetRelation2.shouldMergeSchemas is always true (the default value). In previous versions, the spark.sql.hive.convertMetastoreParquet.mergeSchema config was respected. This bug degrades performance a lot for a folder with hundreds of Parquet files when we don't want a schema merge. DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet Key: SPARK-8624 URL: https://issues.apache.org/jira/browse/SPARK-8624 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Rex Xiong Labels: parquet In 1.4.0, Parquet is read by DataFrameReader.parquet; when creating the ParquetRelation2 object, the parameters map is hard-coded as Map.empty[String, String], so ParquetRelation2.shouldMergeSchemas is always true (the default value). In previous versions, the spark.sql.hive.convertMetastoreParquet.mergeSchema config was respected. This bug degrades performance a lot for a folder with hundreds of Parquet files when we don't want a schema merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
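For context, a hedged sketch of how one might work around this from application code, assuming the generic load() path still forwards reader options to the Parquet relation even though DataFrameReader.parquet() drops them; the directory path is made up.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeSchemaWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MergeSchemaWorkaround"))
    val sqlContext = new SQLContext(sc)

    // Go through the generic load() path so the option map reaches the data source;
    // DataFrameReader.parquet() would replace it with Map.empty (the bug above).
    val df = sqlContext.read
      .format("parquet")
      .option("mergeSchema", "false")   // ask the Parquet source to skip schema merging
      .load("/data/events")             // hypothetical folder with hundreds of Parquet files

    println(df.schema.treeString)
    sc.stop()
  }
}
{code}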
[jira] [Updated] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-8628: Description: SPARK-5009 introduced the following code in AbstractSparkSQLParser: {code} def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) => plan case failureOrError => sys.error(failureOrError.toString) } } {code} The corresponding initialize method in SqlLexical is not thread-safe: {code} /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } {code} I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. was: SPARK-5009 introduced the following code: def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) => plan case failureOrError => sys.error(failureOrError.toString) } } The corresponding initialize method in SqlLexical is not thread-safe: /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. Race condition in AbstractSparkSQLParser.parse -- Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Priority: Critical Labels: regression SPARK-5009 introduced the following code in AbstractSparkSQLParser: {code} def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) => plan case failureOrError => sys.error(failureOrError.toString) } } {code} The corresponding initialize method in SqlLexical is not thread-safe: {code} /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } {code} I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
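To make the failure mode concrete, here is a small sketch (not the actual Spark patch) of the general pattern that avoids the race: populate the mutable keyword set exactly once instead of clearing and refilling it on every parse call, so concurrent callers can never observe it half-empty. The class and keyword list are made up for illustration.
{code}
import scala.collection.mutable

// Sketch of one-time, thread-safe initialization of a shared keyword table.
// A Scala lazy val initializer runs at most once, and its completion
// happens-before any later read, so concurrent lookups see the full set.
class KeywordTable(keywords: Seq[String]) {
  private val reserved = mutable.HashSet.empty[String]

  private lazy val ready: Unit = { reserved ++= keywords }

  def isReserved(word: String): Boolean = {
    ready                      // forces the one-time population before any lookup
    reserved.contains(word)
  }
}

object KeywordTableDemo {
  def main(args: Array[String]): Unit = {
    val table = new KeywordTable(Seq("SELECT", "FROM", "WHERE"))
    // Many threads can query concurrently without racing against a clear()/refill cycle.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = assert(table.isReserved("SELECT"))
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{code}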
[jira] [Commented] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601019#comment-14601019 ] Subhod Lagade commented on SPARK-8627: -- Can you help me in resolving this? usersProducts is an RDD[(Int, Int)] and it is still giving me the error. ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    // Build the recommendation model using ALS
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}

/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta closed SPARK-5768. - Resolution: Fixed Fix Version/s: 1.5.0 1.4.1 Target Version/s: 1.5.0 Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601086#comment-14601086 ] Apache Spark commented on SPARK-8628: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/7015 Race condition in AbstractSparkSQLParser.parse -- Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Priority: Critical Labels: regression SPARK-5009 introduced the following code in AbstractSparkSQLParser: {code} def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) = plan case failureOrError = sys.error(failureOrError.toString) } } {code} The corresponding initialize method in SqlLexical is not thread-safe: {code} /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } {code} I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8628: --- Assignee: (was: Apache Spark) Race condition in AbstractSparkSQLParser.parse -- Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Priority: Critical Labels: regression SPARK-5009 introduced the following code in AbstractSparkSQLParser: {code} def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) = plan case failureOrError = sys.error(failureOrError.toString) } } {code} The corresponding initialize method in SqlLexical is not thread-safe: {code} /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } {code} I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8628: --- Assignee: Apache Spark Race condition in AbstractSparkSQLParser.parse -- Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Assignee: Apache Spark Priority: Critical Labels: regression SPARK-5009 introduced the following code in AbstractSparkSQLParser: {code} def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) = plan case failureOrError = sys.error(failureOrError.toString) } } {code} The corresponding initialize method in SqlLexical is not thread-safe: {code} /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } {code} I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8631) MLlib predict function error
Subhod Lagade created SPARK-8631: Summary: MLlib predict function error Key: SPARK-8631 URL: https://issues.apache.org/jira/browse/SPARK-8631 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] Predict the rating of many users for many products. The output RDD has an element per each element in the input RDD (including all duplicates) unless a user or product is missing in the training set. usersProducts RDD of (user, product) pairs. returns RDD of Ratings. def predict(user: Int, product: Int): Double Predict the rating of one user for one product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
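The two overloads quoted above can be exercised as follows; this is only a usage sketch with a made-up CSV path, illustrating that the batch form takes an RDD[(Int, Int)] while the other form takes a single (user, product) pair.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

object PredictOverloads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PredictOverloads"))

    // Hypothetical "user,product,rating" CSV.
    val ratings = sc.textFile("/data/ratings.csv").map(_.split(',') match {
      case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble)
    })
    val model = ALS.train(ratings, 3, 2, 0.01)

    // Batch overload: RDD of (user, product) pairs in, RDD[Rating] out.
    val usersProducts: RDD[(Int, Int)] = ratings.map { case Rating(u, p, _) => (u, p) }
    model.predict(usersProducts).take(5).foreach(println)

    // Single-pair overload: one user id and one product id in, a Double out.
    println(model.predict(1, 15010))

    sc.stop()
  }
}
{code}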
[jira] [Commented] (SPARK-8625) Propagate user exceptions in tasks back to driver
[ https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600997#comment-14600997 ] Apache Spark commented on SPARK-8625: - User 'tomwhite' has created a pull request for this issue: https://github.com/apache/spark/pull/7014 Propagate user exceptions in tasks back to driver - Key: SPARK-8625 URL: https://issues.apache.org/jira/browse/SPARK-8625 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Tom White Runtime exceptions that are thrown by user code in Spark are presented to the user as strings (message and stacktrace), rather than the exception object itself. If the exception stores information about the error in fields then these cannot be retrieved. Exceptions are Serializable, so it would be feasible to return the original object back to the driver as the cause field in SparkException. This would allow the client to retrieve information from the original exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8626) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601002#comment-14601002 ] Subhod Lagade commented on SPARK-8626: --
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    // Build the recommendation model using ALS
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}
ALS model predict error --- Key: SPARK-8626 URL: https://issues.apache.org/jira/browse/SPARK-8626 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reopened SPARK-5768: --- Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8630) Prevent from checkpointing QueueInputDStream
[ https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8630: --- Assignee: Apache Spark Prevent from checkpointing QueueInputDStream Key: SPARK-8630 URL: https://issues.apache.org/jira/browse/SPARK-8630 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Assignee: Apache Spark It's better to prevent from checkpointing QueueInputDStream rather than failing the application when recovering `QueueInputDStream`, so that people can find the issue as soon as possible. See SPARK-8553 for example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
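For readers unfamiliar with the scenario, the sketch below shows the combination the issue is about: a queueStream (a QueueInputDStream underneath) used together with checkpointing. The paths are hypothetical; with the proposed change the incompatibility would be reported when checkpointing is set up rather than only when recovering from a checkpoint.
{code}
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamCheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QueueStreamCheckpointDemo"))
    val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("/tmp/checkpoints")       // checkpointing enabled

    val queue = new mutable.Queue[RDD[Int]]()
    queue += sc.parallelize(1 to 10)
    val stream = ssc.queueStream(queue)      // backed by QueueInputDStream, which cannot be checkpointed
    stream.count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}
{code}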
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2- select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) date2$date - as.Date(date2$date, format = %m/%d/%y) data3 - data2[order(data2$date), ] df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) You can see the code clearly in - - http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-tp23489.html - I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg.
[jira] [Resolved] (SPARK-8574) org/apache/spark/unsafe doesn't honor the java source/target versions
[ https://issues.apache.org/jira/browse/SPARK-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-8574. -- Resolution: Fixed Fix Version/s: 1.4.1 org/apache/spark/unsafe doesn't honor the java source/target versions - Key: SPARK-8574 URL: https://issues.apache.org/jira/browse/SPARK-8574 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves Assignee: Thomas Graves Fix For: 1.4.1 I built spark using jdk8 and the default source compatibility in the pom is 1.6 so I expected to be able to run Spark with jdk7, but if fails because the unsafe code doesn't seem to be honoring the source/target compatibility options set in the top level pom. Exception in thread main java.lang.UnsupportedClassVersionError: org/apache/spark/unsafe/memory/MemoryAllocator : Unsupported major.minor version 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:791) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:356) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:392) at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:211) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:180) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:74) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) 15/06/23 19:48:24 INFO storage.DiskBlockManager: Shutdown hook called -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8629. -- Resolution: Invalid R code in SparkR Key: SPARK-8629 URL: https://issues.apache.org/jira/browse/SPARK-8629 Project: Spark Issue Type: Question Components: R Reporter: Arun Priority: Minor Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2- select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) date2$date - as.Date(date2$date, format = %m/%d/%y) data3 - data2[order(data2$date), ] df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) You can see the code clearly in - - http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-tp23489.html - I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8626) ALS model predict error
Subhod Lagade created SPARK-8626: Summary: ALS model predict error Key: SPARK-8626 URL: https://issues.apache.org/jira/browse/SPARK-8626 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8627) ALS model predict error
Subhod Lagade created SPARK-8627: Summary: ALS model predict error Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
/** * Created by subhod lagade on 25/06/15. */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application");
    val sc = new SparkContext(conf);
    val data = sc.textFile("/home/appadmin/Disney/data.csv");
    val ratings = data.map(_.split(',') match { case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
    val rank = 3;
    val numIterations = 2;
    val model = ALS.train(ratings, rank, numIterations, 0.01);
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    // Build the recommendation model using ALS
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}

/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty
[ https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601008#comment-14601008 ] Animesh Baranawal commented on SPARK-8621: -- How about enclosing the column names and row names in ? crosstab exception when one of the value is empty - Key: SPARK-8621 URL: https://issues.apache.org/jira/browse/SPARK-8621 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Critical I think this happened because some value is empty. {code} scala df1.stat.crosstab(role, lang) org.apache.spark.sql.AnalysisException: syntax error in attribute name: ; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157) at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603) at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394) at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160) at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147) at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132) at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132) at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
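A minimal reproduction along the lines of the stack trace above might look like the following sketch; the column names and data are made up, the key point being that one of the values is an empty string, which then becomes an unparseable column name in the crosstab output.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CrosstabEmptyValue {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CrosstabEmptyValue"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df1 = sc.parallelize(Seq(
      ("engineer", "scala"),
      ("analyst", ""),          // empty value ends up as an empty column name
      ("engineer", "python")
    )).toDF("role", "lang")

    // Expected to hit the AnalysisException described above on affected versions.
    df1.stat.crosstab("role", "lang").show()

    sc.stop()
  }
}
{code}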
[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subhod Lagade updated SPARK-8627: - Comment: was deleted (was: usersProducts is a RDD(int,int) it is still giving me error There is some issue with model.predict) ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade /** * Created by subhod lagade on 25/06/15. */ import org.apache.spark.SparkConf import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming._; import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import java.io.BufferedReader; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintStream; import java.net.ServerSocket; import java.net.Socket; import java.util.Properties; import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating object SparkStreamKafka { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Simple Application); val sc = new SparkContext(conf); val data = sc.textFile(/home/appadmin/Disney/data.csv); val ratings = data.map(_.split(',') match { case Array(user, product, rate) = Rating(user.toInt, product.toInt, rate.toDouble) }); val rank = 3; val numIterations = 2; val model = ALS.train(ratings,rank,numIterations,0.01); val usersProducts = ratings.map{ case Rating(user, product, rate) = (user, product)} // Build the recommendation model using ALS usersProducts.foreach(println) val predictions = model.predict(usersProducts) } } /* ERROR Message [ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double. [INFO] Unspecified value parameter product. [INFO] val predictions = model.predict(usersProducts) */ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8625) Propagate user exceptions in tasks back to driver
Tom White created SPARK-8625: Summary: Propagate user exceptions in tasks back to driver Key: SPARK-8625 URL: https://issues.apache.org/jira/browse/SPARK-8625 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Tom White Runtime exceptions that are thrown by user code in Spark are presented to the user as strings (message and stacktrace), rather than the exception object itself. If the exception stores information about the error in fields then these cannot be retrieved. Exceptions are Serializable, so it would be feasible to return the original object back to the driver as the cause field in SparkException. This would allow the client to retrieve information from the original exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
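A hedged sketch of what this improvement would enable on the driver side; the custom exception, the RDD contents, and the SparkContext named sc are illustrative assumptions, not code from the issue:
{code}
import org.apache.spark.SparkException

case class BadRecordException(recordId: Long) extends RuntimeException(s"bad record $recordId")

val ids = sc.parallelize(Seq(1L, 2L, -1L))
try {
  ids.map { id => if (id < 0) throw BadRecordException(id); id }.count()
} catch {
  case e: SparkException => e.getCause match {
    // With this change the original exception object would be reachable as the cause,
    // so its fields (recordId here) could be inspected by the client.
    case BadRecordException(id) => println(s"failed on record $id")
    // Current behavior: the cause is not the user exception, only a string form of it.
    case _ => throw e
  }
}
{code}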
[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subhod Lagade updated SPARK-8627: - Comment: was deleted (was: can you help me in resolving this ?? usersProducts is a RDD(int,int) it is still giving me error) ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade /** * Created by subhod lagade on 25/06/15. */ import org.apache.spark.SparkConf import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming._; import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import java.io.BufferedReader; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintStream; import java.net.ServerSocket; import java.net.Socket; import java.util.Properties; import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating object SparkStreamKafka { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Simple Application); val sc = new SparkContext(conf); val data = sc.textFile(/home/appadmin/Disney/data.csv); val ratings = data.map(_.split(',') match { case Array(user, product, rate) = Rating(user.toInt, product.toInt, rate.toDouble) }); val rank = 3; val numIterations = 2; val model = ALS.train(ratings,rank,numIterations,0.01); val usersProducts = ratings.map{ case Rating(user, product, rate) = (user, product)} // Build the recommendation model using ALS usersProducts.foreach(println) val predictions = model.predict(usersProducts) } } /* ERROR Message [ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double. [INFO] Unspecified value parameter product. [INFO] val predictions = model.predict(usersProducts) */ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5768: - Assignee: Rekha Joshi Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Assignee: Rekha Joshi Priority: Trivial Fix For: 1.4.1, 1.5.0 I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select particular columns date2$date - as.Date(date2$date, format = %m/%d/%y) # format the date data3 - data2[order(data2$date), ] # order by assending df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) --- I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19
[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
[ https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601012#comment-14601012 ] Santiago M. Mola commented on SPARK-8628: - Here is an example of failure with Spark 1.4.0: {code} [1.152] failure: ``union'' expected but identifier OR found SELECT CASE a+1 WHEN b THEN 111 WHEN c THEN 222 WHEN d THEN 333 WHEN e THEN 444 ELSE 555 END, a-b, a FROM t1 WHERE e+d BETWEEN a+b-10 AND c+130 OR ab OR de ^ java.lang.RuntimeException: [1.152] failure: ``union'' expected but identifier OR found SELECT CASE a+1 WHEN b THEN 111 WHEN c THEN 222 WHEN d THEN 333 WHEN e THEN 444 ELSE 555 END, a-b, a FROM t1 WHERE e+d BETWEEN a+b-10 AND c+130 OR ab OR de ^ at scala.sys.package$.error(package.scala:27) {code} Race condition in AbstractSparkSQLParser.parse -- Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Santiago M. Mola Priority: Critical Labels: regression SPARK-5009 introduced the following code: def parse(input: String): LogicalPlan = { // Initialize the Keywords. lexical.initialize(reservedWords) phrase(start)(new lexical.Scanner(input)) match { case Success(plan, _) = plan case failureOrError = sys.error(failureOrError.toString) } } The corresponding initialize method in SqlLexical is not thread-safe: /* This is a work around to support the lazy setting */ def initialize(keywords: Seq[String]): Unit = { reserved.clear() reserved ++= keywords } I'm hitting this when parsing multiple SQL queries concurrently. When one query parsing starts, it empties the reserved keyword list, then a race-condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8626) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8626. -- Resolution: Duplicate ... and you opened it twice. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and take care before opening a JIRA ALS model predict error --- Key: SPARK-8626 URL: https://issues.apache.org/jira/browse/SPARK-8626 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8627. -- Resolution: Invalid This is a compile error in your own code. ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade /** * Created by subhod lagade on 25/06/15. */ import org.apache.spark.SparkConf import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming._; import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import java.io.BufferedReader; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintStream; import java.net.ServerSocket; import java.net.Socket; import java.util.Properties; import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating object SparkStreamKafka { def main(args: Array[String]) { val conf = new SparkConf().setAppName(Simple Application); val sc = new SparkContext(conf); val data = sc.textFile(/home/appadmin/Disney/data.csv); val ratings = data.map(_.split(',') match { case Array(user, product, rate) = Rating(user.toInt, product.toInt, rate.toDouble) }); val rank = 3; val numIterations = 2; val model = ALS.train(ratings,rank,numIterations,0.01); val usersProducts = ratings.map{ case Rating(user, product, rate) = (user, product)} // Build the recommendation model using ALS usersProducts.foreach(println) val predictions = model.predict(usersProducts) } } /* ERROR Message [ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double. [INFO] Unspecified value parameter product. [INFO] val predictions = model.predict(usersProducts) */ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse
Santiago M. Mola created SPARK-8628: --- Summary: Race condition in AbstractSparkSQLParser.parse Key: SPARK-8628 URL: https://issues.apache.org/jira/browse/SPARK-8628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.3.0 Reporter: Santiago M. Mola Priority: Critical SPARK-5009 introduced the following code:
{code}
def parse(input: String): LogicalPlan = {
  // Initialize the Keywords.
  lexical.initialize(reservedWords)
  phrase(start)(new lexical.Scanner(input)) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
{code}
The corresponding initialize method in SqlLexical is not thread-safe:
{code}
/* This is a work around to support the lazy setting */
def initialize(keywords: Seq[String]): Unit = {
  reserved.clear()
  reserved ++= keywords
}
{code}
I'm hitting this when parsing multiple SQL queries concurrently. When one query's parsing starts, it empties the reserved keyword list, then a race condition occurs and other queries fail to parse because they recognize keywords as identifiers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
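A hedged sketch of how the race can be provoked, assuming a SQLContext named sqlContext and firing many parse calls concurrently; the query text and the number of futures are arbitrary:
{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// With the non-thread-safe keyword initialization, some of these concurrent calls can fail
// with errors such as: failure: ``union'' expected but identifier ... found
val futures = (1 to 100).map { _ =>
  Future(sqlContext.sql("SELECT 1 AS a UNION SELECT 2 AS a").collect())
}
futures.foreach(f => Await.result(f, 1.minute))
{code}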
[jira] [Updated] (SPARK-8629) R code in SparkR
[ https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun updated SPARK-8629: Description: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/16/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/17/2013 2-Feb 2013 19 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/18/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/19/2013 2-Feb 2013 18 Hyderabad 11 15012 more. Value Chana Dal 1 Kg. 2/20/2013 2-Feb 2013 16 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/21/2013 2-Feb 2013 25 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/22/2013 2-Feb 2013 19 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/23/2013 2-Feb 2013 17 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/24/2013 2-Feb 2013 39 Hyderabad 11 15013 more. Value Chana Dal 1 Kg. 2/25/2013 2-Feb 2013 23 Code i used in R: data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) factors - unique(data$ItemNo) df.allitems - data.frame() for(i in 1:length(factors)) { data1 - filter(data, ItemNo == factors[[i]]) data2- select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) date2$date - as.Date(date2$date, format = %m/%d/%y) data3 - data2[order(data2$date), ] df.allitems - rbind(data3 , df.allitems) # Append by row bind } write.csv(df.allitems,E:/all_items.csv) --- I have done some SparkR code: data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF factors - distinct(df_1) # removed duplicates #for select i used: df_2 - select(distinctDF ,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select action I dont know how to: 1) create a empty sparkR DF 2) Using for loop in SparkR 3) change the date format. 4) find the lenght() in spark df 5) using rbind in sparkR can you help me out in doing the above code in sparkR. was: Data set: DC_City Dc_Code ItemNo Itemdescription dat Month YearSalesQuantity Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 9/16/2012 9-Sep 2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 12/21/2012 12-Dec2012 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/12/2013 1-Jan 2013 1 Hyderabad 11 15010 more. Value Chana Dal 1 Kg. 1/27/2013 1-Jan 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/1/20132-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/12/2013 2-Feb 2013 3 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/13/2013 2-Feb 2013 2 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/14/2013 2-Feb 2013 1 Hyderabad 11 15011 more. Value Chana Dal 1 Kg. 2/15/2013 2-Feb 2013 8 Hyderabad
[jira] [Created] (SPARK-8630) Prevent from checkpointing QueueInputDStream
Shixiong Zhu created SPARK-8630: --- Summary: Prevent from checkpointing QueueInputDStream Key: SPARK-8630 URL: https://issues.apache.org/jira/browse/SPARK-8630 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu It's better to prevent checkpointing QueueInputDStream than to fail the application when recovering `QueueInputDStream`, so that people can find the issue as soon as possible. See SPARK-8553 for an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
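A hedged sketch of the combination this issue wants to reject eagerly, assuming a StreamingContext named ssc; the queue contents and checkpoint path are illustrative:
{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val queue = mutable.Queue[RDD[Int]]()
val stream = ssc.queueStream(queue)
ssc.checkpoint("/tmp/checkpoint")
// Today this only fails later, when QueueInputDStream is recovered from the checkpoint;
// the proposal is to surface the problem as soon as checkpointing is enabled.
{code}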
[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests
[ https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601898#comment-14601898 ] Neelesh Srinivas Salian commented on SPARK-4352: Checking to see if this has been resolved? Incorporate locality preferences in dynamic allocation requests --- Key: SPARK-4352 URL: https://issues.apache.org/jira/browse/SPARK-4352 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Saisai Shao Priority: Critical Attachments: Supportpreferrednodelocationindynamicallocation.pdf Currently, achieving data locality in Spark is difficult unless an application takes resources on every node in the cluster. preferredNodeLocalityData provides a sort of hacky workaround that has been broken since 1.0. With dynamic executor allocation, Spark requests executors in response to demand from the application. When this occurs, it would be useful to look at the pending tasks and communicate their location preferences to the cluster resource manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath
[ https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600942#comment-14600942 ] Sean Owen commented on SPARK-8622: -- I don't think that is intended or even reasonable behavior. This mechanism is for transferring JARs to put on the classpath, not putting arbitrary files on the executor. Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath -- Key: SPARK-8622 URL: https://issues.apache.org/jira/browse/SPARK-8622 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1, 1.4.0 Reporter: Baswaraj I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the executor's working directory, but not on the executor classpath. Also, the executor working directory is not on the executor classpath. I am expecting the executor to find all files specified in the spark-submit --jars option. In Spark 1.3.0 the executor working directory is on the executor classpath, so the app runs successfully. To successfully run my application with Spark 1.3.1+, I have to use the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
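For completeness, the workaround quoted in the report as it would appear in conf/spark-defaults.conf; whether it helps depends on the executor's working directory actually containing the shipped files:
{code}
# conf/spark-defaults.conf - workaround described in the report:
# add the executor's working directory ('.') to its classpath
spark.executor.extraClassPath .
{code}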
[jira] [Commented] (SPARK-8615) sql programming guide recommends deprecated code
[ https://issues.apache.org/jira/browse/SPARK-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600945#comment-14600945 ] Sean Owen commented on SPARK-8615: -- Sure, open a PR? sql programming guide recommends deprecated code Key: SPARK-8615 URL: https://issues.apache.org/jira/browse/SPARK-8615 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.4.0 Reporter: Gergely Svigruha Priority: Minor The Spark 1.4 sql programming guide has example code on how to use JDBC tables: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases sqlContext.load("jdbc", Map(...)) However this code compiles with a deprecation warning, and the warning recommends doing this instead: sqlContext.read.format("jdbc").options(Map(...)).load() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
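The non-deprecated form the reporter is asking the guide to show, written out with placeholder connection values:
{code}
// Placeholder url/dbtable values; fill in the real connection details.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",
    "dbtable" -> "schema.tablename"))
  .load()
{code}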
[jira] [Updated] (SPARK-8642) Ungraceful failure when yarn client is not configured.
[ https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-8642: --- Attachment: yarnretries.log Log file from a Spark job that failed because of the misconfiguration. Counting the lines that contain 9 retries gives: cat yarnretries.log | grep 'Already tried 9 time(s);' | wc -l 31 Ungraceful failure when yarn client is not configured. -- Key: SPARK-8642 URL: https://issues.apache.org/jira/browse/SPARK-8642 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.3.1 Reporter: Juliet Hougland Priority: Minor Attachments: yarnretries.log When HADOOP_CONF_DIR is not configured (ie yarn-site.xml is not available) the yarn client will try to submit an application. No connection to the resource manager will be able to be established. The client will try to connect 10 times (with a max retry of ten), and then do that 30 more times. This takes about 5 minutes before an error is recorded for Spark context initialization, which is caused by a connect exception. I would expect that after the first 10 tries fail, the initialization of the Spark context should fail too. At least that is what I would think given the logs. An earlier failure would be ideal/preferred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8642) Ungraceful failure when yarn client is not configured.
Juliet Hougland created SPARK-8642: -- Summary: Ungraceful failure when yarn client is not configured. Key: SPARK-8642 URL: https://issues.apache.org/jira/browse/SPARK-8642 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.1, 1.3.0 Reporter: Juliet Hougland Priority: Minor When HADOOP_CONF_DIR is not configured (ie yarn-site.xml is not available) the yarn client will try to submit an application. No connection to the resource manager will be able to be established. The client will try to connect 10 times (with a max retry of ten), and then do that 30 more times. This takes about 5 minutes before an error is recorded for Spark context initialization, which is caused by a connect exception. I would expect that after the first 10 tries fail, the initialization of the Spark context should fail too. At least that is what I would think given the logs. An earlier failure would be ideal/preferred. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8644) SparkException thrown due to Executor exceptions should include caller site in stack trace
Aaron Davidson created SPARK-8644: - Summary: SparkException thrown due to Executor exceptions should include caller site in stack trace Key: SPARK-8644 URL: https://issues.apache.org/jira/browse/SPARK-8644 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Currently when a job fails due to executor (or other) issues, the exception thrown by Spark has a stack trace which stops at the DAGScheduler EventLoop, which makes it hard to trace back to the user code which submitted the job. It should try to include the user submission stack trace. Example exception today: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.RuntimeException: uh-oh! at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1637) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1285) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1276) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1275) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1275) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:749) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1486) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) {code} Here is the part I want to include: {code} at org.apache.spark.rdd.RDD.count(RDD.scala:1095) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply$mcJ$sp(DAGSchedulerSuite.scala:851) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851) at org.scalatest.Assertions$class.intercept(Assertions.scala:997) at 
org.scalatest.FunSuite.intercept(FunSuite.scala:1555) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply$mcV$sp(DAGSchedulerSuite.scala:850) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849) at org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at
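A hedged sketch of the general idea only, not the actual DAGScheduler change: append the job-submission call site to the driver-side exception so user code shows up in the trace; the helper name is illustrative:
{code}
// Combine the driver-side failure's trace with the stack trace captured at the call
// site where the job was submitted, so users can see both in one exception.
def withCallerStackTrace(e: Throwable, callSite: Array[StackTraceElement]): Throwable = {
  e.setStackTrace(e.getStackTrace ++ callSite)
  e
}
{code}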
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601273#comment-14601273 ] biao luo commented on SPARK-2883: - peopleSchemaRDD.saveAsOrcFile("people.orc") val orcFile = ctx.orcFile("people.orc") saveAsOrcFile and orcFile cannot be found in the Spark 1.4 source code. Why? They are not found on DataFrame either. Where can I find this API? Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: New Feature Components: Input/Output, SQL Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Critical Fix For: 1.4.0 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
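A hedged sketch of reading and writing ORC through the generic data source API that the SPARK-2883 work adds in 1.4. It assumes an existing SparkContext sc and a HiveContext (the ORC support lives in the Hive module); the paths are illustrative, and the exact provider string accepted by format() may vary across 1.4.x builds:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val people = hiveContext.read.json("examples/src/main/resources/people.json")

// Write and read back ORC via the data source API instead of saveAsOrcFile/orcFile.
people.write.format("orc").save("people.orc")
val orcDF = hiveContext.read.format("orc").load("people.orc")
{code}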
[jira] [Updated] (SPARK-8546) PMML export for Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8546: - Labels: (was: starter) PMML export for Naive Bayes --- Key: SPARK-8546 URL: https://issues.apache.org/jira/browse/SPARK-8546 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor The naive Bayes section of PMML standard can be found at http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to generate PMML for both binomial and multinomial naive Bayes models using JPMML (maybe [~vfed] can help). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add starter label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133) * Improve GMM scalability and stability (SPARK-7206) * Frequent pattern mining improvements (SPARK-7211) * R-like stats for ML models (SPARK-7674) * Generalize classification threshold to multiclass (SPARK-8069) * A/B testing (SPARK-3147) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7879) * naive Bayes (SPARK-8600) * TrainValidationSplit for tuning (SPARK-8484) h2. 
Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) * Python API for streaming ML algorithms (SPARK-3258) * Add missing model methods (SPARK-8633) h2. SparkR API for ML * ML Pipeline API in SparkR (SPARK-6805) * model.matrix for DataFrames (SPARK-6823) h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully.
[jira] [Created] (SPARK-8634) Fix flaky test StreamingListenerSuite receiver info reporting
Shixiong Zhu created SPARK-8634: --- Summary: Fix flaky test StreamingListenerSuite receiver info reporting Key: SPARK-8634 URL: https://issues.apache.org/jira/browse/SPARK-8634 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Shixiong Zhu Priority: Minor As per the unit test log in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35754/ {code} 15/06/24 23:09:10.210 Thread-3495 INFO ReceiverTracker: Starting 1 receivers 15/06/24 23:09:10.270 Thread-3495 INFO SparkContext: Starting job: apply at Transformer.scala:22 ... 15/06/24 23:09:14.259 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Started receiver and sleeping 15/06/24 23:09:14.270 ForkJoinPool-4-worker-29 INFO StreamingListenerSuiteReceiver: Reporting error and sleeping {code} it needs at least 4 seconds to receive all receiver events in this slow machine, but `timeout` for `eventually` is only 2 seconds. We can increase `timeout` to make this test stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
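A hedged sketch of the proposed fix using ScalaTest's eventually with a longer timeout; collector and its receiverInfos field stand in for the suite's listener state and are not the actual test code:
{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Allow up to 10 seconds (instead of 2) for all receiver events to arrive on slow machines.
eventually(timeout(10.seconds), interval(20.milliseconds)) {
  assert(collector.receiverInfos.nonEmpty)
}
{code}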
[jira] [Resolved] (SPARK-8631) MLlib predict function error
[ https://issues.apache.org/jira/browse/SPARK-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8631. -- Resolution: Invalid This is the third time you have opened this. As I explained this is not a valid JIRA. Please do not open any more. MLlib predict function error Key: SPARK-8631 URL: https://issues.apache.org/jira/browse/SPARK-8631 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating] Predict the rating of many users for many products. The output RDD has an element per each element in the input RDD (including all duplicates) unless a user or product is missing in the training set. usersProducts RDD of (user, product) pairs. returns RDD of Ratings. def predict(user: Int, product: Int): Double Predict the rating of one user for one product. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601348#comment-14601348 ] Xiangrui Meng commented on SPARK-7337: -- How large is the `minSupport`? The number of frequent itemsets grows exponentially as minSupport decreases. So please start with a really large value (close to 1.0) and gradually reduce it. FPGrowth algo throwing OutOfMemoryError --- Key: SPARK-7337 URL: https://issues.apache.org/jira/browse/SPARK-7337 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Environment: Ubuntu Reporter: Amit Gupta Attachments: FPGrowthBug.png When running the FPGrowth algorithm with huge data (GBs) and numPartitions=500, it throws an OutOfMemoryError after some time. The algorithm runs correctly up to the collect at FPGrowth.scala:131, where it creates 500 tasks. It fails at the next stage, flatMap at FPGrowth.scala:150, where it does not create 500 tasks but instead an internally calculated 17 tasks. Please refer to the attached screenshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
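A hedged sketch of the tuning approach suggested above, assuming transactions: RDD[Array[String]] is already loaded; the concrete parameter values are illustrative:
{code}
import org.apache.spark.mllib.fpm.FPGrowth

// Start with a high minSupport (close to 1.0) and lower it gradually while watching memory use.
val model = new FPGrowth()
  .setMinSupport(0.9)
  .setNumPartitions(500)
  .run(transactions)
println(s"frequent itemsets found: ${model.freqItemsets.count()}")
{code}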
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add starter label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133) * Improve GMM scalability and stability (SPARK-7206) * Frequent pattern mining improvements (SPARK-7211) * R-like stats for ML models (SPARK-7674) * Generalize classification threshold to multiclass (SPARK-8069) * A/B testing (SPARK-3147) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7879) * naive Bayes (SPARK-8600) * TrainValidationSplit for tuning (SPARK-8484) h2. 
Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML * ML Pipeline API in SparkR (SPARK-6805) * model.matrix for DataFrames (SPARK-6823) h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark
[jira] [Updated] (SPARK-6805) ML Pipeline API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6805: - Assignee: (was: Xiangrui Meng) ML Pipeline API in SparkR - Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8635: --- Assignee: (was: Apache Spark) improve performance of CatalystTypeConverters - Key: SPARK-8635 URL: https://issues.apache.org/jira/browse/SPARK-8635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8635: --- Assignee: Apache Spark improve performance of CatalystTypeConverters - Key: SPARK-8635 URL: https://issues.apache.org/jira/browse/SPARK-8635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8635) improve performance of CatalystTypeConverters
[ https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601375#comment-14601375 ] Apache Spark commented on SPARK-8635: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7018 improve performance of CatalystTypeConverters - Key: SPARK-8635 URL: https://issues.apache.org/jira/browse/SPARK-8635 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)
[ https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601385#comment-14601385 ] Peter Prettenhofer commented on SPARK-5133: --- [~josephkb] definitely - will start compiling a PR for feature importance via decrease in impurity. Feature Importance for Decision Tree (Ensembles) Key: SPARK-5133 URL: https://issues.apache.org/jira/browse/SPARK-5133 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Peter Prettenhofer Original Estimate: 168h Remaining Estimate: 168h Add feature importance to decision tree model and tree ensemble models. If people are interested in this feature I could implement it given a mentor (API decisions, etc). Please find a description of the feature below: Decision trees intrinsically perform feature selection by selecting appropriate split points. This information can be used to assess the relative importance of a feature. Relative feature importance gives valuable insight into a decision tree or tree ensemble and can even be used for feature selection. More information on feature importance (via decrease in impurity) can be found in ESLII (10.13.1) or here [1]. R's randomForest package uses a different technique for assessing variable importance that is based on permutation tests. All necessary information to create relative importance scores should be available in the tree representation (class Node; split, impurity gain, (weighted) nr of samples?). [1] http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
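A purely illustrative toy computation of the impurity-based importance described above, not the MLlib tree API: sum the impurity gain of every split per feature, weight by the samples reaching that split, and normalize; all types here are hypothetical:
{code}
// ToySplit stands in for whatever split information a tree node exposes.
case class ToySplit(featureIndex: Int, gain: Double, weightedSamples: Double)

def featureImportances(splits: Seq[ToySplit]): Map[Int, Double] = {
  // Accumulate weighted impurity decrease per feature.
  val raw = splits
    .groupBy(_.featureIndex)
    .mapValues(s => s.map(n => n.gain * n.weightedSamples).sum)
  val total = raw.values.sum
  raw.mapValues(_ / total).toMap // normalize so the importances sum to 1
}
{code}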
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (3) JIRAs at the same time. Try to finish them one after another. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours. h2. For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add starter label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary. h1. Roadmap (WIP) This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks. h2. Algorithms and performance * LDA improvements (SPARK-5572) * Log-linear model for survival analysis (SPARK-8518) * Improve GLM's scalability on number of features (SPARK-8520) * Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133) * Improve GMM scalability and stability (SPARK-7206) * Frequent pattern mining improvements (SPARK-7211) * R-like stats for ML models (SPARK-7674) * Generalize classification threshold to multiclass (SPARK-8069) * A/B testing (SPARK-3147) h2. Pipeline API * more feature transformers (SPARK-8521) * k-means (SPARK-7879) * naive Bayes (SPARK-8600) * TrainValidationSplit for tuning (SPARK-8484) h2. 
Model persistence * more PMML export (SPARK-8545) * model save/load (SPARK-4587) * pipeline persistence (SPARK-6725) h2. Python API for ML * List of issues identified during Spark 1.4 QA: (SPARK-7536) h2. SparkR API for ML h2. Documentation * [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)] was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601275#comment-14601275 ] biao luo commented on SPARK-2883: - peopleSchemaRDD.saveAsOrcFile("people.orc") and val orcFile = ctx.orcFile("people.orc"): I cannot find saveAsOrcFile or orcFile in the Spark 1.4 source code. Why? I cannot find them on DataFrame either. Where can I find this API? Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: New Feature Components: Input/Output, SQL Reporter: Zhan Zhang Assignee: Zhan Zhang Priority: Critical Fix For: 1.4.0 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png, orc.diff Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
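For reference on the SPARK-2883 comment above: in Spark 1.4, ORC support is exposed through the DataFrame reader/writer API on a HiveContext rather than through saveAsOrcFile/orcFile. A minimal PySpark sketch (paths and sample data are illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-example")
sqlContext = HiveContext(sc)  # ORC support requires a HiveContext in 1.4

# Illustrative DataFrame; any DataFrame can be written out as ORC.
people = sqlContext.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Write as ORC, then read it back.
people.write.format("orc").save("people.orc")
orcDF = sqlContext.read.format("orc").load("people.orc")
orcDF.show()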
[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4127: - Target Version/s: 1.5.0 Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Assignee: Manoj Kumar Create Python bindings for Streaming Linear Regression (MLlib). The MLlib example relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
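The binding requested in SPARK-4127 would mirror the Scala StreamingLinearRegressionWithSGD usage in the example linked above. A hypothetical PySpark sketch of that usage (class name, method names, directories, and input format are assumptions mirroring the Scala API, not a shipped 1.4 API):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

def parse(line):
    # Assumed input format: "label,feature1 feature2"
    label, features = line.split(",")
    return LabeledPoint(float(label), Vectors.dense([float(x) for x in features.split()]))

sc = SparkContext(appName="streaming-linear-regression")
ssc = StreamingContext(sc, 1)  # 1-second batches

trainingData = ssc.textFileStream("training_dir").map(parse)  # illustrative directories
testData = ssc.textFileStream("test_dir").map(parse)

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights(Vectors.dense([0.0, 0.0]))

model.trainOn(trainingData)  # update the model weights on every training batch
model.predictOnValues(testData.map(lambda lp: (lp.label, lp.features))).pprint()

ssc.start()
ssc.awaitTermination()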
[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4127: - Assignee: Manoj Kumar Streaming Linear Regression- Python bindings Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Anant Daksh Asthana Assignee: Manoj Kumar Create Python bindings for Streaming Linear Regression (MLlib). The MLlib example relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching
[ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601288#comment-14601288 ] Justin Uang commented on SPARK-8632: [~davies], my current plan is to switch to a synchronous model so that we can avoid deadlock. From a quick benchmark on my machine of loading pickled data and converting it to a Python object, 95% of the time is spent in cPickle and 5% on IO. I think the performance drawbacks of a synchronous model are trivial enough that the conceptual simplicity is worth it. Poor Python UDF performance because of RDD caching -- Key: SPARK-8632 URL: https://issues.apache.org/jira/browse/SPARK-8632 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0 Reporter: Justin Uang {quote} We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data: one to feed the PythonRDD, and one to join the Python lambda results with the original rows (which may contain Java objects that should be passed through). In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the cases I was working with, I had a 500-column table, and I wanted to use a Python UDF for one column, but it ended up caching all 500 columns. {quote} http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
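The pattern described in SPARK-8632 above is simply a Python UDF applied to one column of a wide table; the report is that BatchPythonEvaluation then caches the entire child RDD, all columns included, in order to make its two passes. An illustrative PySpark sketch of that usage (table shape and column names are made up):

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="python-udf-wide-table")
sqlContext = SQLContext(sc)

# Hypothetical 500-column table; only col0 is fed to the UDF, but per the
# report all 500 columns end up cached for the two-pass evaluation.
wide = sqlContext.createDataFrame(
    [tuple("v%d" % i for i in range(500))],
    ["col%d" % i for i in range(500)])

normalize = udf(lambda s: s.upper(), StringType())
result = wide.withColumn("col0_normalized", normalize(wide["col0"]))
result.select("col0", "col0_normalized").show()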