[jira] [Resolved] (SPARK-8604) Parquet data source doesn't write summary file while doing appending

2015-06-25 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-8604.
---
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

Issue resolved by pull request 6998
[https://github.com/apache/spark/pull/6998]

 Parquet data source doesn't write summary file while doing appending
 

 Key: SPARK-8604
 URL: https://issues.apache.org/jira/browse/SPARK-8604
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.1, 1.5.0


 Currently, Parquet and ORC data sources don't set their output format class, 
 as we override the output committer in Spark SQL. However, SPARK-8678 ignores 
 the user-defined output committer class while doing appends, to avoid potential 
 issues brought by direct output committers (e.g. 
 {{DirectParquetOutputCommitter}}). This makes both of these data sources 
 fall back to the default output committer retrieved from {{TextOutputFormat}}, 
 which is {{FileOutputCommitter}}. For ORC this is totally fine, since ORC itself 
 just uses {{FileOutputCommitter}}. But for Parquet, it means the summary files 
 are no longer written, because it is {{ParquetOutputCommitter}} that writes them 
 while committing the job.
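 For context, here is a minimal sketch of why the fallback committer is a plain 
 {{FileOutputCommitter}}, using only the standard Hadoop 2 MapReduce API (the 
 output path below is hypothetical; this is not Spark's internal wiring):
 {code}
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

 // No output format class is set, mirroring what the data source does today.
 val job = Job.getInstance()
 FileOutputFormat.setOutputPath(job, new Path("/tmp/parquet-out"))

 // Hadoop falls back to TextOutputFormat ...
 println(job.getOutputFormatClass)
 // ... whose getOutputCommitter returns a FileOutputCommitter, which commits task
 // output but never writes the Parquet _metadata / _common_metadata summary files.
 {code}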



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-25 Thread Sam Stoelinga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600825#comment-14600825
 ] 

Sam Stoelinga commented on SPARK-8587:
--

I also agree that this should have the same API across the different 
languages. There is already a computeCost function, but it doesn't return the 
index; the problem with predict is that it only returns the index and not the 
cost.
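
For illustration, a minimal Scala sketch of a combined call against the current 
MLlib API (the helper name predictWithCost is hypothetical; predict, 
clusterCenters and Vectors.sqdist are existing MLlib members):
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper combining what predict() and computeCost() expose separately:
// the index of the closest cluster and the squared distance (cost) to its center.
def predictWithCost(model: KMeansModel, point: Vector): (Int, Double) = {
  val index = model.predict(point)                               // closest cluster index
  val cost  = Vectors.sqdist(point, model.clusterCenters(index)) // squared Euclidean distance
  (index, cost)
}
{code}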

 Return cost and cluster index KMeansModel.predict
 -

 Key: SPARK-8587
 URL: https://issues.apache.org/jira/browse/SPARK-8587
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sam Stoelinga
Priority: Minor

 Looking at PySpark the implementation of KMeansModel.predict 
 https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102
  : 
 Currently:
 it calculates the cost of the closest cluster and returns the index only.
 My expectation:
 Easy way to let the same function or a new function to return the cost with 
 the index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8623) Some queries in spark-sql lead to NullPointerException when using Yarn

2015-06-25 Thread Bolke de Bruin (JIRA)
Bolke de Bruin created SPARK-8623:
-

 Summary: Some queries in spark-sql lead to NullPointerException 
when using Yarn
 Key: SPARK-8623
 URL: https://issues.apache.org/jira/browse/SPARK-8623
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Hadoop 2.6, Kerberos
Reporter: Bolke de Bruin


The following query was executed using spark-sql --master yarn-client on 
1.5.0-SNAPSHOT:

select * from wcs.geolite_city limit 10;

This led to the following error:

15/06/25 09:38:37 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, lxhnl008.ad.ing.net): java.lang.NullPointerException
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:693)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:442)
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:131)
at org.apache.spark.sql.sources.SqlNewHadoopRDD.getJob(SqlNewHadoopRDD.scala:83)
at org.apache.spark.sql.sources.SqlNewHadoopRDD.getConf(SqlNewHadoopRDD.scala:89)
at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:127)
at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

This does not happen in every case, i.e. some queries execute fine, and it is 
unclear why.

Using plain spark-sql (without YARN) the query executes fine as well, so the 
issue seems to lie in the communication with YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath

2015-06-25 Thread Baswaraj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baswaraj updated SPARK-8622:

Description: 
I ran into an issue where the executor is not able to pick up my configs/functions 
from my custom jar in standalone (client/cluster) deploy mode. I have used the 
spark-submit --jars option to specify all the jars and configs to be used by 
executors.

All these files are placed in the working directory of the executor, but not on 
the executor classpath. Also, the executor working directory is not on the 
executor classpath.

I am expecting the executor to find all files specified in the spark-submit 
--jars option.

In Spark 1.3.0 the executor working directory is on the executor classpath.

To successfully run my application with Spark 1.3.1+, I have to use the following 
option (conf/spark-defaults.conf):

spark.executor.extraClassPath   .

Please advise.

  was:
I ran into an issue where the executor is not able to pick up my configs/functions 
from my custom jar in standalone (client/cluster) deploy mode. I have used the 
spark-submit --jars option to specify all the jars and configs to be used by 
executors.

All these files are placed in the working directory of the executor, but not on 
the executor classpath. Also, the executor working directory is not on the 
executor classpath.

I am expecting the executor to find all files in the spark-submit --jars option 
to be available.

In Spark 1.3.0 the executor working directory is on the executor classpath.

To successfully run my application with Spark 1.3.1+, I have to use the following 
option (conf/spark-defaults.conf):

spark.executor.extraClassPath   .

Please advise.


 Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor 
 classpath
 --

 Key: SPARK-8622
 URL: https://issues.apache.org/jira/browse/SPARK-8622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1, 1.4.0
Reporter: Baswaraj

 I ran into an issue where the executor is not able to pick up my configs/functions 
 from my custom jar in standalone (client/cluster) deploy mode. I have used the 
 spark-submit --jars option to specify all the jars and configs to be used by 
 executors.
 All these files are placed in the working directory of the executor, but not on 
 the executor classpath. Also, the executor working directory is not on the 
 executor classpath.
 I am expecting the executor to find all files specified in the spark-submit 
 --jars option.
 In Spark 1.3.0 the executor working directory is on the executor classpath.
 To successfully run my application with Spark 1.3.1+, I have to use the 
 following option (conf/spark-defaults.conf):
 spark.executor.extraClassPath   .
 Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath

2015-06-25 Thread Baswaraj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baswaraj updated SPARK-8622:

Description: 
I ran into an issue where the executor is not able to pick up my configs/functions 
from my custom jar in standalone (client/cluster) deploy mode. I have used the 
spark-submit --jars option to specify all the jars and configs to be used by 
executors.

All these files are placed in the working directory of the executor, but not on 
the executor classpath. Also, the executor working directory is not on the 
executor classpath.

I am expecting the executor to find all files in the spark-submit --jars option 
to be available.

In Spark 1.3.0 the executor working directory is on the executor classpath.

To successfully run my application with Spark 1.3.1+, I have to use the following 
option (conf/spark-defaults.conf):

spark.executor.extraClassPath   .

Please advise.

  was:
I ran into an issue where the executor is not able to pick up my configs/functions 
from my custom jar in standalone (client/cluster) deploy mode. I have used the 
spark-submit --jars option to specify all the jars and configs to be used by 
executors.

All these files are placed in the working directory of the executor, but not on 
the executor classpath. Also, the executor working directory is not on the 
executor classpath.

I am expecting the executor to find all files in the spark-submit --jars option 
to be available.

In Spark 1.3.0 the executor working directory is on the executor classpath.

To successfully run my application with Spark 1.3.1+, I have to add the following 
entry in the slaves' conf/spark-defaults.conf:

spark.executor.extraClassPath   .

Please advise.


 Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor 
 classpath
 --

 Key: SPARK-8622
 URL: https://issues.apache.org/jira/browse/SPARK-8622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1, 1.4.0
Reporter: Baswaraj

 I ran into an issue where the executor is not able to pick up my configs/functions 
 from my custom jar in standalone (client/cluster) deploy mode. I have used the 
 spark-submit --jars option to specify all the jars and configs to be used by 
 executors.
 All these files are placed in the working directory of the executor, but not on 
 the executor classpath. Also, the executor working directory is not on the 
 executor classpath.
 I am expecting the executor to find all files in the spark-submit --jars option 
 to be available.
 In Spark 1.3.0 the executor working directory is on the executor classpath.
 To successfully run my application with Spark 1.3.1+, I have to use the 
 following option (conf/spark-defaults.conf):
 spark.executor.extraClassPath   .
 Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver = NPE

2015-06-25 Thread Sjoerd Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sjoerd Mulder updated SPARK-8592:
-
Component/s: Scheduler

 CoarseGrainedExecutorBackend: Cannot register with driver = NPE
 

 Key: SPARK-8592
 URL: https://issues.apache.org/jira/browse/SPARK-8592
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04, Scala 2.11, Java 8, 
Reporter: Sjoerd Mulder
Priority: Minor

 I cannot reproduce this consistently, but when submitting a job just after 
 another one has finished, it will not come up:
 {code}
 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with 
 driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
 java.lang.NullPointerException
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
   at java.lang.String.valueOf(String.java:2982)
   at 
 scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8592) CoarseGrainedExecutorBackend: Cannot register with driver = NPE

2015-06-25 Thread Sjoerd Mulder (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sjoerd Mulder updated SPARK-8592:
-
Component/s: Spark Core

 CoarseGrainedExecutorBackend: Cannot register with driver = NPE
 

 Key: SPARK-8592
 URL: https://issues.apache.org/jira/browse/SPARK-8592
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04, Scala 2.11, Java 8, 
Reporter: Sjoerd Mulder
Priority: Minor

 I cannot reproduce this consistently, but when submitting a job just after 
 another one has finished, it will not come up:
 {code}
 15/06/24 14:57:24 INFO WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
 15/06/24 14:57:24 INFO WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@10.0.7.171:39135/user/Worker
 15/06/24 14:57:24 ERROR CoarseGrainedExecutorBackend: Cannot register with 
 driver: akka.tcp://sparkDriver@172.17.0.109:47462/user/CoarseGrainedScheduler
 java.lang.NullPointerException
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef$lzycompute(AkkaRpcEnv.scala:273)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.actorRef(AkkaRpcEnv.scala:273)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEndpointRef.toString(AkkaRpcEnv.scala:313)
   at java.lang.String.valueOf(String.java:2982)
   at 
 scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(CoarseGrainedSchedulerBackend.scala:125)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.logInfo(CoarseGrainedSchedulerBackend.scala:69)
   at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:125)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:178)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:127)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:198)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:126)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
   at 
 org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:93)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8603:
---

Assignee: Apache Spark

 In Windows,Not able to create a Spark context from R studio 
 

 Key: SPARK-8603
 URL: https://issues.apache.org/jira/browse/SPARK-8603
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows, R studio
Reporter: Prakash Ponshankaarchinnusamy
Assignee: Apache Spark
 Fix For: 1.4.0

   Original Estimate: 0.5m
  Remaining Estimate: 0.5m

 In Windows, creation of the Spark context fails using the code below from RStudio:
 Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
 sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
 Error: JVM is not ready after 10 seconds
 Reason: wrong file path computed in client.R. The file separator for Windows [\] 
 is not respected by the file.path function by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600830#comment-14600830
 ] 

Apache Spark commented on SPARK-8603:
-

User 'prakashpc' has created a pull request for this issue:
https://github.com/apache/spark/pull/7012

 In Windows,Not able to create a Spark context from R studio 
 

 Key: SPARK-8603
 URL: https://issues.apache.org/jira/browse/SPARK-8603
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows, R studio
Reporter: Prakash Ponshankaarchinnusamy
 Fix For: 1.4.0

   Original Estimate: 0.5m
  Remaining Estimate: 0.5m

 In Windows, creation of the Spark context fails using the code below from RStudio:
 Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
 sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
 Error: JVM is not ready after 10 seconds
 Reason: wrong file path computed in client.R. The file separator for Windows [\] 
 is not respected by the file.path function by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8603) In Windows,Not able to create a Spark context from R studio

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8603:
---

Assignee: (was: Apache Spark)

 In Windows,Not able to create a Spark context from R studio 
 

 Key: SPARK-8603
 URL: https://issues.apache.org/jira/browse/SPARK-8603
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows, R studio
Reporter: Prakash Ponshankaarchinnusamy
 Fix For: 1.4.0

   Original Estimate: 0.5m
  Remaining Estimate: 0.5m

 In Windows, creation of the Spark context fails using the code below from RStudio:
 Sys.setenv(SPARK_HOME = "C:\\spark\\spark-1.4.0")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
 sc <- sparkR.init(master = "spark://localhost:7077", appName = "SparkR")
 Error: JVM is not ready after 10 seconds
 Reason: wrong file path computed in client.R. The file separator for Windows [\] 
 is not respected by the file.path function by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7977) Disallow println

2015-06-25 Thread Jon Alter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601827#comment-14601827
 ] 

Jon Alter commented on SPARK-7977:
--

Working on this.

 Disallow println
 

 Key: SPARK-7977
 URL: https://issues.apache.org/jira/browse/SPARK-7977
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin
  Labels: starter

 Very often we see pull requests that add println calls left over from debugging, 
 which the author forgot to remove before code review.
 We can use the regex checker to disallow println. For legitimate uses of 
 println, we can then disable the rule where they occur.
 Add to the scalastyle-config.xml file:
 {code}
 <check customId="println" level="error"
        class="org.scalastyle.scalariform.TokenChecker" enabled="true">
   <parameters><parameter name="regex">^println$</parameter></parameters>
   <customMessage><![CDATA[Are you sure you want to println? If yes, wrap
   the code block with
     // scalastyle:off println
     println(...)
     // scalastyle:on println]]></customMessage>
 </check>
 {code}
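 With that rule in place, a legitimate println would be wrapped exactly as the 
 custom message suggests, for example (illustrative message text):
 {code}
 // scalastyle:off println
 println("usage: spark-submit --class <main-class> ...")  // intentional console output
 // scalastyle:on println
 {code}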



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601900#comment-14601900
 ] 

Apache Spark commented on SPARK-8567:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7027

 Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
 --

 Key: SPARK-8567
 URL: https://issues.apache.org/jira/browse/SPARK-8567
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 1.4.1
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
  Labels: flaky-test

 It seems that tests in HiveSparkSubmitSuite fail with a timeout pretty frequently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7287) Flaky test: o.a.s.deploy.SparkSubmitSuite --packages

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601899#comment-14601899
 ] 

Apache Spark commented on SPARK-7287:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7027

 Flaky test: o.a.s.deploy.SparkSubmitSuite --packages
 

 Key: SPARK-7287
 URL: https://issues.apache.org/jira/browse/SPARK-7287
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Burak Yavuz
Priority: Critical
  Labels: flaky-test

 Error message was not helpful (did not complete within 60 seconds or 
 something).
 Observed only in master:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/2239/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/2238/
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2163/
 ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-06-25 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601923#comment-14601923
 ] 

Shay Rojansky commented on SPARK-7736:
--

The problem is simply with the YARN status for the application. If a Spark 
application throws an exception after having instantiated the SparkContext, the 
application obviously terminates but YARN lists the job as SUCCEEDED. This 
makes it hard for users to see what happened to their jobs in the YARN UI.

Let me know if this is still unclear.

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8067) Add support for connecting to Hive 1.1

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8067:
---

Assignee: (was: Apache Spark)

 Add support for connecting to Hive 1.1
 --

 Key: SPARK-8067
 URL: https://issues.apache.org/jira/browse/SPARK-8067
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8066:
---

Assignee: Apache Spark

 Add support for connecting to Hive 1.0 (0.14.1)
 ---

 Key: SPARK-8066
 URL: https://issues.apache.org/jira/browse/SPARK-8066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8067) Add support for connecting to Hive 1.1

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601789#comment-14601789
 ] 

Apache Spark commented on SPARK-8067:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7026

 Add support for connecting to Hive 1.1
 --

 Key: SPARK-8067
 URL: https://issues.apache.org/jira/browse/SPARK-8067
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8067) Add support for connecting to Hive 1.1

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8067:
---

Assignee: Apache Spark

 Add support for connecting to Hive 1.1
 --

 Key: SPARK-8067
 URL: https://issues.apache.org/jira/browse/SPARK-8067
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601788#comment-14601788
 ] 

Apache Spark commented on SPARK-8066:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7026

 Add support for connecting to Hive 1.0 (0.14.1)
 ---

 Key: SPARK-8066
 URL: https://issues.apache.org/jira/browse/SPARK-8066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8066) Add support for connecting to Hive 1.0 (0.14.1)

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8066:
---

Assignee: (was: Apache Spark)

 Add support for connecting to Hive 1.0 (0.14.1)
 ---

 Key: SPARK-8066
 URL: https://issues.apache.org/jira/browse/SPARK-8066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-06-25 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601914#comment-14601914
 ] 

Neelesh Srinivas Salian commented on SPARK-7736:


Could you add more context to the issue? 
What is the return value / output expected from the applications?



 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8643) local-cluster may not shutdown SparkContext gracefully

2015-06-25 Thread Yin Huai (JIRA)
Yin Huai created SPARK-8643:
---

 Summary: local-cluster may not shutdown SparkContext gracefully
 Key: SPARK-8643
 URL: https://issues.apache.org/jira/browse/SPARK-8643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Yin Huai


When I was debugging SPARK-8567, I found that when I was using local-cluster, 
at the end of an application, executors were first killed and then launched 
again. From the log (attached), it seems the master/driver side does not know it 
is in the shutdown process, so it detected the executor loss and then asked the 
worker to launch new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8643) local-cluster may not shutdown SparkContext gracefully

2015-06-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8643:

Attachment: HiveSparkSubmitSuite (SPARK-8368).txt

 local-cluster may not shutdown SparkContext gracefully
 --

 Key: SPARK-8643
 URL: https://issues.apache.org/jira/browse/SPARK-8643
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Yin Huai
 Attachments: HiveSparkSubmitSuite (SPARK-8368).txt


 When I was debugging SPARK-8567, I found that when I was using local-cluster, 
 at the end of an application, executors were first killed and then launched 
 again. From the log (attached), it seems the master/driver side does not know 
 it is in the shutdown process, so it detected the executor loss and then asked 
 the worker to launch new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8651) Lasso with SGD not Converging properly

2015-06-25 Thread Albert Azout (JIRA)
Albert Azout created SPARK-8651:
---

 Summary: Lasso with SGD not Converging properly
 Key: SPARK-8651
 URL: https://issues.apache.org/jira/browse/SPARK-8651
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Albert Azout


We are having issues getting Lasso with SGD to converge properly. The output 
weights are extremely large values. We have tried multiple miniBatchRatios and 
still see the same issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8372) History server shows incorrect information for application not started

2015-06-25 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602405#comment-14602405
 ] 

Carson Wang commented on SPARK-8372:


[~vanzin] The log path name may also end with an attempt id, like 
application_xxx_xxx_1.inprogress. This happens when running the app in yarn 
cluster mode. If we still need to get the app id from the log path name, the 
attempt id needs to be removed as well, if it exists.
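
For illustration only, a small Scala sketch of the parsing this implies (the 
helper name and regex are hypothetical, not the actual history server code):
{code}
// Recover the app id from an event log file name that may carry both an
// ".inprogress" suffix and a trailing attempt id (yarn-cluster mode).
def appIdFromLogName(name: String): String = {
  val base = name.stripSuffix(".inprogress")
  val WithAttempt = """(application_\d+_\d+)_\d+""".r
  base match {
    case WithAttempt(appId) => appId  // e.g. application_<ts>_<seq>_1 -> application_<ts>_<seq>
    case other => other               // no attempt id present
  }
}
{code}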

 History server shows incorrect information for application not started
 --

 Key: SPARK-8372
 URL: https://issues.apache.org/jira/browse/SPARK-8372
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Carson Wang
Priority: Minor
 Fix For: 1.4.1, 1.5.0

 Attachments: IncorrectAppInfo.png


 The history server may show an incorrect App ID for an incomplete application 
 like App ID.inprogress. This app info will never disappear even after the 
 app is completed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8620) cleanup CodeGenContext

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-8620:
--
Assignee: Wenchen Fan

 cleanup CodeGenContext
 --

 Key: SPARK-8620
 URL: https://issues.apache.org/jira/browse/SPARK-8620
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8635) improve performance of CatalystTypeConverters

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8635.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7018
[https://github.com/apache/spark/pull/7018]

 improve performance of CatalystTypeConverters
 -

 Key: SPARK-8635
 URL: https://issues.apache.org/jira/browse/SPARK-8635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

2015-06-25 Thread Albert Azout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602388#comment-14602388
 ] 

Albert Azout commented on SPARK-1859:
-

Hi, this is still an open issue for us, FYI. Are there any new resolutions on this?

 Linear, Ridge and Lasso Regressions with SGD yield unexpected results
 -

 Key: SPARK-1859
 URL: https://issues.apache.org/jira/browse/SPARK-1859
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
 Environment: OS: Ubuntu Server 12.04 x64
 PySpark
Reporter: Vlad Frolov
  Labels: algorithm, machine_learning, regression

 Issue:
 Linear Regression with SGD doesn't work as expected on any data but lpsa.dat 
 (the example one).
 Ridge Regression with SGD *sometimes* works ok.
 Lasso Regression with SGD *sometimes* works ok.
 Code example (PySpark) based on 
 http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
 {code:title=regression_example.py}
 from numpy import array
 from pyspark.mllib.regression import LinearRegressionWithSGD

 # Each record is an array of [label, feature].
 parsedData = sc.parallelize([
 array([2400., 1500.]),
 array([240., 150.]),
 array([24., 15.]),
 array([2.4, 1.5]),
 array([0.24, 0.15])
 ])
 # Build the model
 model = LinearRegressionWithSGD.train(parsedData)
 print model._coeffs
 {code}
 So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! 
 :)
 The resulting model has nan coeffs: {{array([ nan])}}.
 Furthermore, if you comment records line by line you will get:
 * [-1.55897475e+296] coeff (the first record is commented), 
 * [-8.62115396e+104] coeff (the first two records are commented),
 * etc
 It looks like the implemented regression algorithm diverges somehow.
 I get almost the same results on Ridge and Lasso.
 I've also tested these inputs in scikit-learn and it works as expected there.
 However, I'm still not sure whether it's a bug or SGD 'feature'. Should I 
 preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8588) Could not use concat with UDF in where clause

2015-06-25 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602401#comment-14602401
 ] 

Wenchen Fan commented on SPARK-8588:


cc [~marmbrus] this issue has already been fixed by 
https://github.com/apache/spark/pull/6145.

 Could not use concat with UDF in where clause
 -

 Key: SPARK-8588
 URL: https://issues.apache.org/jira/browse/SPARK-8588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Centos 7, java 1.7.0_67, scala 2.10.5, run in a spark 
 standalone cluster(or local).
Reporter: StanZhai
Assignee: Wenchen Fan
Priority: Critical

 After upgrading the cluster from Spark 1.3.1 to 1.4.0 (RC4), I encountered the 
 following exception when using concat with a UDF in the where clause: 
 {code}
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
 dataType on unresolved object, tree: 
 'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年) 
 at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
  
 at 
 scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) 
 at scala.collection.immutable.List.exists(List.scala:84) 
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
  
 at 
 org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) 
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) 
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) 
 at scala.collection.AbstractIterator.to(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) 
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) 
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) 
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
  
 at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
  
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
  
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
  
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) 
 at 
 

[jira] [Resolved] (SPARK-8237) misc function: sha2

2015-06-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8237.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6934
[https://github.com/apache/spark/pull/6934]

 misc function: sha2
 ---

 Key: SPARK-8237
 URL: https://issues.apache.org/jira/browse/SPARK-8237
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
 Fix For: 1.5.0


 sha2(string/binary, int): string
 Calculates the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and 
 SHA-512) (as of Hive 1.3.0). The first argument is the string or binary to be 
 hashed. The second argument indicates the desired bit length of the result, 
 which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 
 256). SHA-224 is supported starting from Java 8. If either argument is NULL 
 or the hash length is not one of the permitted values, the return value is 
 NULL. Example: sha2('ABC', 256) = 
 'b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78'.
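 As a quick sanity check of the semantics described above (plain JDK, not the new 
 SQL function itself):
 {code}
 import java.security.MessageDigest

 // SHA-256 of "ABC", matching the example value quoted in the description.
 val bytes = MessageDigest.getInstance("SHA-256").digest("ABC".getBytes("UTF-8"))
 val hex   = bytes.map("%02x".format(_)).mkString
 // hex == "b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78"
 {code}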



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602411#comment-14602411
 ] 

Animesh Baranawal commented on SPARK-8636:
--

So the condition should be:
if (l == null || r == null) false
else l == r
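
In code form, a minimal sketch of that comparison (the helper name is 
illustrative, not the actual patch):
{code}
// A NULL key or a NULL WHEN value never matches, in line with SQL CASE semantics.
private def equalNullAware(l: Any, r: Any): Boolean =
  if (l == null || r == null) false else l == r
{code}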

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
 private def equalNullSafe(l: Any, r: Any) = {
   if (l == null && r == null) {
     true
   } else if (l == null || r == null) {
     false
   } else {
     l == r
   }
 }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8636) CaseKeyWhen has incorrect NULL handling

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602411#comment-14602411
 ] 

Animesh Baranawal edited comment on SPARK-8636 at 6/26/15 5:13 AM:
---

So the condition should be:
if (l == null || r == null) false
else l == r ?


was (Author: animeshbaranawal):
So the condition should be :
if (l == null || r == null) false
else l == r

 CaseKeyWhen has incorrect NULL handling
 ---

 Key: SPARK-8636
 URL: https://issues.apache.org/jira/browse/SPARK-8636
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Santiago M. Mola
  Labels: starter

 CaseKeyWhen implementation in Spark uses the following equals implementation:
 {code}
 private def equalNullSafe(l: Any, r: Any) = {
   if (l == null && r == null) {
     true
   } else if (l == null || r == null) {
     false
   } else {
     l == r
   }
 }
 {code}
 Which is not correct, since in SQL, NULL is never equal to NULL (actually, it 
 is not unequal either). In this case, a NULL value in a CASE WHEN expression 
 should never match.
 For example, you can execute this in MySQL:
 {code}
 SELECT CASE NULL WHEN NULL THEN 'NULL MATCHES' ELSE 'NULL DOES NOT MATCH' END 
 FROM DUAL;
 {code}
 And the result will be 'NULL DOES NOT MATCH'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8647) Potential issues with the constant hashCode

2015-06-25 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602431#comment-14602431
 ] 

Xiangrui Meng commented on SPARK-8647:
--

All MatrixUDT instances are the same. So the hashCode should return a constant. 
`1994` is just a random number we picked. Feel free to send a PR to add 
documentation. However, this is not a bug, and I don't think it would cause 
performance issues.
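
In code form, a documented version along the lines this comment invites might 
look like the following (a sketch, not the actual patch):
{code}
// All MatrixUDT instances are interchangeable, so equality is purely by type and
// hashCode returns an arbitrary fixed constant (1994), consistent with equals.
override def equals(o: Any): Boolean = o.isInstanceOf[MatrixUDT]

override def hashCode(): Int = 1994
{code}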

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just missing code docs.
 The issue is with respect to the MatrixUDT class, in case we decide to put 
 instances of MatrixUDT into a hash-based collection.
 The hashCode function returns a constant, and even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix classes or the vector 
 class.
 If there is a reason why we have this code, we should document it properly in 
 the code so that others reading it are fine.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
 override def equals(o: Any): Boolean = {
   o match {
     case v: MatrixUDT => true
     case _ => false
   }
 }
 override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8647) Potential issues with the constant hashCode

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8647:
-
Issue Type: Improvement  (was: Bug)

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just missing code docs.
 The issue is with respect to the MatrixUDT class, in case we decide to put 
 instances of MatrixUDT into a hash-based collection.
 The hashCode function returns a constant, and even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix classes or the vector 
 class.
 If there is a reason why we have this code, we should document it properly in 
 the code so that others reading it are fine.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
 override def equals(o: Any): Boolean = {
   o match {
     case v: MatrixUDT => true
     case _ => false
   }
 }
 override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8625) Propagate user exceptions in tasks back to driver

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8625:
---

Assignee: (was: Apache Spark)

 Propagate user exceptions in tasks back to driver
 -

 Key: SPARK-8625
 URL: https://issues.apache.org/jira/browse/SPARK-8625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Tom White

 Runtime exceptions that are thrown by user code in Spark are presented to the 
 user as strings (message and stacktrace), rather than the exception object 
 itself. If the exception stores information about the error in fields then 
 these cannot be retrieved.
 Exceptions are Serializable, so it would be feasible to return the original 
 object back to the driver as the cause field in SparkException. This would 
 allow the client to retrieve information from the original exception.
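 For illustration, a sketch of the driver-side usage this would enable 
 (MyDomainException is a hypothetical user exception class, and sc is an existing 
 SparkContext):
 {code}
 // Hypothetical user exception carrying structured error information.
 class MyDomainException(val errorCode: Int) extends RuntimeException(s"error $errorCode")

 try {
   sc.parallelize(1 to 10).foreach(_ => throw new MyDomainException(42))
 } catch {
   case e: org.apache.spark.SparkException => e.getCause match {
     // With the proposed change, the original exception object would be available here.
     case m: MyDomainException => println(s"task failed with code ${m.errorCode}")
     case _                    => throw e
   }
 }
 {code}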



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8626) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601004#comment-14601004
 ] 

Subhod Lagade commented on SPARK-8626:
--

[INFO] Compiling 1 source files to /home/appadmin/disneypoc/target/classes at 
1435229668035
[ERROR] 
/home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO]  val predictions = model.predict(usersProducts)


 ALS model predict error
 ---

 Key: SPARK-8626
 URL: https://issues.apache.org/jira/browse/SPARK-8626
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601023#comment-14601023
 ] 

Subhod Lagade commented on SPARK-8627:
--

Can you help me in resolving this? usersProducts is an RDD[(Int, Int)] and it is still giving me the error.

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product,
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map { case Rating(user, product, rate) =>
 (user, product) }
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601024#comment-14601024
 ] 

Subhod Lagade commented on SPARK-8627:
--

can you help me in resolving this ??
usersProducts is a RDD(int,int) it is still giving me error

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName(Simple Application);
 val sc = new SparkContext(conf);
   val data = sc.textFile(/home/appadmin/Disney/data.csv);
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) =  Rating(user.toInt, product.toInt, rate.toDouble)  });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate)  = 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subhod Lagade updated SPARK-8627:
-
Comment: was deleted

(was: can you help me in resolving this ??
usersProducts is a RDD(int,int) it is still giving me error 
)

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName(Simple Application);
 val sc = new SparkContext(conf);
   val data = sc.textFile(/home/appadmin/Disney/data.csv);
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) =  Rating(user.toInt, product.toInt, rate.toDouble)  });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate)  = 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601082#comment-14601082
 ] 

Kousuke Saruta commented on SPARK-5768:
---

I can't change the assignee field and I don't know why.
I'll try to change it again later.

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)
Arun created SPARK-8629:
---

 Summary: R code in SparkR
 Key: SPARK-8629
 URL: https://issues.apache.org/jira/browse/SPARK-8629
 Project: Spark
  Issue Type: Question
  Components: R
Reporter: Arun
Priority: Minor


Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/18/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/19/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/20/2013   2-Feb   2013 16 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/21/2013   2-Feb   2013 25 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/22/2013   2-Feb   2013 19 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/23/2013   2-Feb   2013 17 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/24/2013   2-Feb   2013 39 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/25/2013   2-Feb   2013 23 


Code I used in R:

  data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
  factors <- unique(data$ItemNo)
  df.allitems <- data.frame()
  for (i in 1:length(factors))
  {
   data1 <- filter(data, ItemNo == factors[[i]])
   data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity) # select particular columns
   data2$date <- as.Date(data2$date, format = "%m/%d/%Y") # format the date (four-digit year)
   data3 <- data2[order(data2$date), ] # order ascending by date
   df.allitems <- rbind(data3, df.allitems)  # Append by row bind
  }

  write.csv(df.allitems, "E:/all_items.csv")

--- 
  
I have done some SparkR code:
  data1 <- read.csv("D:/Data_sale_quantity_mini.csv") # read in R
  df_1 <- createDataFrame(sqlContext, data1) # converts an R data.frame to a Spark DataFrame
  factors <- distinct(df_1) # remove duplicates

# for select I used:
  df_2 <- select(distinctDF,
                 "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity") # select action

I don't know how to:
  1) create an empty SparkR DataFrame
  2) use a for loop in SparkR
  3) change the date format
  4) find the length() of a Spark DataFrame
  5) use rbind in SparkR

Can you help me out in doing the above code in SparkR?
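Not an answer in R, but since SparkR's DataFrame API mirrors the Scala one, here is a rough Scala sketch of the same pipeline (paths and column names are taken from the post above; the external spark-csv package used for CSV I/O is an assumption, not something the post mentions):
{code}
import java.sql.Date
import java.text.SimpleDateFormat
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

// Rough sketch, not a drop-in answer: the same pipeline expressed against the
// Scala DataFrame API that SparkR wraps. Assumes the spark-csv package is on
// the classpath for CSV reading and writing.
object SalesByItemSketch {
  def run(sqlContext: SQLContext): Unit = {
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("D:/R/Data_sale_quantity.csv")

    // Parse the m/d/yyyy strings into a proper SQL date column via a UDF.
    val parseDate = udf { (s: String) =>
      new Date(new SimpleDateFormat("M/d/yyyy").parse(s).getTime)
    }

    // The per-item for loop + rbind becomes one sort over the whole frame:
    // every item's rows end up contiguous and ordered by date.
    val allItems = raw
      .withColumn("saleDate", parseDate(raw("date")))
      .select("DC_City", "Itemdescription", "ItemNo", "saleDate", "Year", "SalesQuantity")
      .orderBy("ItemNo", "saleDate")

    allItems.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("E:/all_items")
  }
}
{code}
If separate frames really are needed, unionAll is the DataFrame counterpart of rbind and count() plays the role of length(); both have SparkR counterparts of the same name.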




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/18/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/19/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/20/2013   2-Feb   2013 16 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/21/2013   2-Feb   2013 25 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/22/2013   2-Feb   2013 19 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/23/2013   2-Feb   2013 17 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/24/2013   2-Feb   2013 39 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/25/2013   2-Feb   2013 23 


Code i used in R:

  data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) 
 factors - unique(data$ItemNo) 
  df.allitems - data.frame() 
  for(i in 1:length(factors)) 
 
 { 
   data1 - filter(data, ItemNo  == factors[[i]]) 
 data2select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) 
 date2$date - as.Date(date2$date, format = %m/%d/%y)  
 data3 - data2[order(data2$date), ]  
 df.allitems - rbind(data3 , df.allitems)  # Append by row bind 
  } 

  
  write.csv(df.allitems,E:/all_items.csv) 

--- 
  
I have done some SparkR code: 
  data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R 
  df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF 
  factors - distinct(df_1) # removed duplicates 
  
#for select i used: 
  df_2 - select(distinctDF 
,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select 
action 

I dont know how to: 
  1) create a empty sparkR DF 
  2) Using for loop in SparkR 
  3) change the date format. 
  4) find the lenght() in spark df 
  5) using rbind in sparkR 
  
can you help me out in doing the above code in sparkR.


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg.   

[jira] [Assigned] (SPARK-8625) Propagate user exceptions in tasks back to driver

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8625:
---

Assignee: Apache Spark

 Propagate user exceptions in tasks back to driver
 -

 Key: SPARK-8625
 URL: https://issues.apache.org/jira/browse/SPARK-8625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Tom White
Assignee: Apache Spark

 Runtime exceptions that are thrown by user code in Spark are presented to the 
 user as strings (message and stacktrace), rather than the exception object 
 itself. If the exception stores information about the error in fields then 
 these cannot be retrieved.
 Exceptions are Serializable, so it would be feasible to return the original 
 object back to the driver as the cause field in SparkException. This would 
 allow the client to retrieve information from the original exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subhod Lagade reopened SPARK-8627:
--

usersProducts is an RDD[(Int, Int)] and it is still giving me the error.

There is some issue with model.predict.

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product,
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map { case Rating(user, product, rate) =>
 (user, product) }
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-5768.
---
Resolution: Fixed

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601098#comment-14601098
 ] 

Sean Owen commented on SPARK-5768:
--

I set it, and added you to the Committers role, which should let you change 
Assignee. I think this is all correct but note that if (unlikely) 1.4.1 is 
released without another RC then this won't be fixed for 1.4.1.

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Assignee: Rekha Joshi
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/18/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/19/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/20/2013   2-Feb   2013 16 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/21/2013   2-Feb   2013 25 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/22/2013   2-Feb   2013 19 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/23/2013   2-Feb   2013 17 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/24/2013   2-Feb   2013 39 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/25/2013   2-Feb   2013 23 


Code i used in R:

  data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) 
  factors - unique(data$ItemNo) 
  df.allitems - data.frame() 
  for(i in 1:length(factors)) 
  { 
   data1 - filter(data, ItemNo  == factors[[i]]) 
   
  data2- select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) 
# select particular columns  
   date2$date - as.Date(date2$date, format = %m/%d/%y) # format the date 
   
   data3 - data2[order(data2$date), ] # order by assending 
   df.allitems - rbind(data3 , df.allitems)  # Append by row bind 
  } 
  
  write.csv(df.allitems,E:/all_items.csv) 

--- 
  
I have done some SparkR code: 
  data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R 
  df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF 
  factors - distinct(df_1) # removed duplicates 
  
#for select i used: 
  df_2 - select(distinctDF 
,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select 
action 

I dont know how to: 
  1) create a empty sparkR DF 
  2) Using for loop in SparkR 
  3) change the date format. 
  4) find the lenght() in spark df 
  5) using rbind in sparkR 
  
can you help me out in doing the above code in sparkR.


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   

[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/18/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/19/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/20/2013   2-Feb   2013 16 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/21/2013   2-Feb   2013 25 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/22/2013   2-Feb   2013 19 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/23/2013   2-Feb   2013 17 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/24/2013   2-Feb   2013 39 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/25/2013   2-Feb   2013 23 


Code i used in R:

  data - read.csv(D:/R/Data_sale_quantity.csv ,stringsAsFactors=FALSE) 
 factors - unique(data$ItemNo) 
  df.allitems - data.frame() 
  for(i in 1:length(factors)) 
  { 
   data1 - filter(data, ItemNo  == factors[[i]]) 
 data2select(data1,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) 
 date2$date - as.Date(date2$date, format = %m/%d/%y)  
 data3 - data2[order(data2$date), ]  
 df.allitems - rbind(data3 , df.allitems)  # Append by row bind 
  } 
  
  write.csv(df.allitems,E:/all_items.csv) 

--- 
  
I have done some SparkR code: 
  data1 - read.csv(D:/Data_sale_quantity_mini.csv) # read in R 
  df_1 - createDataFrame(sqlContext, data2) # converts Rdata.frame to spark DF 
  factors - distinct(df_1) # removed duplicates 
  
#for select i used: 
  df_2 - select(distinctDF 
,DC_City,Itemdescription,ItemNo,date,Year,SalesQuantity) # select 
action 

I dont know how to: 
  1) create a empty sparkR DF 
  2) Using for loop in SparkR 
  3) change the date format. 
  4) find the lenght() in spark df 
  5) using rbind in sparkR 
  
can you help me out in doing the above code in sparkR.


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 

[jira] [Updated] (SPARK-8624) DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet

2015-06-25 Thread Rex Xiong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rex Xiong updated SPARK-8624:
-
Description: 
In 1.4.0, Parquet is read by DataFrameReader.parquet. When the ParquetRelation2
object is created, parameters is hard-coded as Map.empty[String, String], so
ParquetRelation2.shouldMergeSchemas is always true (the default value).
In previous versions, the spark.sql.hive.convertMetastoreParquet.mergeSchema
config is respected.
This bug degrades performance a lot for a folder with hundreds of parquet
files when we don't want a schema merge.

  was:
In 1.4.0, parquet is read by DataFrameReader.parquet, when creating 
ParquetRelation2 object, Map.empty[String, String] is hard-coded as 
parameters, so ParquetRelation2.shouldMergeSchemas is always true (the 
default value).
In previous version, spark.sql.hive.convertMetastoreParquet.mergeSchema config 
is respected.
This bug downgrade performance a lot for a folder with hundreds of parquet 
files and we don't want a schema merge.


 DataFrameReader doesn't respect MERGE_SCHEMA setting for Parquet
 

 Key: SPARK-8624
 URL: https://issues.apache.org/jira/browse/SPARK-8624
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Rex Xiong
  Labels: parquet

 In 1.4.0, Parquet is read by DataFrameReader.parquet. When the ParquetRelation2
 object is created, parameters is hard-coded as Map.empty[String, String], so
 ParquetRelation2.shouldMergeSchemas is always true (the default value).
 In previous versions, the spark.sql.hive.convertMetastoreParquet.mergeSchema
 config is respected.
 This bug degrades performance a lot for a folder with hundreds of parquet
 files when we don't want a schema merge.
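 An untested workaround sketch (it assumes the generic load() path, unlike DataFrameReader.parquet, still forwards reader options, and that ParquetRelation2's merge key is the literal "mergeSchema"):
{code}
// Untested sketch: bypass DataFrameReader.parquet (which, per this report,
// passes an empty options map) and go through the generic load() path with the
// mergeSchema option set explicitly. The option name is assumed from MERGE_SCHEMA.
val df = sqlContext.read
  .format("parquet")
  .option("mergeSchema", "false")
  .load("/path/to/parquet/folder")
{code}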



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola updated SPARK-8628:

Description: 
SPARK-5009 introduced the following code in AbstractSparkSQLParser:

{code}
def parse(input: String): LogicalPlan = {
// Initialize the Keywords.
lexical.initialize(reservedWords)
phrase(start)(new lexical.Scanner(input)) match {
  case Success(plan, _) => plan
  case failureOrError => sys.error(failureOrError.toString)
}
  }
{code}

The corresponding initialize method in SqlLexical is not thread-safe:

{code}
  /* This is a work around to support the lazy setting */
  def initialize(keywords: Seq[String]): Unit = {
reserved.clear()
reserved ++= keywords
  }
{code}

I'm hitting this when parsing multiple SQL queries concurrently. When one query 
parsing starts, it empties the reserved keyword list, then a race-condition 
occurs and other queries fail to parse because they recognize keywords as 
identifiers.

  was:
SPARK-5009 introduced the following code:

def parse(input: String): LogicalPlan = {
// Initialize the Keywords.
lexical.initialize(reservedWords)
phrase(start)(new lexical.Scanner(input)) match {
  case Success(plan, _) => plan
  case failureOrError => sys.error(failureOrError.toString)
}
  }

The corresponding initialize method in SqlLexical is not thread-safe:

  /* This is a work around to support the lazy setting */
  def initialize(keywords: Seq[String]): Unit = {
reserved.clear()
reserved ++= keywords
  }

I'm hitting this when parsing multiple SQL queries concurrently. When one query 
parsing starts, it empties the reserved keyword list, then a race-condition 
occurs and other queries fail to parse because they recognize keywords as 
identifiers.


 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Priority: Critical
  Labels: regression

 SPARK-5009 introduced the following code in AbstractSparkSQLParser:
 {code}
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 {code}
 The corresponding initialize method in SqlLexical is not thread-safe:
 {code}
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 {code}
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query parsing starts, it empties the reserved keyword list, then a 
 race-condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.
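 One way to picture a fix (a sketch of the quoted parse method with the shared lexer state guarded; not the actual patch in the linked PR, which may instead initialize the keywords exactly once):
{code}
def parse(input: String): LogicalPlan = lexical.synchronized {
  // Guard the initialize-then-scan sequence so a concurrent parse cannot
  // observe a half-built keyword set.
  lexical.initialize(reservedWords)
  phrase(start)(new lexical.Scanner(input)) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
{code}
 Note this serializes all parsing on the lexer lock; initializing the keyword set once, rather than on every parse call, would avoid both the race and the contention.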



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601019#comment-14601019
 ] 

Subhod Lagade commented on SPARK-8627:
--

can you help me in resolving this ??
usersProducts is a RDD(int,int) it is still giving me error 


 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName(Simple Application);
 val sc = new SparkContext(conf);
   val data = sc.textFile(/home/appadmin/Disney/data.csv);
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) =  Rating(user.toInt, product.toInt, rate.toDouble)  });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate)  = 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta closed SPARK-5768.
-
  Resolution: Fixed
   Fix Version/s: 1.5.0
  1.4.1
Target Version/s: 1.5.0

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601086#comment-14601086
 ] 

Apache Spark commented on SPARK-8628:
-

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/7015

 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Priority: Critical
  Labels: regression

 SPARK-5009 introduced the following code in AbstractSparkSQLParser:
 {code}
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 {code}
 The corresponding initialize method in SqlLexical is not thread-safe:
 {code}
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 {code}
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query parsing starts, it empties the reserved keyword list, then a 
 race-condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8628:
---

Assignee: (was: Apache Spark)

 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Priority: Critical
  Labels: regression

 SPARK-5009 introduced the following code in AbstractSparkSQLParser:
 {code}
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 {code}
 The corresponding initialize method in SqlLexical is not thread-safe:
 {code}
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 {code}
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query parsing starts, it empties the reserved keyword list, then a 
 race-condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8628:
---

Assignee: Apache Spark

 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Assignee: Apache Spark
Priority: Critical
  Labels: regression

 SPARK-5009 introduced the following code in AbstractSparkSQLParser:
 {code}
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 {code}
 The corresponding initialize method in SqlLexical is not thread-safe:
 {code}
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 {code}
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query parsing starts, it empties the reserved keyword list, then a 
 race-condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8631) MLlib predict function error

2015-06-25 Thread Subhod Lagade (JIRA)
Subhod Lagade created SPARK-8631:


 Summary: MLlib predict function error
 Key: SPARK-8631
 URL: https://issues.apache.org/jira/browse/SPARK-8631
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade


def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]
  Predict the rating of many users for many products. The output RDD has an
  element per each element in the input RDD (including all duplicates) unless a
  user or product is missing in the training set.
  usersProducts: RDD of (user, product) pairs.
  returns: RDD of Ratings.

def predict(user: Int, product: Int): Double
  Predict the rating of one user for one product.
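For reference, a minimal self-contained sketch of both overloads (the local master and toy ratings are illustrative assumptions):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Minimal sketch of the two predict overloads described above: one
// (user, product) pair at a time, or a whole RDD[(Int, Int)] of pairs in bulk.
object PredictSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("predict-sketch").setMaster("local[2]"))
    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(1, 11, 2.0), Rating(2, 10, 5.0)))
    val model = ALS.train(ratings, 3, 5, 0.01)  // rank, iterations, lambda

    // Single prediction: the (Int, Int) overload returns a Double.
    val one: Double = model.predict(1, 10)

    // Bulk prediction: the overload that takes an RDD of (user, product) pairs.
    val pairs: RDD[(Int, Int)] = ratings.map { case Rating(user, product, _) => (user, product) }
    val bulk = model.predict(pairs)

    println(s"single=$one, bulk=${bulk.count()} predictions")
    sc.stop()
  }
}
{code}
The compile error quoted in SPARK-8626/8627 shows the single-pair overload being selected; when the argument's static type is RDD[(Int, Int)], as here, the bulk overload applies.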



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8625) Propagate user exceptions in tasks back to driver

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600997#comment-14600997
 ] 

Apache Spark commented on SPARK-8625:
-

User 'tomwhite' has created a pull request for this issue:
https://github.com/apache/spark/pull/7014

 Propagate user exceptions in tasks back to driver
 -

 Key: SPARK-8625
 URL: https://issues.apache.org/jira/browse/SPARK-8625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Tom White

 Runtime exceptions that are thrown by user code in Spark are presented to the 
 user as strings (message and stacktrace), rather than the exception object 
 itself. If the exception stores information about the error in fields then 
 these cannot be retrieved.
 Exceptions are Serializable, so it would be feasible to return the original 
 object back to the driver as the cause field in SparkException. This would 
 allow the client to retrieve information from the original exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8626) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601002#comment-14601002
 ] 

Subhod Lagade commented on SPARK-8626:
--

/**
 * Created by subhod lagade on 25/06/15.
 */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._



import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;



import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating


object SparkStreamKafka {
  def main(args: Array[String]) {

val conf = new SparkConf().setAppName("Simple Application");
val sc = new SparkContext(conf);
val data = sc.textFile("/home/appadmin/Disney/data.csv");
val ratings = data.map(_.split(',') match { case Array(user, product,
rate) => Rating(user.toInt, product.toInt, rate.toDouble) });


val rank = 3;
val numIterations = 2;
val model = ALS.train(ratings,rank,numIterations,0.01);


val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product) }

// Build the recommendation model using ALS
usersProducts.foreach(println)

val predictions = model.predict(usersProducts)
}
}

 ALS model predict error
 ---

 Key: SPARK-8626
 URL: https://issues.apache.org/jira/browse/SPARK-8626
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reopened SPARK-5768:
---

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8630) Prevent from checkpointing QueueInputDStream

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8630:
---

Assignee: Apache Spark

 Prevent from checkpointing QueueInputDStream
 

 Key: SPARK-8630
 URL: https://issues.apache.org/jira/browse/SPARK-8630
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Apache Spark

 It's better to prevent checkpointing of QueueInputDStream rather than
 failing the application when recovering `QueueInputDStream`, so that people
 can find the issue as soon as possible. See SPARK-8553 for an example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/18/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/19/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/20/2013   2-Feb   2013 16 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/21/2013   2-Feb   2013 25 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/22/2013   2-Feb   2013 19 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/23/2013   2-Feb   2013 17 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/24/2013   2-Feb   2013 39 
Hyderabad   11  15013   more. Value Chana Dal 1 Kg. 
2/25/2013   2-Feb   2013 23 


Code I used in R:

  data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
  factors <- unique(data$ItemNo)
  df.allitems <- data.frame()
  for (i in 1:length(factors))
  {
   data1 <- filter(data, ItemNo == factors[[i]])
   data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity)
   data2$date <- as.Date(data2$date, format = "%m/%d/%Y")
   data3 <- data2[order(data2$date), ]
   df.allitems <- rbind(data3, df.allitems)  # Append by row bind
  }

  write.csv(df.allitems, "E:/all_items.csv")

You can see the code clearly at:
http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-tp23489.html

I have done some SparkR code:
  data1 <- read.csv("D:/Data_sale_quantity_mini.csv") # read in R
  df_1 <- createDataFrame(sqlContext, data1) # converts an R data.frame to a Spark DataFrame
  factors <- distinct(df_1) # remove duplicates

# for select I used:
  df_2 <- select(distinctDF,
                 "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity") # select action

I don't know how to:
  1) create an empty SparkR DataFrame
  2) use a for loop in SparkR
  3) change the date format
  4) find the length() of a Spark DataFrame
  5) use rbind in SparkR

Can you help me out in doing the above code in SparkR?


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 

[jira] [Resolved] (SPARK-8574) org/apache/spark/unsafe doesn't honor the java source/target versions

2015-06-25 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-8574.
--
   Resolution: Fixed
Fix Version/s: 1.4.1

 org/apache/spark/unsafe doesn't honor the java source/target versions
 -

 Key: SPARK-8574
 URL: https://issues.apache.org/jira/browse/SPARK-8574
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 1.4.1


 I built Spark using JDK 8, and the default source compatibility in the pom is 
 1.6, so I expected to be able to run Spark with JDK 7, but it fails because the 
 unsafe code doesn't seem to be honoring the source/target compatibility 
 options set in the top-level pom.
 Exception in thread main java.lang.UnsupportedClassVersionError: 
 org/apache/spark/unsafe/memory/MemoryAllocator : Unsupported major.minor 
 version 52.0
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:791)
 at 
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:392)
 at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:211)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:180)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:74)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 15/06/23 19:48:24 INFO storage.DiskBlockManager: Shutdown hook called



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8629) R code in SparkR

2015-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8629.
--
Resolution: Invalid

 R code in SparkR
 

 Key: SPARK-8629
 URL: https://issues.apache.org/jira/browse/SPARK-8629
 Project: Spark
  Issue Type: Question
  Components: R
Reporter: Arun
Priority: Minor

 Data set:  
   
 DC_City    Dc_Code  ItemNo  Itemdescription              dat         Month   Year  SalesQuantity
 Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  9/16/2012   9-Sep   2012  1
 Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  12/21/2012  12-Dec  2012  1
 Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/12/2013   1-Jan   2013  1
 Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/27/2013   1-Jan   2013  3
 Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/1/2013    2-Feb   2013  2
 Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/12/2013   2-Feb   2013  3
 Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/13/2013   2-Feb   2013  2
 Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/14/2013   2-Feb   2013  1
 Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/15/2013   2-Feb   2013  8
 Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/16/2013   2-Feb   2013  18
 Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/17/2013   2-Feb   2013  19
 Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/18/2013   2-Feb   2013  18
 Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/19/2013   2-Feb   2013  18
 Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/20/2013   2-Feb   2013  16
 Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/21/2013   2-Feb   2013  25
 Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/22/2013   2-Feb   2013  19
 Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/23/2013   2-Feb   2013  17
 Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/24/2013   2-Feb   2013  39
 Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/25/2013   2-Feb   2013  23
 Code I used in R:
   library(dplyr)  # filter() and select() below come from dplyr
   data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
   factors <- unique(data$ItemNo)
   df.allitems <- data.frame()
   for (i in 1:length(factors))
   {
     data1 <- filter(data, ItemNo == factors[[i]])
     data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity)
     data2$date <- as.Date(data2$date, format = "%m/%d/%y")
     data3 <- data2[order(data2$date), ]
     df.allitems <- rbind(data3, df.allitems)  # Append by row bind
   }

   write.csv(df.allitems, "E:/all_items.csv")
 You can see the code clearly in -
 -
 http://apache-spark-user-list.1001560.n3.nabble.com/Convert-R-code-into-SparkR-code-for-spark-1-4-version-tp23489.html
 -
   
 I have done some SparkR code:
   data1 <- read.csv("D:/Data_sale_quantity_mini.csv") # read in R
   df_1 <- createDataFrame(sqlContext, data1) # converts the R data.frame to a Spark DF
   factors <- distinct(df_1) # removed duplicates

 # for select I used:
   df_2 <- select(factors, "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity") # select action
 I don't know how to:
   1) create an empty SparkR DF
   2) use a for loop in SparkR
   3) change the date format
   4) find the length() of a Spark DF
   5) use rbind in SparkR

 Can you help me out in doing the above code in SparkR?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8626) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)
Subhod Lagade created SPARK-8626:


 Summary: ALS model predict error
 Key: SPARK-8626
 URL: https://issues.apache.org/jira/browse/SPARK-8626
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)
Subhod Lagade created SPARK-8627:


 Summary: ALS model predict error
 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade


/**
 * Created by subhod lagade on 25/06/15.
 */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._;

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._



import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Properties;



import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating


object SparkStreamKafka {
  def main(args: Array[String]) {

val conf = new SparkConf().setAppName("Simple Application");
val sc = new SparkContext(conf);
val data = sc.textFile("/home/appadmin/Disney/data.csv");
val ratings = data.map(_.split(',') match { case Array(user, product, 
rate) => Rating(user.toInt, product.toInt, rate.toDouble) });


val rank = 3;
val numIterations = 2;
val model = ALS.train(ratings,rank,numIterations,0.01);


val usersProducts = ratings.map{ case Rating(user, product, rate) => 
(user, product)}

// Build the recommendation model using ALS
usersProducts.foreach(println)

val predictions = model.predict(usersProducts)
}
}

/*
ERROR Message
[ERROR] 
/home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO]  val predictions = model.predict(usersProducts)
*/
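
For reference, a minimal sketch of the documented usage: MatrixFactorizationModel.predict has both a (user: Int, product: Int) overload and an RDD[(Int, Int)] overload, so the pair RDD needs to be well typed for the latter to be selected. The input path, rank, and iteration count below are placeholders, not a claim about the reporter's data.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

object AlsPredictSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ALS predict sketch"))

    // Assumed CSV layout: user,product,rating
    val ratings = sc.textFile("/path/to/ratings.csv").map { line =>
      val Array(user, product, rate) = line.split(',')
      Rating(user.toInt, product.toInt, rate.toDouble)
    }

    val model = ALS.train(ratings, 3, 2, 0.01)

    // Explicitly typed so the RDD overload of predict is resolved
    val usersProducts: RDD[(Int, Int)] =
      ratings.map { case Rating(user, product, _) => (user, product) }

    val predictions = model.predict(usersProducts) // RDD[Rating]
    predictions.take(5).foreach(println)

    sc.stop()
  }
}
{code}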






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-25 Thread Animesh Baranawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601008#comment-14601008
 ] 

Animesh Baranawal commented on SPARK-8621:
--

How about enclosing the column names and row names in quotes?

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}
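
For anyone trying to reproduce this, a minimal sketch in spark-shell (the column names and values below are made up; `sc` is the shell's SparkContext):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// One empty string in the "lang" column yields an empty column name in the
// pivoted result, which then fails attribute-name parsing during na.fill
val df1 = Seq(("dev", "scala"), ("ops", ""), ("dev", "python")).toDF("role", "lang")

df1.stat.crosstab("role", "lang").show()
{code}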



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subhod Lagade updated SPARK-8627:
-
Comment: was deleted

(was: usersProducts is an RDD[(Int, Int)] and it is still giving me an error.

There is some issue with model.predict.)

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate) => 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8625) Propagate user exceptions in tasks back to driver

2015-06-25 Thread Tom White (JIRA)
Tom White created SPARK-8625:


 Summary: Propagate user exceptions in tasks back to driver
 Key: SPARK-8625
 URL: https://issues.apache.org/jira/browse/SPARK-8625
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Tom White


Runtime exceptions that are thrown by user code in Spark are presented to the 
user as strings (message and stacktrace), rather than the exception object 
itself. If the exception stores information about the error in fields then 
these cannot be retrieved.

Exceptions are Serializable, so it would be feasible to return the original 
object back to the driver as the cause field in SparkException. This would 
allow the client to retrieve information from the original exception.
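
To make the proposal concrete, a hedged sketch of what driver code could do once the original exception travels back as the cause; `MyDomainException` and the dataset are hypothetical, and `sc` is an existing SparkContext:

{code}
import org.apache.spark.SparkException

class MyDomainException(val badRecord: Int)
  extends RuntimeException(s"bad record: $badRecord")  // Serializable via RuntimeException

try {
  sc.parallelize(1 to 10).map { x =>
    if (x == 7) throw new MyDomainException(x)
    x
  }.count()
} catch {
  case e: SparkException => e.getCause match {
    // Only works once the original exception object is propagated as the cause,
    // which is what this issue proposes
    case d: MyDomainException => println(s"failed on record ${d.badRecord}")
    case _ => throw e
  }
}
{code}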



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8627) ALS model predict error

2015-06-25 Thread Subhod Lagade (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subhod Lagade updated SPARK-8627:
-
Comment: was deleted

(was: Can you help me in resolving this?
usersProducts is an RDD[(Int, Int)] and it is still giving me an error.)

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate) => 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5768:
-
Assignee: Rekha Joshi

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Assignee: Rekha Joshi
Priority: Trivial
 Fix For: 1.4.1, 1.5.0


 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.
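
For context: the number on the 'Executors' tab is the storage memory limit, not the full heap. A rough sketch of the arithmetic under the 1.2-era defaults (spark.storage.memoryFraction = 0.6, safety fraction = 0.9; the 0.95 factor for JVM overhead is an assumption):

{code}
val heapGb = 40.0                     // spark.executor.memory
val usableGb = heapGb * 0.95          // Runtime.getRuntime.maxMemory sits a bit below -Xmx
val storageGb = usableGb * 0.6 * 0.9  // memoryFraction * safetyFraction
println(f"$storageGb%.1f GB")         // roughly 20 GB, in line with the "x / 20.3GB" display
{code}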



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City    Dc_Code  ItemNo  Itemdescription              dat         Month   Year  SalesQuantity
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  9/16/2012   9-Sep   2012  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  12/21/2012  12-Dec  2012  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/12/2013   1-Jan   2013  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/27/2013   1-Jan   2013  3
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/1/2013    2-Feb   2013  2
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/12/2013   2-Feb   2013  3
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/13/2013   2-Feb   2013  2
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/14/2013   2-Feb   2013  1
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/15/2013   2-Feb   2013  8
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/16/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/17/2013   2-Feb   2013  19
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/18/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/19/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/20/2013   2-Feb   2013  16
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/21/2013   2-Feb   2013  25
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/22/2013   2-Feb   2013  19
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/23/2013   2-Feb   2013  17
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/24/2013   2-Feb   2013  39
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/25/2013   2-Feb   2013  23


Code I used in R:

  library(dplyr)  # filter() and select() below come from dplyr
  data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
  factors <- unique(data$ItemNo)
  df.allitems <- data.frame()
  for (i in 1:length(factors))
  {
    data1 <- filter(data, ItemNo == factors[[i]])
    data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity) # select particular columns
    data2$date <- as.Date(data2$date, format = "%m/%d/%y") # format the date
    data3 <- data2[order(data2$date), ] # order ascending
    df.allitems <- rbind(data3, df.allitems)  # Append by row bind
  }

  write.csv(df.allitems, "E:/all_items.csv")

--- 
  
I have done some SparkR code:
  data1 <- read.csv("D:/Data_sale_quantity_mini.csv") # read in R
  df_1 <- createDataFrame(sqlContext, data1) # converts the R data.frame to a Spark DF
  factors <- distinct(df_1) # removed duplicates

# for select I used:
  df_2 <- select(factors, "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity") # select action

I don't know how to:
  1) create an empty SparkR DF
  2) use a for loop in SparkR
  3) change the date format
  4) find the length() of a Spark DF
  5) use rbind in SparkR

Can you help me out in doing the above code in SparkR?


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/16/2013   2-Feb   2013 18 
Hyderabad   11  15012   more. Value Chana Dal 1 Kg. 
2/17/2013   2-Feb   2013 19 

[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601012#comment-14601012
 ] 

Santiago M. Mola commented on SPARK-8628:
-

Here is an example of failure with Spark 1.4.0:

{code}
[1.152] failure: ``union'' expected but identifier OR found

SELECT CASE a+1 WHEN b THEN 111 WHEN c THEN 222 WHEN d THEN 333 WHEN e THEN 444 
ELSE 555 END, a-b, a FROM t1 WHERE e+d BETWEEN a+b-10 AND c+130 OR ab OR de

   ^
java.lang.RuntimeException: [1.152] failure: ``union'' expected but identifier 
OR found

SELECT CASE a+1 WHEN b THEN 111 WHEN c THEN 222 WHEN d THEN 333 WHEN e THEN 444 
ELSE 555 END, a-b, a FROM t1 WHERE e+d BETWEEN a+b-10 AND c+130 OR ab OR de

   ^
at scala.sys.package$.error(package.scala:27)
{code}

 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Priority: Critical
  Labels: regression

 SPARK-5009 introduced the following code:
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 The corresponding initialize method in SqlLexical is not thread-safe:
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query parsing starts, it empties the reserved keyword list, then a 
 race-condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8626) ALS model predict error

2015-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8626.
--
Resolution: Duplicate

... and you opened it twice. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
take care before opening a JIRA

 ALS model predict error
 ---

 Key: SPARK-8626
 URL: https://issues.apache.org/jira/browse/SPARK-8626
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8627) ALS model predict error

2015-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8627.
--
Resolution: Invalid

This is a compile error in your own code.

 ALS model predict error
 ---

 Key: SPARK-8627
 URL: https://issues.apache.org/jira/browse/SPARK-8627
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 /**
  * Created by subhod lagade on 25/06/15.
  */
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming._;
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.io.PrintStream;
 import java.net.ServerSocket;
 import java.net.Socket;
 import java.util.Properties;
 import org.apache.spark.mllib.recommendation.ALS
 import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
 import org.apache.spark.mllib.recommendation.Rating
 object SparkStreamKafka {
   def main(args: Array[String]) {
   
 val conf = new SparkConf().setAppName("Simple Application");
 val sc = new SparkContext(conf);
   val data = sc.textFile("/home/appadmin/Disney/data.csv");
   val ratings = data.map(_.split(',') match { case Array(user, product, 
 rate) => Rating(user.toInt, product.toInt, rate.toDouble) });
   
   
   val rank = 3;
   val numIterations = 2;
   val model = ALS.train(ratings,rank,numIterations,0.01);
   
   val usersProducts = ratings.map{ case Rating(user, product, rate) => 
 (user, product)}
   // Build the recommendation model using ALS
   usersProducts.foreach(println)
   val predictions = model.predict(usersProducts)
   }
 }
 /*
 ERROR Message
 [ERROR] 
 /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: 
 error: not enough arguments for method predict: (user: Int, product: 
 Int)Double.
 [INFO] Unspecified value parameter product.
 [INFO]  val predictions = model.predict(usersProducts)
 */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-25 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-8628:
---

 Summary: Race condition in AbstractSparkSQLParser.parse
 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.3.1, 1.3.0
Reporter: Santiago M. Mola
Priority: Critical


SPARK-5009 introduced the following code:

def parse(input: String): LogicalPlan = {
// Initialize the Keywords.
lexical.initialize(reservedWords)
phrase(start)(new lexical.Scanner(input)) match {
  case Success(plan, _) => plan
  case failureOrError => sys.error(failureOrError.toString)
}
  }

The corresponding initialize method in SqlLexical is not thread-safe:

  /* This is a work around to support the lazy setting */
  def initialize(keywords: Seq[String]): Unit = {
reserved.clear()
reserved ++= keywords
  }

I'm hitting this when parsing multiple SQL queries concurrently. When one query 
parsing starts, it empties the reserved keyword list, then a race-condition 
occurs and other queries fail to parse because they recognize keywords as 
identifiers.
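
One possible shape of a fix, sketched only for illustration (the actual patch may take a different approach, for example per-parser lexicals): serialize access to the shared lexical so a parse can never observe a half-initialized reserved-word set.

{code}
def parse(input: String): LogicalPlan = lexical.synchronized {
  // Initialize the keywords under the lock
  lexical.initialize(reservedWords)
  phrase(start)(new lexical.Scanner(input)) match {
    case Success(plan, _) => plan
    case failureOrError => sys.error(failureOrError.toString)
  }
}
{code}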



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8629) R code in SparkR

2015-06-25 Thread Arun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun updated SPARK-8629:

Description: 
Data set:  
  
DC_City    Dc_Code  ItemNo  Itemdescription              dat         Month   Year  SalesQuantity
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  9/16/2012   9-Sep   2012  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  12/21/2012  12-Dec  2012  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/12/2013   1-Jan   2013  1
Hyderabad  11       15010   more. Value Chana Dal 1 Kg.  1/27/2013   1-Jan   2013  3
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/1/2013    2-Feb   2013  2
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/12/2013   2-Feb   2013  3
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/13/2013   2-Feb   2013  2
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/14/2013   2-Feb   2013  1
Hyderabad  11       15011   more. Value Chana Dal 1 Kg.  2/15/2013   2-Feb   2013  8
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/16/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/17/2013   2-Feb   2013  19
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/18/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/19/2013   2-Feb   2013  18
Hyderabad  11       15012   more. Value Chana Dal 1 Kg.  2/20/2013   2-Feb   2013  16
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/21/2013   2-Feb   2013  25
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/22/2013   2-Feb   2013  19
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/23/2013   2-Feb   2013  17
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/24/2013   2-Feb   2013  39
Hyderabad  11       15013   more. Value Chana Dal 1 Kg.  2/25/2013   2-Feb   2013  23


Code I used in R:

  library(dplyr)  # filter() and select() below come from dplyr
  data <- read.csv("D:/R/Data_sale_quantity.csv", stringsAsFactors = FALSE)
  factors <- unique(data$ItemNo)
  df.allitems <- data.frame()
  for (i in 1:length(factors))
  {
    data1 <- filter(data, ItemNo == factors[[i]])
    data2 <- select(data1, DC_City, Itemdescription, ItemNo, date, Year, SalesQuantity)
    data2$date <- as.Date(data2$date, format = "%m/%d/%y")
    data3 <- data2[order(data2$date), ]
    df.allitems <- rbind(data3, df.allitems)  # Append by row bind
  }

  write.csv(df.allitems, "E:/all_items.csv")

--- 
  
I have done some SparkR code:
  data1 <- read.csv("D:/Data_sale_quantity_mini.csv") # read in R
  df_1 <- createDataFrame(sqlContext, data1) # converts the R data.frame to a Spark DF
  factors <- distinct(df_1) # removed duplicates

# for select I used:
  df_2 <- select(factors, "DC_City", "Itemdescription", "ItemNo", "date", "Year", "SalesQuantity") # select action

I don't know how to:
  1) create an empty SparkR DF
  2) use a for loop in SparkR
  3) change the date format
  4) find the length() of a Spark DF
  5) use rbind in SparkR

Can you help me out in doing the above code in SparkR?


  was:
Data set:  
  
DC_City Dc_Code ItemNo  Itemdescription dat   
Month YearSalesQuantity 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
9/16/2012   9-Sep 2012   1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
12/21/2012  12-Dec2012 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/12/2013   1-Jan   2013 1 
Hyderabad   11  15010   more. Value Chana Dal 1 Kg. 
1/27/2013   1-Jan   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/1/20132-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/12/2013   2-Feb   2013 3 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/13/2013   2-Feb   2013 2 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/14/2013   2-Feb   2013 1 
Hyderabad   11  15011   more. Value Chana Dal 1 Kg. 
2/15/2013   2-Feb   2013 8 
Hyderabad  

[jira] [Created] (SPARK-8630) Prevent from checkpointing QueueInputDStream

2015-06-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8630:
---

 Summary: Prevent from checkpointing QueueInputDStream
 Key: SPARK-8630
 URL: https://issues.apache.org/jira/browse/SPARK-8630
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu


It's better to prevent checkpointing of QueueInputDStream up front rather than failing 
the application when recovering the `QueueInputDStream`, so that people can find 
the issue as soon as possible. See SPARK-8553 for an example.
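
For illustration, a minimal sketch of the combination this proposal would reject up front (batch interval and paths are made up):

{code}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("queueStream checkpoint sketch")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/checkpoint")       // checkpointing enabled

val queue = new mutable.Queue[RDD[Int]]()
val stream = ssc.queueStream(queue)     // QueueInputDStream cannot be recovered from a checkpoint
stream.map(_ * 2).print()

ssc.start()
ssc.awaitTermination()
{code}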



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4352) Incorporate locality preferences in dynamic allocation requests

2015-06-25 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601898#comment-14601898
 ] 

Neelesh Srinivas Salian commented on SPARK-4352:


Checking to see if this has been resolved?



 Incorporate locality preferences in dynamic allocation requests
 ---

 Key: SPARK-4352
 URL: https://issues.apache.org/jira/browse/SPARK-4352
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Saisai Shao
Priority: Critical
 Attachments: Supportpreferrednodelocationindynamicallocation.pdf


 Currently, achieving data locality in Spark is difficult unless an 
 application takes resources on every node in the cluster.  
 preferredNodeLocalityData provides a sort of hacky workaround that has been 
 broken since 1.0.
 With dynamic executor allocation, Spark requests executors in response to 
 demand from the application.  When this occurs, it would be useful to look at 
 the pending tasks and communicate their location preferences to the cluster 
 resource manager. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor classpath

2015-06-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600942#comment-14600942
 ] 

Sean Owen commented on SPARK-8622:
--

I don't think that is intended or even reasonable behavior. This mechanism is 
for transferring JARs to put on the classpath, not putting arbitrary files on 
the executor.

 Spark 1.3.1 and 1.4.0 doesn't put executor working directory on executor 
 classpath
 --

 Key: SPARK-8622
 URL: https://issues.apache.org/jira/browse/SPARK-8622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1, 1.4.0
Reporter: Baswaraj

 I ran into an issue where the executor is not able to pick up my configs/functions 
 from my custom jar in standalone (client/cluster) deploy mode. I have used the 
 spark-submit --jars option to specify all my jars and configs to be used by 
 executors.
 All these files are placed in the working directory of the executor, but not on 
 the executor classpath. Also, the executor working directory is not on the 
 executor classpath.
 I am expecting the executor to find all files specified via the spark-submit 
 --jars option.
 In Spark 1.3.0 the executor working directory is on the executor classpath, so the 
 app runs successfully.
 To successfully run my application with Spark 1.3.1+, I have to use the following 
 option (conf/spark-defaults.conf):
 spark.executor.extraClassPath   .
 Please advise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8615) sql programming guide recommends deprecated code

2015-06-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600945#comment-14600945
 ] 

Sean Owen commented on SPARK-8615:
--

Sure, open a PR?

 sql programming guide recommends deprecated code
 

 Key: SPARK-8615
 URL: https://issues.apache.org/jira/browse/SPARK-8615
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Gergely Svigruha
Priority: Minor

 The Spark 1.4 sql programming guide has example code on how to use JDBC 
 tables:
 https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
 sqlContext.load("jdbc", Map(...))
 However, this code compiles with a deprecation warning, which recommends doing this instead:
  sqlContext.read.format("jdbc").options(Map(...)).load()
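
For reference, a slightly fuller sketch of the non-deprecated reader API (the URL and table name are placeholders, following the shape of the guide's example):

{code}
val jdbcDF = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql:dbserver",   // placeholder connection URL
    "dbtable" -> "schema.tablename"))      // placeholder table
  .load()
{code}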



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8642) Ungraceful failure when yarn client is not configured.

2015-06-25 Thread Juliet Hougland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliet Hougland updated SPARK-8642:
---
Attachment: yarnretries.log

Log file from a Spark job that failed because of the misconfiguration.
Counting the lines with 9 retries in it gives:
cat yarnretries.log | grep 'Already tried 9 time(s);' | wc -l
31


 Ungraceful failure when yarn client is not configured.
 --

 Key: SPARK-8642
 URL: https://issues.apache.org/jira/browse/SPARK-8642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0, 1.3.1
Reporter: Juliet Hougland
Priority: Minor
 Attachments: yarnretries.log


 When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available), 
 the yarn client will still try to submit an application. No connection to the 
 resource manager can be established. The client will try to connect 10 times 
 (with a max retry of ten), and then do that 30 more times. This takes about 5 
 minutes before an error is recorded for spark context initialization, which is 
 caused by a connect exception. I would expect that after the first 10 tries 
 fail, the initialization of the spark context should fail too. At least that is 
 what I would think given the logs. An earlier failure would be ideal/preferred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8642) Ungraceful failure when yarn client is not configured.

2015-06-25 Thread Juliet Hougland (JIRA)
Juliet Hougland created SPARK-8642:
--

 Summary: Ungraceful failure when yarn client is not configured.
 Key: SPARK-8642
 URL: https://issues.apache.org/jira/browse/SPARK-8642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.1, 1.3.0
Reporter: Juliet Hougland
Priority: Minor


When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available), the 
yarn client will still try to submit an application. No connection to the resource 
manager can be established. The client will try to connect 10 times 
(with a max retry of ten), and then do that 30 more times. This takes about 5 
minutes before an error is recorded for spark context initialization, which is 
caused by a connect exception. I would expect that after the first 10 tries 
fail, the initialization of the spark context should fail too. At least that is 
what I would think given the logs. An earlier failure would be ideal/preferred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8644) SparkException thrown due to Executor exceptions should include caller site in stack trace

2015-06-25 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-8644:
-

 Summary: SparkException thrown due to Executor exceptions should 
include caller site in stack trace
 Key: SPARK-8644
 URL: https://issues.apache.org/jira/browse/SPARK-8644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently when a job fails due to executor (or other) issues, the exception 
thrown by Spark has a stack trace which stops at the DAGScheduler EventLoop, 
which makes it hard to trace back to the user code which submitted the job. It 
should try to include the user submission stack trace.

Example exception today:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.RuntimeException: uh-oh!
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34$$anonfun$apply$mcJ$sp$1.apply(DAGSchedulerSuite.scala:851)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1637)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1095)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1285)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1276)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1275)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1275)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:749)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1486)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
{code}

Here is the part I want to include:

{code}
at org.apache.spark.rdd.RDD.count(RDD.scala:1095)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply$mcJ$sp(DAGSchedulerSuite.scala:851)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33$$anonfun$34.apply(DAGSchedulerSuite.scala:851)
at org.scalatest.Assertions$class.intercept(Assertions.scala:997)
at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply$mcV$sp(DAGSchedulerSuite.scala:850)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849)
at 
org.apache.spark.scheduler.DAGSchedulerSuite$$anonfun$33.apply(DAGSchedulerSuite.scala:849)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 

[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2015-06-25 Thread biao luo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601273#comment-14601273
 ] 

biao luo commented on SPARK-2883:
-

peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")

saveAsOrcFile and orcFile cannot be found in the Spark 1.4 source code. Why? They are 
not on DataFrame either. Where can I find this API?
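
As far as I can tell, in 1.4 the ORC support from this issue is exposed through the DataFrame reader/writer on HiveContext rather than the old saveAsOrcFile/orcFile methods; a hedged sketch (paths are placeholders, `sc` is an existing SparkContext, and it assumes a build with Hive support):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val people = hiveContext.read.json("examples/src/main/resources/people.json")

people.write.format("orc").save("people.orc")
val orcDF = hiveContext.read.format("orc").load("people.orc")
orcDF.show()
{code}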

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Critical
 Fix For: 1.4.0

 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8546) PMML export for Naive Bayes

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8546:
-
Labels:   (was: starter)

 PMML export for Naive Bayes
 ---

 Key: SPARK-8546
 URL: https://issues.apache.org/jira/browse/SPARK-8546
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 The naive Bayes section of PMML standard can be found at 
 http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to 
 generate PMML for both binomial and multinomial naive Bayes models using 
 JPMML (maybe [~vfed] can help).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8445:
-
Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, mixing the 
development process with a big feature usually causes long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add starter label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA: (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. 

[jira] [Created] (SPARK-8634) Fix flaky test StreamingListenerSuite receiver info reporting

2015-06-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8634:
---

 Summary: Fix flaky test StreamingListenerSuite receiver info 
reporting
 Key: SPARK-8634
 URL: https://issues.apache.org/jira/browse/SPARK-8634
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Shixiong Zhu
Priority: Minor


As per the unit test log in 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35754/

{code}
15/06/24 23:09:10.210 Thread-3495 INFO ReceiverTracker: Starting 1 receivers
15/06/24 23:09:10.270 Thread-3495 INFO SparkContext: Starting job: apply at 
Transformer.scala:22
...
15/06/24 23:09:14.259 ForkJoinPool-4-worker-29 INFO 
StreamingListenerSuiteReceiver: Started receiver and sleeping
15/06/24 23:09:14.270 ForkJoinPool-4-worker-29 INFO 
StreamingListenerSuiteReceiver: Reporting error and sleeping
{code}

it took at least 4 seconds to receive all the receiver events on this slow 
machine, but the `timeout` for `eventually` is only 2 seconds.

We can increase `timeout` to make this test stable.
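
For illustration, the kind of change being suggested, sketched with ScalaTest's Eventually (the collector name and the assertion are placeholders for the suite's actual listener checks):

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Give slow Jenkins workers more head room than the current 2 seconds
eventually(timeout(10.seconds), interval(20.millis)) {
  // placeholder assertion on the events gathered by the test's listener
  assert(collector.startedReceiverStreamIds.nonEmpty)
}
{code}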



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8631) MLlib predict function error

2015-06-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8631.
--
Resolution: Invalid

This is the third time you have opened this. As I explained this is not a valid 
JIRA. Please do not open any more. 

 MLlib predict function error
 

 Key: SPARK-8631
 URL: https://issues.apache.org/jira/browse/SPARK-8631
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Subhod Lagade

 def
 predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]
 Predict the rating of many users for many products. The output RDD has an 
 element per each element in the input RDD (including all duplicates) unless a 
 user or product is missing in the training set.
 usersProducts
 RDD of (user, product) pairs.
 returns
 RDD of Ratings.
 def
 predict(user: Int, product: Int): Double
 Predict the rating of one user for one product.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError

2015-06-25 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601348#comment-14601348
 ] 

Xiangrui Meng commented on SPARK-7337:
--

How large is the `minSupport`? The number of frequent itemsets grows 
exponentially as minSupport decreases. So please start with a really large 
value (close to 1.0) and gradually reduce it.
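
For illustration, a sketch of sweeping minSupport downward from a high value while watching the itemset count (the `transactions` RDD of item baskets is assumed to exist already):

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

def frequentItemsetCount(transactions: RDD[Array[String]], minSupport: Double): Long = {
  val model = new FPGrowth()
    .setMinSupport(minSupport)
    .setNumPartitions(500)
    .run(transactions)
  model.freqItemsets.count()
}

// transactions: RDD[Array[String]], assumed to be defined elsewhere.
// Start close to 1.0 and only lower the threshold while the count stays manageable.
Seq(0.9, 0.7, 0.5).foreach { s =>
  println(s"minSupport=$s -> ${frequentItemsetCount(transactions, s)} frequent itemsets")
}
{code}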

 FPGrowth algo throwing OutOfMemoryError
 ---

 Key: SPARK-7337
 URL: https://issues.apache.org/jira/browse/SPARK-7337
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
 Environment: Ubuntu
Reporter: Amit Gupta
 Attachments: FPGrowthBug.png


 When running the FPGrowth algorithm with huge data (GBs) and numPartitions=500, 
 after some time it throws an OutOfMemoryError.
 The algorithm runs correctly up to "collect at FPGrowth.scala:131", where it creates 500 
 tasks. It fails at the next stage, "flatMap at FPGrowth.scala:150", where instead of 
 500 tasks it creates some internally calculated 17 tasks.
 Please refer to the attached screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8445:
-
Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, mixing the 
development process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add starter label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA: (SPARK-7536)

h2. SparkR API for ML

* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark 

[jira] [Updated] (SPARK-6805) ML Pipeline API in SparkR

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6805:
-
Assignee: (was: Xiangrui Meng)

 ML Pipeline API in SparkR
 -

 Key: SPARK-6805
 URL: https://issues.apache.org/jira/browse/SPARK-6805
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Reporter: Xiangrui Meng

 SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API 
 in SparkR. The implementation should be similar to the pipeline API 
 implementation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8635) improve performance of CatalystTypeConverters

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8635:
---

Assignee: (was: Apache Spark)

 improve performance of CatalystTypeConverters
 -

 Key: SPARK-8635
 URL: https://issues.apache.org/jira/browse/SPARK-8635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8635) improve performance of CatalystTypeConverters

2015-06-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8635:
---

Assignee: Apache Spark

 improve performance of CatalystTypeConverters
 -

 Key: SPARK-8635
 URL: https://issues.apache.org/jira/browse/SPARK-8635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8635) improve performance of CatalystTypeConverters

2015-06-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601375#comment-14601375
 ] 

Apache Spark commented on SPARK-8635:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7018

 improve performance of CatalystTypeConverters
 -

 Key: SPARK-8635
 URL: https://issues.apache.org/jira/browse/SPARK-8635
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-06-25 Thread Peter Prettenhofer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601385#comment-14601385
 ] 

Peter Prettenhofer commented on SPARK-5133:
---

[~josephkb] definitely - will start compiling a PR for feature importance via 
decrease in impurity.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree models and tree ensemble models.
 If people are interested in this feature, I could implement it given a mentor 
 (for API decisions, etc.). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node: split, impurity gain, and 
 (weighted) number of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
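
To make the idea concrete, a rough sketch (not the eventual PR) that sums the impurity decrease per split feature over an existing MLlib tree; weighting by the number of samples reaching each node is omitted because that count is not stored on {{Node}}, and the normalization choice is an assumption:

{code}
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

// Accumulate the information gain recorded at every internal node, keyed by the
// split feature index, then normalize so the importances sum to 1.
def featureImportances(model: DecisionTreeModel): Map[Int, Double] = {
  def visit(node: Node, acc: Map[Int, Double]): Map[Int, Double] = {
    if (node.isLeaf) {
      acc
    } else {
      val feature = node.split.get.feature
      val gain = node.stats.map(_.gain).getOrElse(0.0)
      val updated = acc.updated(feature, acc.getOrElse(feature, 0.0) + gain)
      val withLeft = node.leftNode.map(visit(_, updated)).getOrElse(updated)
      node.rightNode.map(visit(_, withLeft)).getOrElse(withLeft)
    }
  }
  val raw = visit(model.topNode, Map.empty[Int, Double])
  val total = raw.values.sum
  if (total > 0.0) raw.map { case (f, g) => (f, g / total) } else raw
}
{code}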



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8445:
-
Description: 
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 
task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0]
 rather than a medium/big feature. Based on our experience, mixing the 
development process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add starter label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 
1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC].
 We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance

* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class 
probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API

* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)

h2. Model persistence

* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML

* List of issues identified during Spark 1.4 QA: (SPARK-7536)

h2. SparkR API for ML

h2. Documentation

* [Search for documentation improvements | 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

  was:
We expect to see many MLlib contributors for the 1.5 release. To scale out the 
development, we created this master list for MLlib features we plan to have in 
Spark 1.5. Please view this list as a wish list rather than a concrete plan, 
because we don't have an accurate estimate of available resources. Due to 
limited review bandwidth, features appearing on this list will get higher 
priority during code review. But feel free to suggest new items to the list in 
comments. We are experimenting with this process. Your feedback would be 
greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter 

[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2015-06-25 Thread biao luo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601275#comment-14601275
 ] 

biao luo commented on SPARK-2883:
-

peopleSchemaRDD.saveAsOrcFile("people.orc")
val orcFile = ctx.orcFile("people.orc")

saveAsOrcFile and orcFile are not in the Spark 1.4 source code. Why? I cannot find 
them on DataFrame either. Where can I find this API?
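
For reference, a minimal sketch of how ORC is written and read in Spark 1.4 through the DataFrame reader/writer on a HiveContext ({{saveAsOrcFile}}/{{orcFile}} came from earlier patches and, as far as I know, never shipped); an existing SparkContext {{sc}} and the sample data are assumptions:

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Build a small DataFrame and round-trip it through the ORC data source.
val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
people.write.format("orc").save("people.orc")
val loaded = hiveContext.read.format("orc").load("people.orc")
loaded.show()
{code}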

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Critical
 Fix For: 1.4.0

 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4127:
-
Target Version/s: 1.5.0

 Streaming Linear Regression- Python bindings
 

 Key: SPARK-4127
 URL: https://issues.apache.org/jira/browse/SPARK-4127
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Anant Daksh Asthana
Assignee: Manoj Kumar

 Create Python bindings for Streaming Linear Regression (MLlib).
 The MLlib file relevant to this issue can be found at: 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
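
For orientation, a short sketch of the existing Scala API that the Python bindings would mirror; the stream sources, the feature count, and {{ssc}} (an existing StreamingContext) are assumptions:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val numFeatures = 3
val trainingData = ssc.textFileStream("train/").map(LabeledPoint.parse)
val testData = ssc.textFileStream("test/").map(LabeledPoint.parse)

// Train continuously on the training stream and print predictions on the test stream.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
{code}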



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4127) Streaming Linear Regression- Python bindings

2015-06-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4127:
-
Assignee: Manoj Kumar

 Streaming Linear Regression- Python bindings
 

 Key: SPARK-4127
 URL: https://issues.apache.org/jira/browse/SPARK-4127
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Anant Daksh Asthana
Assignee: Manoj Kumar

 Create Python bindings for Streaming Linear Regression (MLlib).
 The MLlib file relevant to this issue can be found at: 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-06-25 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601288#comment-14601288
 ] 

Justin Uang commented on SPARK-8632:


[~davies], my current plan is to switch to a synchronous model so that we can 
avoid deadlock. From a quick benchmark on my machine of loading pickled data 
and converting it to Python objects, 95% of the time is spent in cPickle and 5% on 
IO. I think the performance drawbacks of a synchronous model are trivial enough 
that the conceptual simplicity is worth it.

 Poor Python UDF performance because of RDD caching
 --

 Key: SPARK-8632
 URL: https://issues.apache.org/jira/browse/SPARK-8632
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Justin Uang

 {quote}
 We have been running into performance problems using Python UDFs with 
 DataFrames at large scale.
 From the implementation of BatchPythonEvaluation, it looks like the goal was 
 to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
 two passes over the data. One to give to the PythonRDD, then one to join the 
 python lambda results with the original row (which may have java objects that 
 should be passed through).
 In addition, it caches all the columns, even the ones that don't need to be 
 processed by the Python UDF. In the cases I was working with, I had a 
 500-column table, and I wanted to use a Python UDF for one column, and it ended 
 up caching all 500 columns. 
 {quote}
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


