[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117918#comment-14117918
 ] 

Patrick Wendell commented on SPARK-1701:


I think that's a straw man. A closer review once there is a patch
might reveal some corner cases.

On Mon, Sep 1, 2014 at 10:59 PM, Nicholas Chammas (JIRA) wrote:


 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-09-02 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-
Attachment: Spark listener enhancement for Hive on Spark job monitor and 
statistic.docx

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Based on the Hive on Spark job status monitoring and statistics collection 
 requirements, try to enhance the Spark listener API to gather more Spark job 
 information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information

2014-09-02 Thread Chengxiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated SPARK-2633:
-
Attachment: (was: Spark listener enhancement for Hive on Spark job 
monitor and statistic.docx)

 enhance spark listener API to gather more spark job information
 ---

 Key: SPARK-2633
 URL: https://issues.apache.org/jira/browse/SPARK-2633
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Priority: Critical
  Labels: hive
 Attachments: Spark listener enhancement for Hive on Spark job monitor 
 and statistic.docx


 Based on the Hive on Spark job status monitoring and statistics collection 
 requirements, try to enhance the Spark listener API to gather more Spark job 
 information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2636) Expose job ID in JobWaiter API

2014-09-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2636.

   Resolution: Fixed
Fix Version/s: 1.2.0

 Expose job ID in JobWaiter API
 --

 Key: SPARK-2636
 URL: https://issues.apache.org/jira/browse/SPARK-2636
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
  Labels: hive
 Fix For: 1.2.0


 In Hive on Spark, we want to track Spark job status through the Spark API. The 
 basic idea is as follows:
 # create a Hive-specific Spark listener and register it to the Spark listener 
 bus.
 # the Hive-specific Spark listener generates job status from Spark listener events.
 # the Hive driver tracks job status through the Hive-specific Spark listener. 
 The current problem is that the Hive driver needs a job identifier to track a 
 specific job's status through the Spark listener, but there is no Spark API to get 
 a job identifier (like a job id) when submitting a Spark job.
 I think any other project that tries to track job status with the Spark API would 
 suffer from this as well.
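
For illustration, here is a minimal sketch of the listener half of the plan above. It is an assumption-laden example, not Hive's actual implementation (that is described in the design doc attached to SPARK-2633); the class name and the string states are hypothetical.

{code}
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Hypothetical Hive-side listener: record per-job state so the driver can poll it
// once it knows the job id -- the identifier this issue exposes via JobWaiter.
class JobStateListener extends SparkListener {
  private val jobState = mutable.Map[Int, String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
    jobState(jobStart.jobId) = "STARTED"
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = synchronized {
    jobState(jobEnd.jobId) = jobEnd.jobResult.toString
  }

  def statusOf(jobId: Int): Option[String] = synchronized { jobState.get(jobId) }
}

// Registration on the listener bus (step 1 of the plan):
// sc.addSparkListener(new JobStateListener)
{code}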



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117999#comment-14117999
 ] 

Reynold Xin commented on SPARK-2978:


[~sandyryza] instead of adding a new API, what if the Hive project just creates a 
utility function that partitions, sorts, and then walks down the sorted 
list to provide grouping similar to MR?

The reason I'm asking about this is that eventually we would want to make 
groupByKey itself support sort and spill. But it is fairly tricky to design as 
you've already pointed out, so it could take a while to finalize that API. 
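
For illustration, here is a sketch in plain Scala of the "walk down the sorted list" grouping step (just the iterator logic, not a proposed Spark API). It assumes each partition's records already arrive sorted by key, e.g. from a shuffle with a key ordering set.

{code}
import scala.collection.mutable.ArrayBuffer

// Turn a key-sorted iterator of (K, V) pairs into one (key, values) pair per run
// of equal keys, without materializing the whole partition at once.
def groupSorted[K, V](sorted: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
  new Iterator[(K, Seq[V])] {
    private val buffered = sorted.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, Seq[V]) = {
      val key = buffered.head._1
      val values = ArrayBuffer[V]()
      while (buffered.hasNext && buffered.head._1 == key) values += buffered.next()._2
      (key, values)
    }
  }

// e.g. groupSorted(Iterator(("a", 1), ("a", 2), ("b", 3))).toList
//      gives List(("a", [1, 2]), ("b", [3]))
{code}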

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple of ways that could make sense to expose this:
 * Add a new operator: groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1499) Workers continuously produce failing executors

2014-09-02 Thread Alex Burghelea (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118011#comment-14118011
 ] 

Alex Burghelea commented on SPARK-1499:
---

I also encountered the same problem. Is there any workaround/solution for this 
issue?
I am running the Spark cluster using a Vagrant setup with 4 machines. All the 
nodes behave this way.

[~adrian-wang] I can confirm that the worker nodes log the messages you 
posted.

 Workers continuously produce failing executors
 --

 Key: SPARK-1499
 URL: https://issues.apache.org/jira/browse/SPARK-1499
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Aaron Davidson

 If a node is in a bad state, such that newly started executors fail on 
 startup or first use, the Standalone Cluster Worker will happily keep 
 spawning new ones. A better behavior would be for a Worker to mark itself as 
 dead if it has had a history of continuously producing erroneous executors, 
 or else to somehow prevent a driver from re-registering executors from the 
 same machine repeatedly.
 Reported on mailing list: 
 http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E
 Relevant logs: 
 {noformat}
 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: 
 app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 
 app-20140411190649-0008/4 removed: Command exited with code 53
 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 
 disconnected, so removing it
 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 
 (already removed): Failed to create local directory (bad spark.local.dir?)
 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: 
 app-20140411190649-0008/27 on 
 worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 
 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor 
 ID app-20140411190649-0008/27 on hostPort 
 ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: 
 app-20140411190649-0008/27 is now RUNNING
 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: 
 Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 
 with 32.7 GB RAM
 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default 
 tbl=wikistats_pd
 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root  ip=unknown-ip-addr  
 cmd=get_table : db=default tbl=wikistats_pd 
 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string 
 projectcode, string pagename, i32 pageviews, i32 bytes}
 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: 
 columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, 
 string, int, int] separator=[[B@29a81175] nullstring=\N 
 lastColumnTakesRest=false
 shark 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered 
 executor: 
 Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295]
  with ID 27
 show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 
 disconnected, so removing it
 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 
 (already removed): remote Akka client disassociated
 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: 
 app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 
 app-20140411190649-0008/27 removed: Command exited with code 53
 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: 
 app-20140411190649-0008/28 on 
 worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 
 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor 
 ID app-20140411190649-0008/28 on hostPort 
 ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: 
 app-20140411190649-0008/28 is now RUNNING
 tables;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods

2014-09-02 Thread Isaias Barroso (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isaias Barroso resolved SPARK-2312.
---
Resolution: Fixed

 Spark Actors do not handle unknown messages in their receive methods
 

 Key: SPARK-2312
 URL: https://issues.apache.org/jira/browse/SPARK-2312
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kam Kasravi
Assignee: Isaias Barroso
Priority: Minor
  Labels: starter
 Fix For: 1.1.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 Per the akka documentation, an actor should provide a pattern match for all 
 messages, including _, otherwise akka.actor.UnhandledMessage will be 
 propagated. 
 Noted actors:
 MapOutputTrackerMasterActor, ClientActor, Master, Worker...
 Should minimally do a 
 logWarning(s"Received unexpected actor system event: $_") so the message info is 
 logged in the correct actor.
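
For illustration, a minimal sketch of the catch-all case being asked for, using a hypothetical actor rather than any of the Spark actors listed above:

{code}
import akka.actor.{Actor, ActorLogging}

// Handle the known protocol explicitly and log anything else, so unexpected
// messages show up in the logs of the actor that received them instead of being
// silently turned into akka.actor.UnhandledMessage.
class ExampleActor extends Actor with ActorLogging {
  def receive = {
    case "known-message" =>
      sender ! "ack"
    case other =>
      log.warning("Received unexpected actor system event: {}", other)
  }
}
{code}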



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3344) Reformat code: add blank lines

2014-09-02 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-3344:
--

 Summary: Reformat code: add blank lines
 Key: SPARK-3344
 URL: https://issues.apache.org/jira/browse/SPARK-3344
 Project: Spark
  Issue Type: Test
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Trivial


There should be blank lines between test cases, so do some code reformatting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3344) Reformat code: add blank lines

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118038#comment-14118038
 ] 

Apache Spark commented on SPARK-3344:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/2234

 Reformat code: add blank lines
 --

 Key: SPARK-3344
 URL: https://issues.apache.org/jira/browse/SPARK-3344
 Project: Spark
  Issue Type: Test
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Trivial

 There should be blank lines between test cases, so do some code reformatting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-02 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118069#comment-14118069
 ] 

Xu Zhongxing commented on SPARK-3005:
-

By "tasks themselves already died and exited", I mean that even if we do 
nothing in killTasks(), there won't be any zombie tasks left on the slaves. 
This is what I get from testing. If I'm wrong, please correct me. But the logic 
here is incomplete or inconsistent, and needs to be fixed.

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
 Cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some 
 failure cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program 
 throws an exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the Spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at 

[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-09-02 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118069#comment-14118069
 ] 

Xu Zhongxing edited comment on SPARK-3005 at 9/2/14 9:59 AM:
-

By "tasks themselves already died and exited", I mean that even if we do 
nothing in killTasks(), there won't be any zombie tasks left on the slaves. 
This is what I get from testing the Mesos fine-grained mode. If I'm wrong, 
please correct me. But the logic here is incomplete or inconsistent, and needs 
to be fixed.


was (Author: xuzhongxing):
By "tasks themselves already died and exited", I mean that even if we do 
nothing in killTasks(), there won't be any zombie tasks left on the slaves. 
This is what I get from testing. If I'm wrong, please correct me. But the logic 
here is incomplete or inconsistent, and needs to be fixed.

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
 Cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some 
 failure cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program 
 throws an exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the Spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)

[jira] [Created] (SPARK-3345) Do correct parameters for ShuffleFileGroup

2014-09-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-3345:
--

 Summary: Do correct parameters for ShuffleFileGroup
 Key: SPARK-3345
 URL: https://issues.apache.org/jira/browse/SPARK-3345
 Project: Spark
  Issue Type: Bug
Reporter: Liang-Chi Hsieh
Priority: Minor


In the newFileGroup method of class FileShuffleBlockManager, the parameters for 
creating a new ShuffleFileGroup object are in the wrong order.

Wrong: new ShuffleFileGroup(fileId, shuffleId, files)
Correct: new ShuffleFileGroup(shuffleId, fileId, files)

Because the parameters shuffleId and fileId are not used in the current code, this 
doesn't cause a problem now. However, it should be corrected for readability 
and to avoid future problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager

2014-09-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118173#comment-14118173
 ] 

Thomas Graves commented on SPARK-3168:
--

Spark is using the ServletContextHandler, which has an option to create a 
SessionManager, but one is not there by default. I never added support for it as 
many basic filters do not need it. 

I do not know anything about CAS and haven't done much with sessions 
themselves. It looks like we would have to investigate more what is really 
needed by CAS: is any session manager OK or only specific ones, and what else do 
we need to do on the backend about clearing or saving sessions, etc. I would 
consider this a new feature.
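
For reference, a hedged sketch of what enabling session support could look like on the Jetty side (Jetty 8-style API; this is an assumption about a possible approach, not a committed fix, and it leaves open all the CAS questions above):

{code}
import org.eclipse.jetty.server.session.{HashSessionManager, SessionHandler}
import org.eclipse.jetty.servlet.ServletContextHandler

// Create the context handler with session support enabled...
val handler = new ServletContextHandler(ServletContextHandler.SESSIONS)
// ...or attach a session manager to an existing handler explicitly.
handler.setSessionHandler(new SessionHandler(new HashSessionManager()))
{code}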



 The ServletContextHandler of webui lacks a SessionManager
 -

 Key: SPARK-3168
 URL: https://issues.apache.org/jira/browse/SPARK-3168
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: CAS
Reporter: meiyoula

 When I use CAS to implement single sign-on for the web UI, an exception occurs:
 {code}
 WARN  [qtp1076146544-24] / 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
 java.lang.IllegalStateException: No SessionManager
 at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
 at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
 at 
 org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:370)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
 at 
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
 at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
 at 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:744)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3347) Spark alpha doesn't compile due to SPARK-2889

2014-09-02 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3347:


 Summary: Spark alpha doesn't compile due to SPARK-2889
 Key: SPARK-3347
 URL: https://issues.apache.org/jira/browse/SPARK-3347
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
Priority: Blocker


Spark on yarn alpha doesn't compile after SPARK-2889 was merged in.

[ERROR] 
/home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43:
 not found: value SparkHadoopUtil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2014-09-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118282#comment-14118282
 ] 

Nicholas Chammas commented on SPARK-1701:
-

Oh absolutely; sorry, didn't mean to mischaracterize your recommendations. 

I think this simplistic summary should help direct an initial PR which can then 
be reviewed in more detail.

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Priority: Minor
  Labels: starter

 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118286#comment-14118286
 ] 

Sandy Ryza commented on SPARK-2978:
---

IIUC, that would require using ShuffledRDD directly.  Would we be comfortable 
taking off the DeveloperAPI tag?

Another option that would allow us to avoid making the groupBy decision would 
be exposing a repartitionAndSortWithinPartition transform.  Then Hive would 
handle the grouping on the sorted stream.

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple of ways that could make sense to expose this:
 * Add a new operator: groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3347) Spark on yarn alpha doesn't compile due to SPARK-2889

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118374#comment-14118374
 ] 

Apache Spark commented on SPARK-3347:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2236

 Spark on yarn alpha doesn't compile due to SPARK-2889
 -

 Key: SPARK-3347
 URL: https://issues.apache.org/jira/browse/SPARK-3347
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
Priority: Blocker

 Spark on yarn alpha doesn't compile after SPARK-2889 was merged in.
 [ERROR] 
 /home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43:
  not found: value SparkHadoopUtil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex gets wrong results

2014-09-02 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118397#comment-14118397
 ] 

Guoqiang Li commented on SPARK-3098:


The following RDD operations seem to have this problem: 
{{zip}}, {{zipWithIndex}}, {{zipWithUniqueId}}, {{zipPartitions}}, {{groupBy}}, {{groupByKey}}.

  In some cases, operation zipWithIndex gets wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The code to reproduce it:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  =>
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code

2014-09-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3331.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 
[https://github.com/apache/spark/pull/]

 PEP8 tests fail because they check unzipped py4j code
 -

 Key: SPARK-3331
 URL: https://issues.apache.org/jira/browse/SPARK-3331
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Minor
 Fix For: 1.2.0


 PEP8 tests run on files under ./python, but unzipped py4j code is found at 
 ./python/build/py4j. Py4J code fails style checks and can fail 
 ./dev/run-tests if this code is present locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3061) Maven build fails in Windows OS

2014-09-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118418#comment-14118418
 ] 

Andrew Or commented on SPARK-3061:
--

Here's a friendly reminder for myself to back port this into branch-1.1 after 
the release. Others please disregard this comment. :)

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Josh Rosen
Priority: Minor

 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program unzip (in directory C:\path\to\gitofspark\python): CreateProcess 
 error=2, Žw’肳‚ꂽƒtƒ@ƒ - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS

2014-09-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3061:
-
Fix Version/s: 1.2.0

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Josh Rosen
Priority: Minor
 Fix For: 1.2.0


 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program unzip (in directory C:\path\to\gitofspark\python): CreateProcess 
 error=2, Žw’肳‚ꂽƒtƒ@ƒ - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)

2014-09-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118421#comment-14118421
 ] 

Andrew Or commented on SPARK-1919:
--

Here's a reminder for myself to back port this into branch-1.1 after the 
release.

 In Windows, Spark shell cannot load classes in spark.jars (--jars)
 --

 Key: SPARK-1919
 URL: https://issues.apache.org/jira/browse/SPARK-1919
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.2.0


 Not sure what the issue is, but Spark submit does not have the same problem, 
 even if the jars specified are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)

2014-09-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1919:
-
Fix Version/s: 1.2.0

 In Windows, Spark shell cannot load classes in spark.jars (--jars)
 --

 Key: SPARK-1919
 URL: https://issues.apache.org/jira/browse/SPARK-1919
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.2.0


 Not sure what the issue is, but Spark submit does not have the same problem, 
 even if the jars specified are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118427#comment-14118427
 ] 

Reza Zadeh commented on SPARK-2885:
---

Hi Clive,
Can you please post the code you used to generate the error? I am having 
trouble reproducing this. There is a test for sparse vectors in the PR, which 
is not catching this.
Reza

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh

 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows trading off 
 computation vs accuracy, from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf
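
For context, a hedged usage sketch assuming the API shape under review in PR 1778 (a columnSimilarities method on RowMatrix that takes the similarity threshold; names may change before merge). The data and the existing SparkContext {{sc}} are assumptions:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each row is one observation; column pairs whose estimated cosine similarity
// exceeds the threshold are returned as a CoordinateMatrix of (i, j, similarity).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 4.0),
  Vectors.dense(5.0, 6.0, 0.0)))
val mat = new RowMatrix(rows)
val sims = mat.columnSimilarities(0.1)
{code}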



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Clive Cox (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clive Cox updated SPARK-2885:
-
Attachment: SimilarItemsSmallTest.java

This is a test I used. Sorry it's not cleaned up, but hopefully you can 
reproduce the error.

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows trading off 
 computation vs accuracy, from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118445#comment-14118445
 ] 

Reza Zadeh commented on SPARK-2885:
---

I don't see your code. Looking at the exception, however: when generating the 
random sparse vectors, did you remember to sort by indices? This is a 
requirement of breeze, and it looks like the example you have violates it: 
http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector
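
For illustration, a small sketch of the point above with hypothetical data, using MLlib's Vectors.sparse factory: sort the (index, value) pairs by index before building the sparse vector, since breeze expects strictly increasing indices.

{code}
import org.apache.spark.mllib.linalg.Vectors

val entries = Seq((7, 0.5), (2, 1.0), (11, 0.25))   // unsorted (index, value) pairs
val sorted  = entries.sortBy(_._1)                  // sort by index first
val vec = Vectors.sparse(16, sorted.map(_._1).toArray, sorted.map(_._2).toArray)
{code}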

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to trade off computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows trading off 
 computation vs accuracy, from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118455#comment-14118455
 ] 

Patrick Wendell commented on SPARK-3333:


Hey [~nchammas], unfortunately we couldn't reproduce this at a smaller scale.

The reproduction here is a slightly pathological case (24,000 empty tasks), so 
it's not totally clear to me that this would affect any actual workload. Also, 
by default I think Spark launches with a very small amount of driver memory, so 
it might be that there is GC happening due to an increase in the amount of 
memory required for tasks, the web UI, or other metadata, and that's why it's 
slower. It would be good to log GC data by setting 
spark.driver.extraJavaOptions to -XX:+PrintGCDetails in spark-defaults.conf.

I'll make a call soon about whether to let this block the release. If we can't 
narrow it down in time, it's not worth holding a bunch of other features for 
this... we can fix issues in a patch release shortly if we find any.
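
For reference, the kind of spark-defaults.conf entries being suggested (the values are illustrative assumptions, not recommendations):

{noformat}
spark.driver.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.memory            10g
{noformat}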

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  a = sc.parallelize(["Nick", "John", "Bob"])
  a = a.repartition(24000)
  a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 

[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118455#comment-14118455
 ] 

Patrick Wendell edited comment on SPARK-3333 at 9/2/14 6:15 PM:


Hey [~nchammas], unfortunately we couldn't reproduce this at a smaller scale.

The reproduction here is a slightly pathological case (24,000 empty tasks), so 
it's not totally clear to me that this would affect any actual workload. Also, 
by default I think Spark launches with a very small amount of driver memory, so 
it might be that there is GC happening due to an increase in the amount of 
memory required for tasks, the web UI, or other metadata, and that's why it's 
slower. It would be good to log GC data by setting 
spark.driver.extraJavaOptions to -XX:+PrintGCDetails in spark-defaults.conf, 
and to see whether launching this with something more sane (a 10GB heap, e.g.) 
still shows a difference.

I'll make a call soon about whether to let this block the release. If we can't 
narrow it down in time, it's not worth holding a bunch of other features for 
this... we can fix issues in a patch release shortly if we find any.


was (Author: pwendell):
Hey [~nchammas], unfortunately we couldn't reproduce this at a smaller scale.

The reproduction here is a slightly pathological case (24,000 empty tasks), so 
it's not totally clear to me that this would affect any actual workload. Also, 
by default I think Spark launches with a very small amount of driver memory, so 
it might be that there is GC happening due to an increase in the amount of 
memory required for tasks, the web UI, or other metadata, and that's why it's 
slower. It would be good to log GC data by setting 
spark.driver.extraJavaOptions to -XX:+PrintGCDetails in spark-defaults.conf.

I'll make a call soon about whether to let this block the release. If we can't 
narrow it down in time, it's not worth holding a bunch of other features for 
this... we can fix issues in a patch release shortly if we find any.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  a = sc.parallelize(["Nick", "John", "Bob"])
  a = a.repartition(24000)
  a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)

[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464
 ] 

Reza Zadeh commented on SPARK-2885:
---

Yes, looking at your code it looks like you need to sort the indices.
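
A minimal Scala sketch of what "sorted indices" means for MLlib sparse vectors 
(sizes and values here are made up purely for illustration):

{code}
import org.apache.spark.mllib.linalg.Vectors

// Indices must be strictly increasing; values are aligned with them by position.
val ok = Vectors.sparse(10, Array(1, 4, 7), Array(0.5, 2.0, 1.0))

// If the (index, value) pairs arrive unsorted, sort them by index first:
val pairs  = Array((7, 1.0), (1, 0.5), (4, 2.0)).sortBy(_._1)
val sorted = Vectors.sparse(10, pairs.map(_._1), pairs.map(_._2))
{code}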

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
 computation vs accuracy from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf
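
A rough Scala sketch of how the gamma/threshold tradeoff described above is 
exposed in the current PR (the entry point is taken from the PR and could 
still change before merging; {{rows}} is assumed to be prepared elsewhere):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def similarities(rows: RDD[Vector]) = {
  val mat = new RowMatrix(rows)
  val exact  = mat.columnSimilarities()     // brute force, no sampling
  val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling; pairs well below
                                            // the 0.1 threshold may be dropped
  (exact, approx)
}
{code}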



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464
 ] 

Reza Zadeh edited comment on SPARK-2885 at 9/2/14 6:20 PM:
---

Yes, looking at your code it looks like you need to sort by indices.


was (Author: rezazadeh):
Yes, looking at your code it looks like you need to sort the indices.

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
 computation vs accuracy from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3349) [Spark SQL] Incorrect partitioning after LIMIT operator

2014-09-02 Thread Eric Liang (JIRA)
Eric Liang created SPARK-3349:
-

 Summary: [Spark SQL] Incorrect partitioning after LIMIT operator
 Key: SPARK-3349
 URL: https://issues.apache.org/jira/browse/SPARK-3349
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang


Reproduced by the following example:

import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions._

val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")
series.registerTempTable("series")
results.registerTempTable("results")

sql("select * from results inner join series where results.year = series.year").count

---

java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
partitions
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
at 
org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189)
at 
org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298)
at 
org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310)
at 
org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-02 Thread David Greco (JIRA)
David Greco created SPARK-3350:
--

 Summary: Strange anomaly trying to write a SchemaRDD into an Avro 
file
 Key: SPARK-3350
 URL: https://issues.apache.org/jira/browse/SPARK-3350
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
 Environment: jdk1.7, macosx
Reporter: David Greco


I found a way to automatically save a SchemaRDD in Avro format, similarly to 
what Spark does with Parquet files.
I attached a test case to this issue. The code fails with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3347) Spark on yarn alpha doesn't compile due to SPARK-2889

2014-09-02 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3347.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Spark on yarn alpha doesn't compile due to SPARK-2889
 -

 Key: SPARK-3347
 URL: https://issues.apache.org/jira/browse/SPARK-3347
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 Spark on yarn alpha doesn't compile after SPARK-2889 was merged in.
 [ERROR] 
 /home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43:
  not found: value SparkHadoopUtil



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-02 Thread David Greco (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Greco updated SPARK-3350:
---
Attachment: AvroWriteTestCase.scala

This is the failing test case. Hope it helps.

 Strange anomaly trying to write a SchemaRDD into an Avro file
 -

 Key: SPARK-3350
 URL: https://issues.apache.org/jira/browse/SPARK-3350
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
 Environment: jdk1.7, macosx
Reporter: David Greco
 Attachments: AvroWriteTestCase.scala


 I found a way to automatically save a SchemaRDD in Avro format, similarly to 
 what Spark does with Parquet files.
 I attached a test case to this issue. The code fails with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

2014-09-02 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3052:
-
Assignee: Sandy Ryza

 Misleading and spurious FileSystem closed errors whenever a job fails while 
 reading from Hadoop
 ---

 Key: SPARK-3052
 URL: https://issues.apache.org/jira/browse/SPARK-3052
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-09-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118566#comment-14118566
 ] 

Matthew Farrellee commented on SPARK-3181:
--

Please excuse my changes to this issue; I'm not planning to work on it, but I 
can't seem to remove myself as the assignee.

 Add Robust Regression Algorithm with Huber Estimator
 

 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Assignee: Matthew Farrellee
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the error has a normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we get 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 In 1973, Huber introduced M-estimation for regression, where the M stands for 
 "maximum likelihood type". The method is resistant to outliers in the 
 response variable and has been widely used.
 The new feature for MLlib will contain 3 new files
 /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
 /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
 /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
 and one new class HuberRobustGradient in 
 /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
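
For reviewers, a small self-contained Scala sketch of the Huber loss this 
proposal is built around (delta is the usual tuning constant; this is not the 
patch's actual HuberRobustGradient code):

{code}
// Quadratic near zero, linear in the tails, so outliers get a bounded gradient
// instead of dominating the fit.
def huberLoss(r: Double, delta: Double): Double =
  if (math.abs(r) <= delta) 0.5 * r * r
  else delta * (math.abs(r) - 0.5 * delta)

// Its derivative with respect to r, which is what a Gradient implementation needs.
def huberGradient(r: Double, delta: Double): Double =
  if (math.abs(r) <= delta) r
  else delta * math.signum(r)
{code}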



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE

2014-09-02 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3351:


 Summary: Yarn YarnRMClientImpl.shutdown can be called before 
register - NPE
 Key: SPARK-3351
 URL: https://issues.apache.org/jira/browse/SPARK-3351
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves


If the SparkContext exits while it is in 
ApplicationMaster.waitForSparkContextInitialized, then YarnRMClientImpl.shutdown 
can be called before register and you get a null 
pointer exception on the uiHistoryAddress.

14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with 
FAILED (diag message: Timed out waiting for SparkContext.)
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312)
at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121)
at 
org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113)
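
A sketch of the kind of guard that would avoid this NPE (class and method names 
here are simplified stand-ins, not the actual fix that will land):

{code}
class RmClientSketch {
  @volatile private var registered = false
  private var uiHistoryAddress: String = _

  def register(historyAddress: String): Unit = {
    uiHistoryAddress = historyAddress
    registered = true
  }

  def shutdown(status: String, diagnostics: String = ""): Unit = {
    if (!registered) return  // register() never ran; nothing to unregister
    finishApplicationMaster(status, diagnostics, Option(uiHistoryAddress).getOrElse(""))
  }

  // Stand-in for the real ResourceManager call; present only so the sketch compiles.
  private def finishApplicationMaster(status: String, diag: String, url: String): Unit = ()
}
{code}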



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE

2014-09-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118574#comment-14118574
 ] 

Thomas Graves commented on SPARK-3351:
--

Note I was running on yarn alpha but I think the same applies for yarn stable.

 Yarn YarnRMClientImpl.shutdown can be called before register - NPE
 --

 Key: SPARK-3351
 URL: https://issues.apache.org/jira/browse/SPARK-3351
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 If the SparkContext exits while it is in 
 ApplicationMaster.waitForSparkContextInitialized, then YarnRMClientImpl.shutdown 
 can be called before register and you get a null 
 pointer exception on the uiHistoryAddress.
 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with 
 FAILED (diag message: Timed out waiting for SparkContext.)
 Exception in thread "main" java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121)
   at 
 org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118593#comment-14118593
 ] 

Patrick Wendell commented on SPARK-3333:


Hey [~nchammas] one other thing. If you go back to the original job that was 
causing this issue, can you see any regression? This benchmark is pathological 
enough that it would be good to see if there is actually an issue in Spark.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
 >>> a = sc.parallelize(["Nick", "John", "Bob"])
 >>> a = a.repartition(24000)
 >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
 java.nio.channels.ClosedChannelException
 at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at 

[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-09-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3181:
--
Assignee: (was: Matthew Farrellee)

 Add Robust Regression Algorithm with Huber Estimator
 

 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the error has a normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we get 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 In 1973, Huber introduced M-estimation for regression, where the M stands for 
 "maximum likelihood type". The method is resistant to outliers in the 
 response variable and has been widely used.
 The new feature for MLlib will contain 3 new files
 /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
 /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
 /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
 and one new class HuberRobustGradient in 
 /main/scala/org/apache/spark/mllib/optimization/Gradient.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE

2014-09-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118616#comment-14118616
 ] 

Thomas Graves commented on SPARK-3351:
--

Note that this failure scenario can also cause the staging directory to be 
cleaned up, but YARN will retry the application master attempt and fail since 
the directory has been deleted.

 Yarn YarnRMClientImpl.shutdown can be called before register - NPE
 --

 Key: SPARK-3351
 URL: https://issues.apache.org/jira/browse/SPARK-3351
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves

 If the SparkContext exits while it is in 
 ApplicationMaster.waitForSparkContextInitialized, then YarnRMClientImpl.shutdown 
 can be called before register and you get a null 
 pointer exception on the uiHistoryAddress.
 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with 
 FAILED (diag message: Timed out waiting for SparkContext.)
 Exception in thread "main" java.lang.NullPointerException
   at 
 org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312)
   at 
 org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121)
   at 
 org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3352) Rename Flume Polling stream to Pull Based stream

2014-09-02 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-3352:
---

 Summary: Rename Flume Polling stream to Pull Based stream
 Key: SPARK-3352
 URL: https://issues.apache.org/jira/browse/SPARK-3352
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118644#comment-14118644
 ] 

Patrick Wendell commented on SPARK-3352:


For my part, I actually found PollingStream to be nicer sounding than Pull 
Based... but maybe for others this is a worse option. 

 Rename Flume Polling stream to Pull Based stream
 

 Key: SPARK-3352
 URL: https://issues.apache.org/jira/browse/SPARK-3352
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream

2014-09-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118650#comment-14118650
 ] 

Hari Shreedharan commented on SPARK-3352:
-

Sorry, should have provided more context here. TD suggested the Pull based name 
since it was easier to understand, especially since the other was push based.



 Rename Flume Polling stream to Pull Based stream
 

 Key: SPARK-3352
 URL: https://issues.apache.org/jira/browse/SPARK-3352
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118649#comment-14118649
 ] 

Apache Spark commented on SPARK-3352:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2238

 Rename Flume Polling stream to Pull Based stream
 

 Key: SPARK-3352
 URL: https://issues.apache.org/jira/browse/SPARK-3352
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM

2014-09-02 Thread Clive Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118669#comment-14118669
 ] 

Clive Cox commented on SPARK-2885:
--

Ah. Thanks. Sorry for the confusion.

 All-pairs similarity via DIMSUM
 ---

 Key: SPARK-2885
 URL: https://issues.apache.org/jira/browse/SPARK-2885
 Project: Spark
  Issue Type: New Feature
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Attachments: SimilarItemsSmallTest.java


 Build all-pairs similarity algorithm via DIMSUM. 
 Given a dataset of sparse vector data, the all-pairs similarity problem is to 
 find all similar vector pairs according to a similarity function such as 
 cosine similarity, and a given similarity score threshold. Sometimes, this 
 problem is called a “similarity join”.
 The brute force approach of considering all pairs quickly breaks, since it 
 scales quadratically. For example, for a million vectors, it is not feasible 
 to check all roughly trillion pairs to see if they are above the similarity 
 threshold. Having said that, there exist clever sampling techniques to focus 
 the computational effort on those pairs that are above the similarity 
 threshold, which makes the problem feasible.
 DIMSUM has a single parameter (called gamma) to tradeoff computation time vs 
 accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of 
 computation vs accuracy from low computation to high accuracy. For a very 
 large gamma, all cosine similarities are computed exactly with no sampling.
 Current PR:
 https://github.com/apache/spark/pull/1778
 Justification for adding to MLlib:
 - All-pairs similarity is missing from MLlib and has been requested several 
 times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see 
 https://github.com/apache/spark/pull/1778#issuecomment-51300825)
 - Algorithm is used in large-scale production at Twitter. e.g. see 
 https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco
  . Twitter also open-sourced their version in scalding: 
 https://github.com/twitter/scalding/pull/833
 - When used with the gamma parameter set high, this algorithm becomes the 
 normalized gramian matrix, which is useful in RowMatrix alongside the 
 computeGramianMatrix method already in RowMatrix
 More details about usage at Twitter: 
 https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
 For correctness proof, see Theorem 4.3 in 
 http://stanford.edu/~rezab/papers/dimsum.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118695#comment-14118695
 ] 

Josh Rosen commented on SPARK-3333:
---

I was unable to reproduce this on a cluster with two *r3*.xlarge instances 
(trying on *m3* shortly) using the following script:

{code}
from pyspark import SparkContext

sc = SparkContext(appName="test")
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
parallelism = sc.defaultParallelism
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, parallelism).take(1)
{code}

In both 1.0.2 and 1.1.0-rc3, the job ran in ~1 minute 10 seconds (with 
essentially no timing difference between the two versions).  I'm going to try 
this again on m3 instances and double-check both configurations.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
 >>> a = sc.parallelize(["Nick", "John", "Bob"])
 >>> a = a.repartition(24000)
 >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at 

[jira] [Commented] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118701#comment-14118701
 ] 

Patrick Wendell commented on SPARK-3330:


I reviewed the patch and it seemed to me like `mvn clean` already correctly 
clears the assembly target directory regardless of the profile used (?)

 Successive test runs with different profiles fail SparkSubmitSuite
 --

 Key: SPARK-3330
 URL: https://issues.apache.org/jira/browse/SPARK-3330
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for a while:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console
 One common cause is that on the second and subsequent runs of mvn clean 
 test, at least two assembly JARs will exist in assembly/target. Because 
 assembly is not a submodule of parent, mvn clean is not invoked for 
 assembly. The presence of two assembly jars causes spark-submit to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3330:
---
Comment: was deleted

(was: I reviewed the patch and it seemed to me like `mvn clean` already 
correctly clears the assembly target directory regardless of the profile used 
(?))

 Successive test runs with different profiles fail SparkSubmitSuite
 --

 Key: SPARK-3330
 URL: https://issues.apache.org/jira/browse/SPARK-3330
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for a while:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console
 One common cause is that on the second and subsequent runs of mvn clean 
 test, at least two assembly JARs will exist in assembly/target. Because 
 assembly is not a submodule of parent, mvn clean is not invoked for 
 assembly. The presence of two assembly jars causes spark-submit to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118735#comment-14118735
 ] 

Nicholas Chammas commented on SPARK-3333:
-

{quote}
If we can't narrow it down in time, it's not worth holding a bunch of other 
features for this... we can fix issues in a patch release shortly if we find 
any.
{quote}

Sounds fine to me. I'm trying to retrace how I originally stumbled on this but 
am currently tied up with other stuff. I'll report back when I can. It looks 
like the issue is limited to some specific setup, which Josh seems to be slowly 
homing in on, which is great.

Just FYI Josh, if it matters, I've been running the little benchmarks reported 
here in the shell, not using spark-submit.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
 >>> a = sc.parallelize(["Nick", "John", "Bob"])
 >>> a = a.repartition(24000)
 >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at 

[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118748#comment-14118748
 ] 

Apache Spark commented on SPARK-1981:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2239

 Add AWS Kinesis streaming support
 -

 Key: SPARK-1981
 URL: https://issues.apache.org/jira/browse/SPARK-1981
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.1.0


 Add AWS Kinesis support to Spark Streaming.
 Initial discussion occurred here:  https://github.com/apache/spark/pull/223
 I discussed this with Parviz from AWS recently and we agreed that I would 
 take this over.
 Look for a new PR that takes into account all the feedback from the earlier 
 PR including spark-1.0-compliant implementation, AWS-license-aware build 
 support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)

2014-09-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3353:
--

 Summary: Stage id monotonicity (parent stage should have lower 
stage id)
 Key: SPARK-3353
 URL: https://issues.apache.org/jira/browse/SPARK-3353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Reynold Xin


The way stage IDs are generated is that parent stages actually have higher 
stage ids. This is very confusing because parent stages get scheduled and 
executed first.

We should reverse that order so that the scheduling timeline of stages (absent 
failures) is monotonic, i.e. stages that are executed first have lower stage 
ids.
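
A toy way to state the invariant being asked for (hypothetical, simplified 
Stage shape, not Spark's internal class):

{code}
case class Stage(id: Int, parents: Seq[Stage])

// Desired invariant: every ancestor stage has a strictly smaller id, so a
// failure-free execution ordered by id runs parent stages first.
def idsMonotonic(s: Stage): Boolean =
  s.parents.forall(p => p.id < s.id && idsMonotonic(p))
{code}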




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API

2014-09-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118759#comment-14118759
 ] 

Xuefu Zhang commented on SPARK-2895:


Hi [~rxin], could you review [~chengxiang li]'s patch when you get a chance? 
This is needed for our hive-on-spark milestone #1. Thanks.

 Support mapPartitionsWithContext in Spark Java API
 --

 Key: SPARK-2895
 URL: https://issues.apache.org/jira/browse/SPARK-2895
 Project: Spark
  Issue Type: New Feature
  Components: Java API
Reporter: Chengxiang Li
Assignee: Chengxiang Li
  Labels: hive

 This is a requirement from Hive on Spark: mapPartitionsWithContext only 
 exists in the Spark Scala API, and we expect to access it from the Spark Java 
 API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in the 
 mapPartitions closure need to get the taskId.
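
For context, a minimal sketch of the Scala-side API being referred to; the 
requested Java API would expose the same TaskContext handle (the helper below 
is illustrative, not part of Spark):

{code}
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Tag every record with the partition id and task attempt id that produced it.
def tagWithTaskInfo[T](rdd: RDD[T]): RDD[(Int, Long, T)] =
  rdd.mapPartitionsWithContext { (ctx: TaskContext, iter: Iterator[T]) =>
    iter.map(x => (ctx.partitionId, ctx.attemptId, x))
  }
{code}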



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3354) Add LENGTH and DATALENGTH functions to Spark SQL

2014-09-02 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3354:
---

 Summary: Add LENGTH and DATALENGTH functions to Spark SQL
 Key: SPARK-3354
 URL: https://issues.apache.org/jira/browse/SPARK-3354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nicholas Chammas


It's very common when working on data sets of strings to want to know the 
lengths of the strings you are analyzing. Say I have some Tweets and want to 
find the average length of a Tweet by language.

{code}
SELECT language, AVG(LEN(tweet)) AS avg_length
FROM tweets
GROUP BY language
ORDER BY avg_length DESC;
{code}

This is currently not possible because Spark SQL doesn't have a {{LEN()}} 
function.

Another common function that would be useful is one that gives the size of the 
data item in bytes. This can be handy when moving data from Spark SQL to 
another system and you need to know how to size the receiving fields 
appropriately.

*Proposal*

* Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This 
function returns the number of characters in a string expression.
* Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. 
This function returns the number of bytes in any expression.

*Special care* must be given to the following cases:
* multi-byte characters
* {{NULL}}
* trailing spaces

*Examples*
These are suggestions for how these 2 functions should work.
{code}
LEN('Hello') -> 5
LEN('안녕') -> 2
LEN('Hello 안녕') -> 8
LEN(NULL) -> NULL
LEN('') -> 0
LEN('Bob  ') -> 3
{code}

In this last example with {{'Bob  '}}, trailing spaces are ignored. This 
matches the [behavior of SQL 
Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt 
to include the spaces.

{code}
DATALEN('Hello') -> 5
DATALEN('안녕') -> 4
DATALEN('Hello 안녕') -> 16
DATALEN(NULL) -> NULL
DATALEN('') -> 0
DATALEN('Bob  ') -> 5
{code}

Note here how mixing English and Korean characters causes every character to be 
interpreted as a 2-byte-wide character. Dunno if this is sane; it may depend on 
Scala or JVM details that I wouldn't know about at the moment.
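
Until built-ins exist, a rough sketch of a stopgap using a 1.1-style SQLContext 
registerFunction (names and NULL/encoding handling are illustrative only; note 
the byte counts below use UTF-8, unlike the SQL Server-style examples above):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Character count and UTF-8 byte count as temporary UDFs.
sqlContext.registerFunction("LEN", (s: String) => s.length)
sqlContext.registerFunction("DATALEN", (s: String) => s.getBytes("UTF-8").length)

// Assumes a "tweets" table registered elsewhere, as in the example above.
sqlContext.sql(
  "SELECT language, AVG(LEN(tweet)) AS avg_length FROM tweets GROUP BY language")
{code}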



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118775#comment-14118775
 ] 

Josh Rosen commented on SPARK-3333:
---

Tried this on an m3.xlarge cluster (1 master, 1 worker) using the same script 
from my previous comment.  On this cluster, 1.1.0 was noticeably slower.

In v1.0.2-rc1, it took ~220 seconds.
In v1.1.0-rc3, it took ~360 seconds.

Going to run again and keep the logs to see if I can spot where the extra time 
is coming from.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker

 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least  45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
 >>> a = sc.parallelize(["Nick", "John", "Bob"])
 >>> a = a.repartition(24000)
 >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
 java.nio.channels.ClosedChannelException
 at 

[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS and 'LAST' for sql

2014-09-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118796#comment-14118796
 ] 

Nicholas Chammas commented on SPARK-3176:
-

Reposting a [comment I made on the 
PR|https://github.com/apache/spark/pull/2099#discussion_r16691086] here, since 
it relates to {{POWER}}'s behavior. 

[~marmbrus] and [~xinyunh] - We should probably document that behavior in this 
JIRA issue for starters, and then also in the appropriate docs.

{quote}
Microsoft has some good documentation for how SQL Server handles these things. 
As an established and very popular product, SQL Server could provide y'all with 
a good reference implementation for this behavior.

From their [documentation on 
{{POWER()}}|http://msdn.microsoft.com/en-us/library/ms174276.aspx]:
 Returns the same type as submitted in _float_expression_. For example, if a 
 *decimal*(2,0) is submitted as _float_expression_, the result returned is 
 *decimal*(2,0).

There are a few good examples that follow.

Microsoft also has some good documentation on [how precision, scale, and length 
are calculated for results of arithmetic 
operations|http://msdn.microsoft.com/en-us/library/ms190476.aspx].
{quote}

 Implement 'POWER', 'ABS and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function POWER and ABS and  the analytic 
 function last to return a subset of the rows satisfying a query within 
 spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3333:
--
Attachment: spark--logs.zip

I've attached some logs from my most recent run, showing a ~100 second 
difference between the two releases.

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker
 Attachments: spark--logs.zip


 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  a = sc.parallelize(["Nick", "John", "Bob"])
  a = a.repartition(24000)
  a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
 java.nio.channels.ClosedChannelException
 at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
 at 

[jira] [Updated] (SPARK-3355) Allow running maven tests in run-tests

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3355:
---
Summary: Allow running maven tests in run-tests  (was: Allow running maven 
tests in run-tests and run-tests-jenkins)

 Allow running maven tests in run-tests
 --

 Key: SPARK-3355
 URL: https://issues.apache.org/jira/browse/SPARK-3355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell

 We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides 
 whether to run sbt or maven.
 This would allow us to simplify our build matrix in Jenkins... currently the 
 maven builds run a totally different thing than the normal run-tests builds.
 The maven build currently does something like this:
 {code}
 mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package
 mvn test -Pprofile1 -Pprofile2 ... --fail-at-end
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS and 'LAST' for sql

2014-09-02 Thread Xinyun Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118822#comment-14118822
 ] 

Xinyun Huang commented on SPARK-3176:
-

For now, the POWER function will implicitly convert its input to Double and 
return the result as a Double.
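For illustration, a minimal sketch of what that implies for callers (the table and column names are made up, and this assumes a SQLContext named {{sqlContext}} with the new function available):
{code}
// Hypothetical example: "t" is a registered table with an integer column "x".
// Per the comment above, POWER coerces its input to Double, so the result
// column is a Double even though "x" is an Int.
val rows = sqlContext.sql("SELECT POWER(x, 2) FROM t").collect()
rows.foreach(r => println(r.getDouble(0)))   // values come back as Double
{code}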

 Implement 'POWER', 'ABS and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function POWER and ABS and  the analytic 
 function last to return a subset of the rows satisfying a query within 
 spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3355) Allow running maven tests in run-tests and run-tests-jenkins

2014-09-02 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3355:
--

 Summary: Allow running maven tests in run-tests and 
run-tests-jenkins
 Key: SPARK-3355
 URL: https://issues.apache.org/jira/browse/SPARK-3355
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell


We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides whether 
to run sbt or maven.

This would allow us to simplify our build matrix in Jenkins... currently the 
maven builds run a totally different thing than the normal run-tests builds.

The maven build currently does something like this:

{code}
mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package
mvn test -Pprofile1 -Pprofile2 ... --fail-at-end
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite

2014-09-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3330.
--
Resolution: Won't Fix

It would be more suitable to change Jenkins to run mvn clean && mvn ... 
package to address this, if anything. It's not yet clear that this is the cause 
of the failure in the same test in Jenkins, though.

 Successive test runs with different profiles fail SparkSubmitSuite
 --

 Key: SPARK-3330
 URL: https://issues.apache.org/jira/browse/SPARK-3330
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for a while:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console
 One common cause is that on the second and subsequent runs of mvn clean 
 test, at least two assembly JARs will exist in assembly/target. Because 
 assembly is not a submodule of parent, mvn clean is not invoked for 
 assembly. The presence of two assembly jars causes spark-submit to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118826#comment-14118826
 ] 

Josh Rosen commented on SPARK-3333:
---

360/220 is approximately 1.6.  Using Splunk, I computed the average task 
execution time from these logs.  It looks like 1.1.0 took around 60ms per task, 
while 1.0.2 only took 37ms, and 60/37 is also about 1.6 (the standard 
deviations appear to be in the same ballpark, too).  It seems that the tasks 
themselves are running slower on 1.1.0, at least according to these logs.
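For reference, the arithmetic behind those two ratios (nothing beyond the numbers already quoted above):
{code}
// Ratios quoted in the comment above:
val wallClockRatio = 360.0 / 220.0   // ~1.64: total runtime, 1.1.0-rc3 vs 1.0.2-rc1
val perTaskRatio   = 60.0 / 37.0     // ~1.62: average task time from the attached logs
// Both come out around 1.6, consistent with a per-task slowdown accounting for the gap.
{code}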

 Large number of partitions causes OOM
 -

 Key: SPARK-3333
 URL: https://issues.apache.org/jira/browse/SPARK-3333
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
Reporter: Nicholas Chammas
Priority: Blocker
 Attachments: spark--logs.zip


 Here’s a repro for PySpark:
 {code}
 a = sc.parallelize(["Nick", "John", "Bob"])
 a = a.repartition(24000)
 a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 {code}
 This code runs fine on 1.0.2. It returns the following result in just over a 
 minute:
 {code}
 [(4, 'NickJohn')]
 {code}
 However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
 runs for a very, very long time (at least 45 min) and then fails with 
 {{java.lang.OutOfMemoryError: Java heap space}}.
 Here is a stack trace taken from a run on 1.1.0-rc2:
 {code}
  a = sc.parallelize(["Nick", "John", "Bob"])
  a = a.repartition(24000)
  a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
 heart beats: 175143ms exceeds 45000ms
 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
 heart beats: 175359ms exceeds 45000ms
 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
 heart beats: 173061ms exceeds 45000ms
 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
 heart beats: 176816ms exceeds 45000ms
 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
 heart beats: 182241ms exceeds 45000ms
 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
 BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
 heart beats: 178406ms exceeds 45000ms
 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
 thread-3
 java.lang.OutOfMemoryError: Java heap space
 at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
 at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
 at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
 at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at 
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at 
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
 at 
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR 
 SendingConnection: Exception while reading SendingConnection to 
 

[jira] [Updated] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2686:

Target Version/s: 1.2.0  (was: 1.1.1)

 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 0.9.1, 0.9.2, 1.0.0, 1.1.0, 1.1.1
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
   val testData: SchemaRDD = sqlc.sparkContext.parallelize(
 (1 to 100).map(i => TestData(i, i.toString)))
   testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc 
 limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql
 hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17:
 SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main 
 classes changed.  The testing suites affected are ConstantFolding and  
 ExpressionEvaluation.
 In addition some ad-hoc testing was done as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2686:

Affects Version/s: (was: 1.1.1)
   (was: 0.9.2)
   (was: 1.1.0)
   (was: 0.9.1)
   (was: 1.0.0)

 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
   val testData: SchemaRDD = sqlc.sparkContext.parallelize(
 (1 to 100).map(i => TestData(i, i.toString)))
   testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc 
 limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql
 hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17:
 SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main 
 classes changed.  The testing suites affected are ConstantFolding and  
 ExpressionEvaluation.
 In addition some ad-hoc testing was done as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3354) Add LENGTH and DATALENGTH functions to Spark SQL

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3354.
-
Resolution: Duplicate

Closing this as a duplicate of [SPARK-2686].  Can you make your comments there 
or on the linked PR?  Thanks!

 Add LENGTH and DATALENGTH functions to Spark SQL
 

 Key: SPARK-3354
 URL: https://issues.apache.org/jira/browse/SPARK-3354
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nicholas Chammas

 It's very common when working on data sets of strings to want to know the 
 lengths of the strings you are analyzing. Say I have some Tweets and want to 
 find the average length of a Tweet by language.
 {code}
 SELECT language, AVG(LEN(tweet)) AS avg_length
 FROM tweets
 GROUP BY language
 ORDER BY avg_length DESC;
 {code}
 This is currently not possible because Spark SQL doesn't have a {{LEN()}} 
 function.
 Another common function that would be useful is one that gives the size of 
 the data item in bytes. This can be handy when moving data from Spark SQL to 
 another system and you need to know how to size the receiving fields 
 appropriately.
 *Proposal*
 * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This 
 function returns the number of characters in a string expression.
 * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of 
 {{DATALENGTH}}. This function returns the number of bytes in any expression.
 *Special care* must be given to the following cases:
 * multi-byte characters
 * {{NULL}}
 * trailing spaces
 *Examples*
 These are suggestions for how these 2 functions should work.
 {code}
 LEN('Hello') -> 5
 LEN('안녕') -> 2
 LEN('Hello 안녕') -> 8
 LEN(NULL) -> NULL
 LEN('') -> 0
 LEN('Bob  ') -> 3
 {code}
 In this last example with {{'Bob  '}}, trailing spaces are ignored. This 
 matches the [behavior of SQL 
 Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could 
 opt to include the spaces.
 {code}
 DATALEN('Hello') -> 5
 DATALEN('안녕') -> 4
 DATALEN('Hello 안녕') -> 16
 DATALEN(NULL) -> NULL
 DATALEN('') -> 0
 DATALEN('Bob  ') -> 5
 {code}
 Note here how mixing English and Korean characters causes every character to 
 be interpreted as a 2-byte-wide character. Dunno if this is sane; this may 
 depend on Scala or JVM details that I wouldn't know about at the moment.
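To make the byte-count question above concrete, here is a small JVM-level sketch (plain Scala string APIs, not Spark SQL, and not a claim about what DATALENGTH should return):
{code}
// How the byte count of the example string depends on the chosen encoding.
val s = "Hello 안녕"
s.length                        // 8  -- number of characters
s.getBytes("UTF-8").length      // 12 -- the Korean characters are 3 bytes each in UTF-8
s.getBytes("UTF-16LE").length   // 16 -- every character is 2 bytes, matching the example above
{code}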



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3343:

Summary: Support for CREATE TABLE AS SELECT that specifies the format  
(was: Unsupported language features in query)

 Support for CREATE TABLE AS SELECT that specifies the format
 

 Key: SPARK-3343
 URL: https://issues.apache.org/jira/browse/SPARK-3343
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: HuQizhong

 hql("CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS
 TERMINATED BY ',' LINES TERMINATED BY '\n' as  SELECT SUM(uv) as uv,
 round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM
 tmp_adclick_sellplat ")
 14/09/02 15:32:28 INFO ParseDriver: Parse Completed
 java.lang.RuntimeException:
 Unsupported language features in query: CREATE TABLE
 tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
 TERMINATED BY 'abc' as  SELECT SUM(uv) as uv, round(SUM(cost),2) as
 total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255)
 at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
 at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3343:

 Target Version/s: 1.2.0
Affects Version/s: (was: 1.0.2)
   Issue Type: New Feature  (was: Bug)

 Support for CREATE TABLE AS SELECT that specifies the format
 

 Key: SPARK-3343
 URL: https://issues.apache.org/jira/browse/SPARK-3343
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: HuQizhong

 hql("CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS
 TERMINATED BY ',' LINES TERMINATED BY '\n' as  SELECT SUM(uv) as uv,
 round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM
 tmp_adclick_sellplat ")
 14/09/02 15:32:28 INFO ParseDriver: Parse Completed
 java.lang.RuntimeException:
 Unsupported language features in query: CREATE TABLE
 tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
 TERMINATED BY 'abc' as  SELECT SUM(uv) as uv, round(SUM(cost),2) as
 total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255)
 at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
 at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118849#comment-14118849
 ] 

Apache Spark commented on SPARK-3019:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2240

 Pluggable block transfer (data plane communication) interface
 -

 Key: SPARK-3019
 URL: https://issues.apache.org/jira/browse/SPARK-3019
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Attachments: PluggableBlockTransferServiceProposalforSpark - draft 
 1.pdf


 The attached design doc proposes a standard interface for block transferring, 
 which will make future engineering of this functionality easier, allowing the 
 Spark community to provide alternative implementations.
 Block transferring is a critical function in Spark. All of the following 
 depend on it:
 * shuffle
 * torrent broadcast
 * block replication in BlockManager
 * remote block reads for tasks scheduled without locality
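Purely as an illustration of what a pluggable interface in this area could look like (a hypothetical sketch; the real interface is defined in the attached design doc and the linked PR):
{code}
import java.nio.ByteBuffer

// Hypothetical sketch of a pluggable block-transfer interface; names and signatures are invented.
trait BlockTransferService {
  // Fetch the given blocks from a remote executor; implementations might use NIO, Netty, etc.
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String]): Map[String, ByteBuffer]

  // Push a single block to a remote executor (e.g. for replication).
  def uploadBlock(host: String, port: Int, blockId: String, data: ByteBuffer): Unit
}
{code}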



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3336:

Assignee: Davies Liu

 [Spark SQL] In pyspark, cannot group by field on UDF
 

 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master.
 Cannot group by field on a UDF. But we can group by UDF in Scala.
 For example:
 q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY 
 MYUDF(foo)')
 out = q.collect()
 I got this exception:
 Py4JJavaError: An error occurred while calling o183.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 
 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 
 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: pythonUDF#1278
 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 scala.collection.immutable.List.foreach(List.scala:318)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 

[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3336:

Target Version/s: 1.2.0

 [Spark SQL] In pyspark, cannot group by field on UDF
 

 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master.
 Cannot group by field on a UDF. But we can group by UDF in Scala.
 For example:
 q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY 
 MYUDF(foo)')
 out = q.collect()
 I got this exception:
 Py4JJavaError: An error occurred while calling o183.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 
 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 
 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: pythonUDF#1278
 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 scala.collection.immutable.List.foreach(List.scala:318)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 

[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3335:

Assignee: Davies Liu

 [Spark SQL] In pyspark, cannot use broadcast variables in UDF 
 --

 Key: SPARK-3335
 URL: https://issues.apache.org/jira/browse/SPARK-3335
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master, spark sql cannot 
 use broadcast variables in UDF. But we can use broadcast variable in spark in 
 scala.
 For example,
 bar={"a":"aa", "b":"bb", "c":"abc"}
 foo=sc.broadcast(bar)
 sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
 q= sqlContext.sql('SELECT MYUDF(c)  FROM foobar')
 out = q.collect()
 Got the following exception:
 Py4JJavaError: An error occurred while calling o169.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 
 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File /root/spark/python/pyspark/worker.py, line 75, in main
 command = pickleSer._read_with_length(infile)
   File /root/spark/python/pyspark/serializers.py, line 150, in 
 _read_with_length
 return self.loads(obj)
   File /root/spark/python/pyspark/broadcast.py, line 41, in _from_id
 raise Exception(Broadcast variable '%s' not loaded! % bid)
 Exception: (Exception(Broadcast variable '21' not loaded!,), function 
 _from_id at 0x35042a8, (21L,))
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at scala.Option.foreach(Option.scala:236)
   at 
 

[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3335:

Target Version/s: 1.2.0

 [Spark SQL] In pyspark, cannot use broadcast variables in UDF 
 --

 Key: SPARK-3335
 URL: https://issues.apache.org/jira/browse/SPARK-3335
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master, spark sql cannot 
 use broadcast variables in UDF. But we can use broadcast variable in spark in 
 scala.
 For example,
 bar={"a":"aa", "b":"bb", "c":"abc"}
 foo=sc.broadcast(bar)
 sqlContext.registerFunction("MYUDF", lambda x: foo.value[x] if x else '')
 q= sqlContext.sql('SELECT MYUDF(c)  FROM foobar')
 out = q.collect()
 Got the following exception:
 Py4JJavaError: An error occurred while calling o169.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 
 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 
 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File /root/spark/python/pyspark/worker.py, line 75, in main
 command = pickleSer._read_with_length(infile)
   File /root/spark/python/pyspark/serializers.py, line 150, in 
 _read_with_length
 return self.loads(obj)
   File /root/spark/python/pyspark/broadcast.py, line 41, in _from_id
 raise Exception(Broadcast variable '%s' not loaded! % bid)
 Exception: (Exception(Broadcast variable '21' not loaded!,), function 
 _from_id at 0x35042a8, (21L,))
 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
 
 org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154)
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
   at scala.Option.foreach(Option.scala:236)
   

[jira] [Updated] (SPARK-3329) HiveQuerySuite SET tests depend on map orderings

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3329:

Target Version/s: 1.2.0

 HiveQuerySuite SET tests depend on map orderings
 

 Key: SPARK-3329
 URL: https://issues.apache.org/jira/browse/SPARK-3329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: William Benton
Priority: Trivial

 The SET tests in HiveQuerySuite that return multiple values depend on the 
 ordering in which map pairs are returned from Hive and can fail spuriously if 
 this changes due to environment or library changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3109) Sql query with OR condition should be handled above PhysicalOperation layer

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3109.
-
Resolution: Won't Fix

Hey Alex, I'm going to close this as I'm not sure it's actually possible to do 
this optimization in the general case.  If I'm wrong, or if you have a more 
specific instance where you think this optimization could work, please feel free 
to reopen.

 Sql query with OR condition should be handled above PhysicalOperation layer
 ---

 Key: SPARK-3109
 URL: https://issues.apache.org/jira/browse/SPARK-3109
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2
Reporter: Alex Liu

 For query like 
 {code}
 select d, e  from test where a = 1 and b = 1 and c = 1 and d  20 or  d  0
 {code}
 Spark SQL pushes the whole query to PhysicalOperation. I haven't check how 
 Spark SQL internal query plan works, but I think OR condition in the above 
 query should be handled above physical operation. Physical operation should 
 have the following query
 {code} select d, e from test where a = 1 and b = 1 and c  = 1 and d  20 
 {code}
 OR
 {code}select d, e from test where d  0 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables

2014-09-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118886#comment-14118886
 ] 

Michael Armbrust commented on SPARK-3298:
-

Throwing an error here would be a breaking API change, so we'd want to do it 
with caution.  Regarding the concern about caching, once all references to the 
RDD are lost, the context cleaner will actually remove the blocks from the cache.
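For context, a minimal sketch of the overwrite behavior being discussed (file and table names are made up; this assumes a SQLContext named {{sqlContext}}):
{code}
// Registering two different SchemaRDDs under the same name: the second registration
// silently replaces the first, and no error is raised.
val first = sqlContext.jsonFile("first.json")
first.registerTempTable("people")

val second = sqlContext.jsonFile("second.json")
second.registerTempTable("people")   // "people" now resolves to `second`

// `first` is no longer reachable by name; as noted above, its cached blocks are
// reclaimed once all references to it are dropped and the context cleaner runs.
{code}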

 [SQL] registerAsTable / registerTempTable overwrites old tables
 ---

 Key: SPARK-3298
 URL: https://issues.apache.org/jira/browse/SPARK-3298
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Evan Chan
Priority: Minor
  Labels: newbie

 At least in Spark 1.0.2,  calling registerAsTable(a) when a had been 
 registered before does not cause an error.  However, there is no way to 
 access the old table, even though it may be cached and taking up space.
 How about at least throwing an error?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Target Version/s:   (was: 1.2.0)

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang

 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Target Version/s: 1.2.0

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang

 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118913#comment-14118913
 ] 

Matei Zaharia commented on SPARK-3098:
--

Yup, let's maybe document this for now. I'll create a JIRA for it.

  In some cases, operation zipWithIndex get a wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The reproduce code:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  => 
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118917#comment-14118917
 ] 

Reynold Xin commented on SPARK-2978:


I talked to [~pwendell] about this. How about this?

We can add two APIs to OrderedRDDFunctions (and whatever the Java equivalent 
is):
- sortWithinPartition
- repartitionAndSortWithinPartition

The first one is obvious, while the second is functionally equivalent to 
repartition followed by sortWithinPartition. The second is an optimization 
because it can push the sorting code into ShuffledRDD.
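To make the equivalence concrete, here is what the second proposed method would be functionally equivalent to with the existing API (a sketch only; the point of the proposal is that pushing the sort into the shuffle avoids this extra pass):
{code}
// Functional equivalent of the proposed repartitionAndSortWithinPartition, expressed with
// existing operators: repartition, then sort each partition locally in a second pass.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

val repartitionedAndSorted = pairs
  .repartition(4)
  .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator)
{code}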



 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118918#comment-14118918
 ] 

Matei Zaharia commented on SPARK-3098:
--

Created SPARK-3356 to track this.

  In some cases, operation zipWithIndex get a wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The reproduce code:
 {code}
  val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex() 
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
  => 
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2014-09-02 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3356:


 Summary: Document when RDD elements' ordering within partitions is 
nondeterministic
 Key: SPARK-3356
 URL: https://issues.apache.org/jira/browse/SPARK-3356
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Matei Zaharia


As reported in SPARK-3098, for example, for users of zipWithIndex, 
zipWithUniqueId, etc. (and maybe even things like mapPartitions) it's confusing 
that the order of elements in each partition after a shuffle operation is 
nondeterministic (unless the operation was sortByKey). We should explain this 
in the docs for the zip and partition-wise operations.

Another subtle issue is that the order of values for each key in groupBy / join 
/ etc. can be nondeterministic -- we need to explain that too.
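
A minimal sketch of the kind of example the docs could include (illustrative only; not taken from SPARK-3098):

{code}
import org.apache.spark.SparkContext._  // pair-RDD implicits (auto-imported in the shell)

val pairs = sc.parallelize(1 to 1000, 8).map(i => (i % 10, i))

// The order of the values collected for each key may differ from run to run:
val grouped = pairs.groupByKey()

// Likewise, the indices assigned after a shuffle may differ from run to run:
val shuffledIndex = pairs.repartition(4).zipWithIndex()

// Sorting by a unique key first makes the subsequent zipWithIndex deterministic:
val stableIndex = pairs.map(_.swap).sortByKey().zipWithIndex()
{code}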



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-09-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118921#comment-14118921
 ] 

Michael Armbrust commented on SPARK-2883:
-

Can you elaborate on what you mean here? If you create a table that uses the 
ORC SerDe, you can call table(tableName) to get an RDD for it. Would that 
satisfy your use case?

I think it is unlikely that we will have the resources to write a native ORC 
decoder that does not use the Hive SerDes anytime in the near future.
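
For example, something like the following (a sketch only; the table name is hypothetical and assumes the table was created with STORED AS ORC in the Hive metastore):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// "web_logs" is a made-up Hive table stored as ORC; reading it goes through the
// ORC SerDe path rather than a native decoder.
val orcTable = hiveContext.table("web_logs")
orcTable.count()
{code}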

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang

 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3098.
--
Resolution: Won't Fix

  In some cases, operation zipWithIndex get a wrong results
 --

 Key: SPARK-3098
 URL: https://issues.apache.org/jira/browse/SPARK-3098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Guoqiang Li
Priority: Critical

 The code to reproduce the issue:
 {code}
 val c = sc.parallelize(1 to 7899).flatMap { i =>
   (1 to 1).toSeq.map(p => i * 6000 + p)
 }.distinct().zipWithIndex()
 c.join(c).filter(t => t._2._1 != t._2._2).take(3)
 {code}
 =>
 {code}
  Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), 
 (36579712,(13,14)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-09-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118960#comment-14118960
 ] 

Marcelo Vanzin commented on SPARK-3215:
---

For those who'd prefer to see some code, here's a proof-of-concept:
https://github.com/vanzin/spark/tree/SPARK-3215/remote

Please ignore the fact that it's a module inside Spark; I picked a different 
package name so that I didn't end up using any internal Spark APIs. I just 
wanted to avoid having to write build code.

In particular, focus on this package (and *not* what's inside impl):
https://github.com/vanzin/spark/tree/SPARK-3215/remote/src/main/scala/org/apache/spark_remote

That's all a user would see; what happens inside impl does not matter to the 
user. If you really want to look at the implementation code, it's currently 
using Akka and has very little error handling.
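
For a rough sense of the shape of the user-facing side, here is a purely hypothetical sketch (this is not the API in the package linked above; names and signatures are illustrative only):

{code}
// Hypothetical client-side interface for a SparkContext running in a separate process.
trait RemoteSparkContext {
  def submitJob(jobClassName: String, args: Seq[String]): String  // returns a job id
  def jobStatus(jobId: String): String
  def cancelJob(jobId: String): Unit
  def stop(): Unit
}
{code}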

 Add remote interface for SparkContext
 -

 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
  Labels: hive
 Attachments: RemoteSparkContext.pdf


 A quick description of the issue: as part of running Hive jobs on top of 
 Spark, it's desirable to have a SparkContext that is running in the 
 background and listening for job requests for a particular user session.
 Running multiple contexts in the same JVM is not a very good solution. Not 
 only does SparkContext currently have issues sharing the same JVM among multiple 
 instances, but doing so also turns the JVM running the contexts into a huge 
 bottleneck in the system.
 So I'm proposing a solution where we have a SparkContext that is running in a 
 separate process, and listening for requests from the client application via 
 some RPC interface (most probably Akka).
 I'll attach a document shortly with the current proposal. Let's use this bug 
 to discuss the proposal and any other suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13

2014-09-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118979#comment-14118979
 ] 

Zhan Zhang commented on SPARK-2706:
---

send out pull request https://github.com/apache/spark/pull/2241

 Enable Spark to support Hive 0.13
 -

 Key: SPARK-2706
 URL: https://issues.apache.org/jira/browse/SPARK-2706
 Project: Spark
  Issue Type: Dependency upgrade
  Components: SQL
Affects Versions: 1.0.1
Reporter: Chunjun Xiao
Assignee: Zhan Zhang
 Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, 
 spark-hive.err, v1.0.2.diff


 It seems Spark cannot work well with Hive 0.13.
 When I compiled Spark against Hive 0.13.1, I got the error messages attached 
 below.
 So, when will Spark be able to support Hive 0.13?
 Compilation error:
 {quote}
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180:
  type mismatch;
  found   : String
  required: Array[String]
 [ERROR]   val proc: CommandProcessor = 
 CommandProcessorFactory.get(tokens(0), hiveconf)
 [ERROR]  ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264:
  overloaded method constructor TableDesc with alternatives:
   (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: 
 Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc 
 and
   ()org.apache.hadoop.hive.ql.plan.TableDesc
  cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], 
 Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in 
 value tableDesc)(in value tableDesc)], java.util.Properties)
 [ERROR]   val tableDesc = new TableDesc(
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140:
  value getPartitionPath is not a member of 
 org.apache.hadoop.hive.ql.metadata.Partition
 [ERROR]   val partPath = partition.getPartitionPath
 [ERROR]^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132:
  value appendReadColumnNames is not a member of object 
 org.apache.hadoop.hive.serde2.ColumnProjectionUtils
 [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, 
 attributes.map(_.name))
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79:
  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
 [ERROR]   new HiveDecimal(bd.underlying())
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132:
  type mismatch;
  found   : org.apache.hadoop.fs.Path
  required: String
 [ERROR]   
 SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179:
  value getExternalTmpFileURI is not a member of 
 org.apache.hadoop.hive.ql.Context
 [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209:
  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
 [ERROR]   case bd: BigDecimal => new HiveDecimal(bd.underlying())
 [ERROR]  ^
 [ERROR] 8 errors found
 [DEBUG] Compilation failed (CompilerInterface)
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM .. SUCCESS [2.579s]
 [INFO] Spark Project Core  SUCCESS [2:39.805s]
 [INFO] Spark Project Bagel ... SUCCESS [21.148s]
 [INFO] Spark Project GraphX .. SUCCESS [59.950s]
 [INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
 [INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
 [INFO] Spark Project Tools ... SUCCESS [15.405s]
 [INFO] Spark Project Catalyst  SUCCESS [1:17.405s]
 [INFO] Spark Project SQL . SUCCESS [1:11.094s]
 [INFO] Spark Project Hive  FAILURE [11.121s]
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project YARN Parent POM . SKIPPED
 [INFO] Spark Project YARN Stable API . 

[jira] [Created] (SPARK-3357) Internal log messages should be set at DEBUG level instead of INFO

2014-09-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3357:


 Summary: Internal log messages should be set at DEBUG level 
instead of INFO
 Key: SPARK-3357
 URL: https://issues.apache.org/jira/browse/SPARK-3357
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


spark-shell shows INFO by default, so we should carefully choose what to show 
at INFO level. For example, if I run

{code}
sc.parallelize(0 until 100).count()
{code}

and wait for a minute or so, I will see messages mixed in with the current 
input box, which is annoying:

{code}
scala> 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0
14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0
14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from 
memory (free 278019440)
14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0
{code}

Does a user need to know when a broadcast variable is removed? Maybe not.
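
Until the default levels change, one possible workaround from the shell (assuming the bundled log4j backend) is to raise the level for the noisy classes:

{code}
import org.apache.log4j.{Level, Logger}

// Silence the block-manager and context-cleaner chatter at the REPL prompt.
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.ContextCleaner").setLevel(Level.WARN)
{code}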



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3357) Internal log messages should be set at DEBUG level instead of INFO

2014-09-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3357:
-
Description: 
spark-shell shows INFO by default, so we should carefully choose what to show 
at INFO level. For example, if I run

{code}
sc.parallelize(0 until 100).count()
{code}

and wait for a minute or so, I will see messages that mix with the current 
input box, which is annoying:

{code}
scala> 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0
14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0
14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from 
memory (free 278019440)
14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0
{code}

Does a user need to know when a broadcast variable is removed? Maybe not.

  was:
spark-shell shows INFO by default, so we should carefully choose what to show 
at INFO level. For example, if I run

{code}
sc.parallelize(0 until 100).count()
{code}

and wait for one minute or so. I will see messages that mixed with the current 
input box, which is annoying:

{code}
scala> 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0
14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0
14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from 
memory (free 278019440)
14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0
{code}

Does a user need to know when a broadcast variable is removed? Maybe not.


 Internal log messages should be set at DEBUG level instead of INFO
 --

 Key: SPARK-3357
 URL: https://issues.apache.org/jira/browse/SPARK-3357
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 spark-shell shows INFO by default, so we should carefully choose what to show 
 at INFO level. For example, if I run
 {code}
 sc.parallelize(0 until 100).count()
 {code}
 and wait for a minute or so, I will see messages that mix with the current 
 input box, which is annoying:
 {code}
 scala> 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0
 14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0
 14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped 
 from memory (free 278019440)
 14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0
 {code}
 Does a user need to know when a broadcast variable is removed? Maybe not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119057#comment-14119057
 ] 

Patrick Wendell commented on SPARK-3350:


Can you provide the stacktrace?

 Strange anomaly trying to write a SchemaRDD into an Avro file
 -

 Key: SPARK-3350
 URL: https://issues.apache.org/jira/browse/SPARK-3350
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: jdk1.7, macosx
Reporter: David Greco
 Attachments: AvroWriteTestCase.scala


 I found a way to automatically save a SchemaRDD in Avro format, similar to 
 what Spark does with Parquet files.
 I attached a test case to this issue. The code fails with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3350:
---
Component/s: (was: Input/Output)
 SQL

 Strange anomaly trying to write a SchemaRDD into an Avro file
 -

 Key: SPARK-3350
 URL: https://issues.apache.org/jira/browse/SPARK-3350
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: jdk1.7, macosx
Reporter: David Greco
 Attachments: AvroWriteTestCase.scala


 I found a way to automatically save a SchemaRDD in Avro format, similar to 
 what Spark does with Parquet files.
 I attached a test case to this issue. The code fails with an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3346) Error happened in using memcached

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3346.

Resolution: Invalid

Would you mind reporting this to the user list? We use JIRA only for issues 
that have been diagnosed in more detail. Also, please include the source code 
on the user list; that will make it easier to debug.

 Error happened in using memcached
 -

 Key: SPARK-3346
 URL: https://issues.apache.org/jira/browse/SPARK-3346
 Project: Spark
  Issue Type: Bug
 Environment: CDH5.1.0
 Spark1.0.0
Reporter: Gavin Zhang

 I finished a distributed project in Hadoop Streaming, and it worked fine 
 using memcached storage during mapping. (It's a Python project.)
 Now I want to move it to Spark, but when I call the memcached library, two 
 errors occur during computation:
 1. File "memcache.py", line 414, in get
 rkey, rlen = self._expectvalue(server)
 ValueError: too many values to unpack
 2. File "memcache.py", line 714, in check_key
 return key.translate(ill_map)
 TypeError: character mapping must return integer, None or unicode
 After adding exception handling, no cache value was successfully retrieved at 
 all. However, it works in Hadoop Streaming without any error. Why?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3322) ConnectionManager logs an error when the application ends

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3322:
---
Summary: ConnectionManager logs an error when the application ends  (was: 
Log a ConnectionManager error when the application ends)

 ConnectionManager logs an error when the application ends
 -

 Key: SPARK-3322
 URL: https://issues.apache.org/jira/browse/SPARK-3322
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: wangfei

 Although it does not influence the result, it always logs an error from 
 ConnectionManager.
 Sometimes it only logs ConnectionManagerId(vm2,40992) not found, and sometimes 
 it also logs a CancelledKeyException.
 The log output is as follows:
 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to 
 ConnectionManagerId(vm2,40992) not found
 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? 
 sun.nio.ch.SelectionKeyImpl@457245f9
 java.nio.channels.CancelledKeyException
 at 
 org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
 at 
 org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances

2014-09-02 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3358:
-

 Summary: PySpark worker fork()ing performance regression in m3.* / 
PVM instances
 Key: SPARK-3358
 URL: https://issues.apache.org/jira/browse/SPARK-3358
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: m3.* instances on EC2
Reporter: Josh Rosen


SPARK-2764 (and some followup commits) simplified PySpark's worker process 
structure by removing an intermediate pool of processes forked by daemon.py.  
Previously, daemon.py forked a fixed-size pool of processes that shared a 
socket and handled worker launch requests from Java.  After my patch, this 
intermediate pool was removed and launch requests are handled directly in 
daemon.py.

Unfortunately, this seems to have increased PySpark task launch latency when 
running on m3* class instances in EC2.  Most of this difference can be 
attributed to m3 instances' more expensive fork() system calls.  I tried the 
following microbenchmark on m3.xlarge and r3.xlarge instances:

{code}
import os

for x in range(1000):
  if os.fork() == 0:
    exit()
{code}

On the r3.xlarge instance:

{code}
real    0m0.761s
user    0m0.008s
sys     0m0.144s
{code}

And on m3.xlarge:

{code}
real    0m1.699s
user    0m0.012s
sys     0m1.008s
{code}

I think this is due to HVM vs PVM EC2 instances using different virtualization 
technologies with different fork costs.

It may be the case that this performance difference only appears in certain 
microbenchmarks and is masked by other performance improvements in PySpark, 
such as improvements to large group-bys.  I'm in the process of re-running 
spark-perf benchmarks on m3 instances in order to confirm whether this impacts 
more realistic jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3328.

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 2228
[https://github.com/apache/spark/pull/2228]

 ./make-distribution.sh --with-tachyon build is broken
 -

 Key: SPARK-3328
 URL: https://issues.apache.org/jira/browse/SPARK-3328
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Elijah Epifanov
 Fix For: 1.1.0


 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
 file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken

2014-09-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3328:
---
Assignee: prudhvi krishna

 ./make-distribution.sh --with-tachyon build is broken
 -

 Key: SPARK-3328
 URL: https://issues.apache.org/jira/browse/SPARK-3328
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Elijah Epifanov
Assignee: prudhvi krishna
 Fix For: 1.1.0


 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such 
 file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2978:
---
Target Version/s: 1.2.0

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13

2014-09-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119079#comment-14119079
 ] 

Apache Spark commented on SPARK-2706:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2241

 Enable Spark to support Hive 0.13
 -

 Key: SPARK-2706
 URL: https://issues.apache.org/jira/browse/SPARK-2706
 Project: Spark
  Issue Type: Dependency upgrade
  Components: SQL
Affects Versions: 1.0.1
Reporter: Chunjun Xiao
Assignee: Zhan Zhang
 Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, 
 spark-hive.err, v1.0.2.diff


 It seems Spark cannot work well with Hive 0.13.
 When I compiled Spark against Hive 0.13.1, I got the error messages attached 
 below.
 So, when will Spark be able to support Hive 0.13?
 Compilation error:
 {quote}
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180:
  type mismatch;
  found   : String
  required: Array[String]
 [ERROR]   val proc: CommandProcessor = 
 CommandProcessorFactory.get(tokens(0), hiveconf)
 [ERROR]  ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264:
  overloaded method constructor TableDesc with alternatives:
   (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: 
 Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc 
 and
   ()org.apache.hadoop.hive.ql.plan.TableDesc
  cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], 
 Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in 
 value tableDesc)(in value tableDesc)], java.util.Properties)
 [ERROR]   val tableDesc = new TableDesc(
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140:
  value getPartitionPath is not a member of 
 org.apache.hadoop.hive.ql.metadata.Partition
 [ERROR]   val partPath = partition.getPartitionPath
 [ERROR]^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132:
  value appendReadColumnNames is not a member of object 
 org.apache.hadoop.hive.serde2.ColumnProjectionUtils
 [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, 
 attributes.map(_.name))
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79:
  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
 [ERROR]   new HiveDecimal(bd.underlying())
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132:
  type mismatch;
  found   : org.apache.hadoop.fs.Path
  required: String
 [ERROR]   
 SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf))
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179:
  value getExternalTmpFileURI is not a member of 
 org.apache.hadoop.hive.ql.Context
 [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation)
 [ERROR]   ^
 [ERROR] 
 /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209:
  org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor
 [ERROR]   case bd: BigDecimal => new HiveDecimal(bd.underlying())
 [ERROR]  ^
 [ERROR] 8 errors found
 [DEBUG] Compilation failed (CompilerInterface)
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM .. SUCCESS [2.579s]
 [INFO] Spark Project Core  SUCCESS [2:39.805s]
 [INFO] Spark Project Bagel ... SUCCESS [21.148s]
 [INFO] Spark Project GraphX .. SUCCESS [59.950s]
 [INFO] Spark Project ML Library .. SUCCESS [1:08.771s]
 [INFO] Spark Project Streaming ... SUCCESS [1:17.759s]
 [INFO] Spark Project Tools ... SUCCESS [15.405s]
 [INFO] Spark Project Catalyst  SUCCESS [1:17.405s]
 [INFO] Spark Project SQL . SUCCESS [1:11.094s]
 [INFO] Spark Project Hive  FAILURE [11.121s]
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project YARN Parent POM . SKIPPED
 [INFO] Spark Project 

[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119080#comment-14119080
 ] 

Sandy Ryza commented on SPARK-2978:
---

What's the thinking behind adding sortWithinPartition? It shouldn't be 
difficult to add, but I can't think of a situation where it would be useful 
without a repartition beforehand.

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119084#comment-14119084
 ] 

Reynold Xin commented on SPARK-2978:


It has been asked for multiple times by various users. I think the use case is to 
provide a robust external sort implementation without exposing the 
ExternalSorter API. It doesn't need to be part of this change.


 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances

2014-09-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119089#comment-14119089
 ] 

Josh Rosen commented on SPARK-3358:
---

Credit where it's due: Davies pointed out the potential for this problem in the 
original PR: https://github.com/apache/spark/pull/1680#issuecomment-50721351

The Redis team did their own benchmarking on this 
(http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure (or 
https://web.archive.org/web/20140529181436/http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure,
 since their site may be down / slow right now)).

Based on those results, and updated numbers at 
http://redislabs.com/blog/benchmarking-the-new-aws-m3-instances-with-redis, it 
looks like HVM AMIs don't have this problem.  I'm going to try running a 
similar microbenchmark on m3.xlarge with the spark-ec2 HVM AMI to see if that 
improves performance.  If so, we should consider changing from PVM to HVM for 
those instance types.

 PySpark worker fork()ing performance regression in m3.* / PVM instances
 ---

 Key: SPARK-3358
 URL: https://issues.apache.org/jira/browse/SPARK-3358
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: m3.* instances on EC2
Reporter: Josh Rosen

 SPARK-2764 (and some followup commits) simplified PySpark's worker process 
 structure by removing an intermediate pool of processes forked by daemon.py.  
 Previously, daemon.py forked a fixed-size pool of processes that shared a 
 socket and handled worker launch requests from Java.  After my patch, this 
 intermediate pool was removed and launch requests are handled directly in 
 daemon.py.
 Unfortunately, this seems to have increased PySpark task launch latency when 
 running on m3* class instances in EC2.  Most of this difference can be 
 attributed to m3 instances' more expensive fork() system calls.  I tried the 
 following microbenchmark on m3.xlarge and r3.xlarge instances:
 {code}
 import os
 for x in range(1000):
   if os.fork() == 0:
     exit()
 {code}
 On the r3.xlarge instance:
 {code}
 real  0m0.761s
 user  0m0.008s
 sys   0m0.144s
 {code}
 And on m3.xlarge:
 {code}
 real    0m1.699s
 user    0m0.012s
 sys     0m1.008s
 {code}
 I think this is due to HVM vs PVM EC2 instances using different 
 virtualization technologies with different fork costs.
 It may be the case that this performance difference only appears in certain 
 microbenchmarks and is masked by other performance improvements in PySpark, 
 such as improvements to large group-bys.  I'm in the process of re-running 
 spark-perf benchmarks on m3 instances in order to confirm whether this 
 impacts more realistic jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-02 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119091#comment-14119091
 ] 

Sandy Ryza commented on SPARK-2978:
---

Ah ok, sounds good.

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


