[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117918#comment-14117918 ] Patrick Wendell commented on SPARK-1701: I think that's a straw man. A closer review once there is a patch might reveal some corner cases. Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2633: - Attachment: Spark listener enhancement for Hive on Spark job monitor and statistic.docx enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Based on Hive on Spark job status monitoring and statistic collection requirement, try to enhance spark listener API to gather more spark job information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2633) enhance spark listener API to gather more spark job information
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated SPARK-2633: - Attachment: (was: Spark listener enhancement for Hive on Spark job monitor and statistic.docx) enhance spark listener API to gather more spark job information --- Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Priority: Critical Labels: hive Attachments: Spark listener enhancement for Hive on Spark job monitor and statistic.docx Based on Hive on Spark job status monitoring and statistic collection requirement, try to enhance spark listener API to gather more spark job information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2636) Expose job ID in JobWaiter API
[ https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2636. Resolution: Fixed Fix Version/s: 1.2.0 Expose job ID in JobWaiter API -- Key: SPARK-2636 URL: https://issues.apache.org/jira/browse/SPARK-2636 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: hive Fix For: 1.2.0 In Hive on Spark, we want to track Spark job status through the Spark API. The basic idea is as follows: # create a Hive-specific Spark listener and register it with the Spark listener bus. # the Hive-specific Spark listener generates job status from Spark listener events. # the Hive driver tracks job status through the Hive-specific Spark listener. The current problem is that the Hive driver needs a job identifier to track a specific job's status through the Spark listener, but there is no Spark API that returns a job identifier (like a job id) when submitting a Spark job. I think any other project that tries to track job status with the Spark API would suffer from this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
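For illustration, a minimal sketch (in Scala, against the public SparkListener API) of the listener-based tracking described above. The class and method names are made up for this example, not the actual Hive-on-Spark listener, and it assumes the caller already knows the job id, which is exactly the gap this issue addresses:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import scala.collection.concurrent.TrieMap

// Illustrative listener: records per-job completion, keyed by job id.
class JobStatusListener extends SparkListener {
  private val finished = TrieMap[Int, Boolean]()

  override def onJobStart(start: SparkListenerJobStart): Unit = {
    finished(start.jobId) = false
  }

  override def onJobEnd(end: SparkListenerJobEnd): Unit = {
    finished(end.jobId) = true
  }

  // None if the job was never seen, Some(false) while running, Some(true) once done.
  def isFinished(jobId: Int): Option[Boolean] = finished.get(jobId)
}

// Registered on the driver, e.g.: sc.addSparkListener(new JobStatusListener())
{code}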
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117999#comment-14117999 ] Reynold Xin commented on SPARK-2978: [~sandyryza] instead of adding a new API, what if Hive project just creates a utility function that does partition, sort, and then walking down the sorted list to provide grouping similar to MR? The reason I'm asking about this is that eventually we would want to make groupByKey itself support sort and spill. But it is fairly tricky to design as you've already pointed out, so it could take a while to finalize that API. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
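A rough sketch of the utility function suggested in the comment above: hash-partition by key, sort each partition, and walk the sorted stream to emit MR-style (key, values) groups. The function name is made up, and the in-memory sort is a deliberate simplification; a real version would need to sort and spill externally, which is the hard part the comment alludes to:

{code}
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Illustrative only: emits each key's values consecutively by walking a sorted iterator.
def groupSorted[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Iterable[V])] = {
  val ord = implicitly[Ordering[K]]
  rdd.partitionBy(new HashPartitioner(numPartitions)).mapPartitions { iter =>
    // Simplification: sorts the whole partition in memory (no spilling).
    val sorted = iter.toArray.sortBy(_._1)(ord).iterator.buffered
    new Iterator[(K, Iterable[V])] {
      def hasNext: Boolean = sorted.hasNext
      def next(): (K, Iterable[V]) = {
        val key = sorted.head._1
        val values = scala.collection.mutable.ArrayBuffer[V]()
        while (sorted.hasNext && ord.equiv(sorted.head._1, key)) {
          values += sorted.next()._2
        }
        (key, values)
      }
    }
  }
}
{code}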
[jira] [Commented] (SPARK-1499) Workers continuously produce failing executors
[ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118011#comment-14118011 ] Alex Burghelea commented on SPARK-1499: --- I have also encountered the same problem. Is there any workaround/solution for this issue? I am running the Spark cluster using a Vagrant setup with 4 machines. All the nodes behave this way. [~adrian-wang] I can confirm that the worker nodes log the messages posted by you. Workers continuously produce failing executors -- Key: SPARK-1499 URL: https://issues.apache.org/jira/browse/SPARK-1499 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Aaron Davidson If a node is in a bad state, such that newly started executors fail on startup or first use, the Standalone Cluster Worker will happily keep spawning new ones. A better behavior would be for a Worker to mark itself as dead if it has had a history of continuously producing erroneous executors, or else to somehow prevent a driver from re-registering executors from the same machine repeatedly. Reported on mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E Relevant logs: {noformat} 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/4 is now FAILED (Command exited with code 53) 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/4 removed: Command exited with code 53 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 (already removed): Failed to create local directory (bad spark.local.dir?)
14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/27 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/27 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now RUNNING 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 with 32.7 GB RAM 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=wikistats_pd 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=wikistats_pd 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string projectcode, string pagename, i32 pageviews, i32 bytes} 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, string, int, int] separator=[[B@29a81175] nullstring=\N lastColumnTakesRest=false shark 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] with ID 27 show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 disconnected, so removing it 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 (already removed): remote Akka client disassociated 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now FAILED (Command exited with code 53) 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/27 removed: Command exited with code 53 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/28 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/28 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/28 is now RUNNING tables; {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2312) Spark Actors do not handle unknown messages in their receive methods
[ https://issues.apache.org/jira/browse/SPARK-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isaias Barroso resolved SPARK-2312. --- Resolution: Fixed Spark Actors do not handle unknown messages in their receive methods Key: SPARK-2312 URL: https://issues.apache.org/jira/browse/SPARK-2312 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kam Kasravi Assignee: Isaias Barroso Priority: Minor Labels: starter Fix For: 1.1.0 Original Estimate: 24h Remaining Estimate: 24h Per the akka documentation, an actor should provide a pattern match for all messages, including _, otherwise akka.actor.UnhandledMessage will be propagated. Noted actors: MapOutputTrackerMasterActor, ClientActor, Master, Worker... Should minimally do a logWarning(s"Received unexpected actor system event: $_") so the message info is logged in the correct actor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
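A hedged sketch of the catch-all clause the issue asks for; the actor and message types below are illustrative stand-ins, not one of the Spark actors listed above:

{code}
import akka.actor.Actor
import org.apache.spark.Logging

case class KnownMessage(payload: String)
case object Ack

// Illustrative actor: handles its known message and logs anything unexpected,
// instead of letting it escape as akka.actor.UnhandledMessage.
class ExampleActor extends Actor with Logging {
  def receive = {
    case KnownMessage(_) =>
      sender ! Ack
    case other =>
      logWarning(s"Received unexpected actor system event: $other")
  }
}
{code}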
[jira] [Created] (SPARK-3344) Reformat code: add blank lines
WangTaoTheTonic created SPARK-3344: -- Summary: Reformat code: add blank lines Key: SPARK-3344 URL: https://issues.apache.org/jira/browse/SPARK-3344 Project: Spark Issue Type: Test Components: Deploy Reporter: WangTaoTheTonic Priority: Trivial There should be blank lines between test cases, so this does some code reformatting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3344) Reformat code: add blank lines
[ https://issues.apache.org/jira/browse/SPARK-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118038#comment-14118038 ] Apache Spark commented on SPARK-3344: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2234 Reformat code: add blank lines -- Key: SPARK-3344 URL: https://issues.apache.org/jira/browse/SPARK-3344 Project: Spark Issue Type: Test Components: Deploy Reporter: WangTaoTheTonic Priority: Trivial There should be blank lines between test cases, so this does some code reformatting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118069#comment-14118069 ] Xu Zhongxing commented on SPARK-3005: - By tasks themselves already died and exited, I mean that even if we do nothing in killTasks(), there won't be any zombie tasks left on the slaves. This is what I get from testing. If I'm wrong, please correct me. But the logic here is incomplete or inconsistent, and needs to be fixed. Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at
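For context, a hedged sketch (not the attached SPARK-3005_1.diff) of what a non-throwing killTask in the fine-grained Mesos backend could look like, assuming the backend holds the Mesos SchedulerDriver and that the Mesos task id was derived from the Spark task id at launch time:

{code}
import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Illustrative only: forward the kill request to Mesos instead of inheriting
// SchedulerBackend.killTask's default UnsupportedOperationException.
def killTask(mesosDriver: SchedulerDriver, sparkTaskId: Long): Unit = {
  val mesosTaskId = TaskID.newBuilder().setValue(sparkTaskId.toString).build()
  mesosDriver.killTask(mesosTaskId)
}
{code}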
[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118069#comment-14118069 ] Xu Zhongxing edited comment on SPARK-3005 at 9/2/14 9:59 AM: - By tasks themselves already died and exited, I mean that even if we do nothing in killTasks(), there won't be any zombie tasks left on the slaves. This is what I get from testing the Mesos fine-grained mode. If I'm wrong, please correct me. But the logic here is incomplete or inconsistent, and needs to be fixed. was (Author: xuzhongxing): By tasks themselves already died and exited, I mean that even if we do nothing in killTasks(), there won't be any zombie tasks left on the slaves. This is what I get from testing. If I'm wrong, please correct me. But the logic here is incomplete or inconsistent, and needs to be fixed. Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
[jira] [Created] (SPARK-3345) Do correct parameters for ShuffleFileGroup
Liang-Chi Hsieh created SPARK-3345: -- Summary: Do correct parameters for ShuffleFileGroup Key: SPARK-3345 URL: https://issues.apache.org/jira/browse/SPARK-3345 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor In the method newFileGroup of class FileShuffleBlockManager, the parameters for creating the new ShuffleFileGroup object are in the wrong order. Wrong: new ShuffleFileGroup(fileId, shuffleId, files) Correct: new ShuffleFileGroup(shuffleId, fileId, files) Because the parameters shuffleId and fileId are not used in the current code, this doesn't cause a problem now. However, it should be corrected for readability and to avoid future problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
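As a small aside on why swaps like this compile silently, the stand-in class below (not Spark's actual ShuffleFileGroup) shows how named arguments at the call site make the intent explicit and order-independent:

{code}
// Stand-in class with the same parameter shape as described above.
class FileGroup(val shuffleId: Int, val fileId: Int, val files: Array[java.io.File])

val files = Array.empty[java.io.File]
// Positional call: swapping shuffleId and fileId still compiles because both are Int.
val swapped = new FileGroup(42, 7, files)
// Named call: order no longer matters and the intent is explicit.
val correct = new FileGroup(shuffleId = 7, fileId = 42, files = files)
{code}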
[jira] [Commented] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager
[ https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118173#comment-14118173 ] Thomas Graves commented on SPARK-3168: -- Spark is using the ServletContextHandler. Which has an option to create sessionManager but is not there by default. I never added support for it as many basic filters do not need it. I do not know anything about CAS and haven't done much with sessions themselves. It looks like we would have to investigate more what is really needed by CAS. Is any session manager ok, specific ones, what else on the backend do we need to do with clearing sessions or saving them, etc. I would consider this a new feature. The ServletContextHandler of webui lacks a SessionManager - Key: SPARK-3168 URL: https://issues.apache.org/jira/browse/SPARK-3168 Project: Spark Issue Type: Bug Components: Spark Core Environment: CAS Reporter: meiyoula When i use CAS to realize single sign of webui, it occurs a exception: {code} WARN [qtp1076146544-24] / org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561) java.lang.IllegalStateException: No SessionManager at org.eclipse.jetty.server.Request.getSession(Request.java:1269) at org.eclipse.jetty.server.Request.getSession(Request.java:1248) at org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
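For reference, a hedged sketch of how a Jetty ServletContextHandler can be given a session manager; class names are from the Jetty 8/9 API, and whether this alone is enough for CAS (or whether sessions also need clearing/saving on the backend) is exactly the open question above:

{code}
import org.eclipse.jetty.server.session.{HashSessionManager, SessionHandler}
import org.eclipse.jetty.servlet.ServletContextHandler

// Option 1: ask Jetty to create a default session handler for the context.
val handler = new ServletContextHandler(ServletContextHandler.SESSIONS)

// Option 2: attach one explicitly, e.g. an in-memory HashSessionManager.
handler.setSessionHandler(new SessionHandler(new HashSessionManager()))
{code}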
[jira] [Created] (SPARK-3347) Spark alpha doesn't compile due to SPARK-2889
Thomas Graves created SPARK-3347: Summary: Spark alpha doesn't compile due to SPARK-2889 Key: SPARK-3347 URL: https://issues.apache.org/jira/browse/SPARK-3347 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Priority: Blocker Spark on yarn alpha doesn't compile after SPARK-2889 was merged in. [ERROR] /home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43: not found: value SparkHadoopUtil -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118282#comment-14118282 ] Nicholas Chammas commented on SPARK-1701: - Oh absolutely; sorry, didn't mean to mischaracterize your recommendations. I think this simplistic summary should help direct an initial PR which can then be reviewed in more detail. Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Priority: Minor Labels: starter Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118286#comment-14118286 ] Sandy Ryza commented on SPARK-2978: --- IIUC, that would require using ShuffledRDD directly. Would we be comfortable taking off the DeveloperAPI tag? Another option that would allow us to avoid making the groupBy decision would be exposing a repartitionAndSortWithinPartition transform. Then Hive would handle the grouping on the sorted stream. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3347) Spark on yarn alpha doesn't compile due to SPARK-2889
[ https://issues.apache.org/jira/browse/SPARK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118374#comment-14118374 ] Apache Spark commented on SPARK-3347: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2236 Spark on yarn alpha doesn't compile due to SPARK-2889 - Key: SPARK-3347 URL: https://issues.apache.org/jira/browse/SPARK-3347 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Priority: Blocker Spark on yarn alpha doesn't compile after SPARK-2889 was merged in. [ERROR] /home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43: not found: value SparkHadoopUtil -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118397#comment-14118397 ] Guoqiang Li commented on SPARK-3098: The following RDD operations seem to have this problem: {{zip}},{{zipWithIndex}},{{zipWithUniqueId}},{{zipPartitions}},{{groupBy}},{{groupByKey}}. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduction code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3331) PEP8 tests fail because they check unzipped py4j code
[ https://issues.apache.org/jira/browse/SPARK-3331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3331. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request [https://github.com/apache/spark/pull/] PEP8 tests fail because they check unzipped py4j code - Key: SPARK-3331 URL: https://issues.apache.org/jira/browse/SPARK-3331 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Priority: Minor Fix For: 1.2.0 PEP8 tests run on files under ./python, but unzipped py4j code is found at ./python/build/py4j. Py4J code fails style checks and can fail ./dev/run-tests if this code is present locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
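One hedged way to work around this, assuming the style check shells out to the pep8 tool over ./python: exclude the unzipped build directory from the run (the actual fix in the resolving pull request may differ):

{code}
# Check Spark's own Python sources while skipping the unzipped Py4J code
# under python/build; pep8's --exclude takes comma-separated patterns.
pep8 --exclude=build python
{code}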
[jira] [Commented] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118418#comment-14118418 ] Andrew Or commented on SPARK-3061: -- Here's a friendly reminder for myself to back port this into branch-1.1 after the release. Others please disregard this comment. :) Maven build fails in Windows OS --- Key: SPARK-3061 URL: https://issues.apache.org/jira/browse/SPARK-3061 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Environment: Windows Reporter: Masayoshi TSUZUKI Assignee: Josh Rosen Priority: Minor Maven build fails in Windows OS with this error message. {noformat} [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default) on project spark-core_2.10: Command execution failed. Cannot run program unzip (in directory C:\path\to\gitofspark\python): CreateProcess error=2, w肳ꂽt@ - [Help 1] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3061) Maven build fails in Windows OS
[ https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3061: - Fix Version/s: 1.2.0 Maven build fails in Windows OS --- Key: SPARK-3061 URL: https://issues.apache.org/jira/browse/SPARK-3061 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2, 1.1.0 Environment: Windows Reporter: Masayoshi TSUZUKI Assignee: Josh Rosen Priority: Minor Fix For: 1.2.0 Maven build fails in Windows OS with this error message. {noformat} [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec (default) on project spark-core_2.10: Command execution failed. Cannot run program unzip (in directory C:\path\to\gitofspark\python): CreateProcess error=2, w肳ꂽt@ - [Help 1] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)
[ https://issues.apache.org/jira/browse/SPARK-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118421#comment-14118421 ] Andrew Or commented on SPARK-1919: -- Here's a reminder for myself to back port this into branch-1.1 after the release. In Windows, Spark shell cannot load classes in spark.jars (--jars) -- Key: SPARK-1919 URL: https://issues.apache.org/jira/browse/SPARK-1919 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.2.0 Not sure what the issue is, but Spark submit does not have the same problem, even if the jars specified are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)
[ https://issues.apache.org/jira/browse/SPARK-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1919: - Fix Version/s: 1.2.0 In Windows, Spark shell cannot load classes in spark.jars (--jars) -- Key: SPARK-1919 URL: https://issues.apache.org/jira/browse/SPARK-1919 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.2.0 Not sure what the issue is, but Spark submit does not have the same problem, even if the jars specified are the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118427#comment-14118427 ] Reza Zadeh commented on SPARK-2885: --- Hi Clive, Can you please post the code you used to generate the error? I am having trouble reproducing this. There is a test for sparse vectors in the PR, which is not catching this. Reza All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Clive Cox updated SPARK-2885: - Attachment: SimilarItemsSmallTest.java This is a test I used. Sorry its not cleaned up, but hopefully you can reproduce the error. All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118445#comment-14118445 ] Reza Zadeh commented on SPARK-2885: --- I don't see your code. Looking at the exception however, when generating the random sparse vectors, did you remember to sort by indices? This is a requirement of breeze, and it looks like the example you have violate it: http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
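For illustration, a hedged sketch of constructing a row with sorted indices via MLlib's sparse-vector factory (assuming the DIMSUM input is an RDD of such vectors feeding a RowMatrix); the size and values below are made up:

{code}
import org.apache.spark.mllib.linalg.Vectors

// Indices passed to Vectors.sparse must be in ascending order, matching the
// breeze SparseVector requirement mentioned above.
val row = Vectors.sparse(10, Array(1, 4, 7), Array(0.5, 1.0, 2.0))
// An unsorted index array such as Array(7, 1, 4) is the kind of input that triggers the error.
{code}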
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118455#comment-14118455 ] Patrick Wendell commented on SPARK-3333: Hey [~nchammas] unfortunately we couldn't reproduce this at a smaller scale. The reproduction here is a slightly pathological case (24,000 empty tasks) so it's not totally clear to me that this would affect any actual workload. Also by default I think Spark launches with a very small amount of driver memory, so it might be that there is GC happening due to an increase in the amount of memory required for tasks, the web UI, or other meta-data and that's why it's slower. It would be good to log GC data by setting spark.driver.extraJavaOptions to -XX:+PrintGCDetails in spark-defaults.conf. I'll make a call soon about whether to let this block the release. If we can't narrow it down in time, it's not worth holding a bunch of other features for this... we can fix issues in a patch release shortly if we find any. Large number of partitions causes OOM - Key: SPARK-3333 URL: https://issues.apache.org/jira/browse/SPARK-3333 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}.
Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at
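For reference, the GC-logging setup suggested in the comment above would look roughly like this in conf/spark-defaults.conf (flag spelling per the HotSpot option; the comment's follow-up also suggests retrying with a larger driver heap, e.g. 10 GB):

{code}
# conf/spark-defaults.conf
spark.driver.extraJavaOptions  -XX:+PrintGCDetails
spark.driver.memory            10g
{code}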
[jira] [Comment Edited] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118455#comment-14118455 ] Patrick Wendell edited comment on SPARK- at 9/2/14 6:15 PM: Hey [~nchammas] unfortunately we couldn't reproduce this at a smaller scale. The reproduction here is a slightly pathological case (24,000 empty tasks) so it's not totally clear to me that this would affect any actual workload. Also by default I think Spark launches with a very small amount of driver memory, so it might be that there is GC happening due to an increase in the amount of memory required for tasks, the web UI, or other meta-data and that's why it's slower. It would be good to log GC data by setting spark.driver.extraJavaOptions in to -XX:+printGCDetails in spark-defaults.conf and see if you launch this with something more sane (a 10GB heap, e.g.) does it still show a difference. I'll make a call soon about whether to let this block the release. If we can't narrow it down in time, it's not worth holding a bunch of other features for this... we can fix issues in a patch release shortly if we find any. was (Author: pwendell): Hey [~nchammas] unfortunately we couldn't reproduce this at a smaller scale. The reproduction here is a slightly pathological case (24,000 empty tasks) so it's not totally clear to me that this would affect any actual workload. Also by default I think Spark launches with a very small amount of driver memory, so it might be that there is GC happening due to an increase in the amount of memory required for tasks, the web UI, or other meta-data and that's why it's slower. It would be good to log GC data by setting spark.driver.extraJavaOptions in to -XX:+printGCDetails in spark-defaults.conf. I'll make a call soon about whether to let this block the release. If we can't narrow it down in time, it's not worth holding a bunch of other features for this... we can fix issues in a patch release shortly if we find any. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. 
Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464 ] Reza Zadeh commented on SPARK-2885: --- Yes, looking at your code it looks like you need to sort the indices. All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118464#comment-14118464 ] Reza Zadeh edited comment on SPARK-2885 at 9/2/14 6:20 PM: --- Yes, looking at your code it looks like you need to sort by indices. was (Author: rezazadeh): Yes, looking at your code it looks like you need to sort the indices. All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
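The "sort by indices" advice above matters because MLlib's sparse vector constructor expects indices in increasing order; if a test builds vectors from unordered (index, value) pairs, sorting them first should resolve the confusion. A minimal Scala sketch under that assumption:
{code}
import org.apache.spark.mllib.linalg.Vectors

// (index, value) pairs as they might arrive from upstream code, out of order.
val unsorted = Seq((7, 0.5), (2, 1.0), (5, 3.0))

// Sort by index before constructing the vector.
val (indices, values) = unsorted.sortBy(_._1).unzip
val v = Vectors.sparse(10, indices.toArray, values.toArray)

println(v)  // (10,[2,5,7],[1.0,3.0,0.5])
{code}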
[jira] [Created] (SPARK-3349) [Spark SQL] Incorrect partitioning after LIMIT operator
Eric Liang created SPARK-3349: - Summary: [Spark SQL] Incorrect partitioning after LIMIT operator Key: SPARK-3349 URL: https://issues.apache.org/jira/browse/SPARK-3349 Project: Spark Issue Type: Bug Reporter: Eric Liang Reproduced by the following example: import org.apache.spark.sql.catalyst.plans.Inner import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions._ val series = sql(select distinct year from sales order by year asc limit 10) val results = sql(select * from sales) series.registerTempTable(series) results.registerTempTable(results) sql(select * from results inner join series where results.year = series.year).count --- java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.ShuffleDependency.init(Dependency.scala:79) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:189) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:189) at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:298) at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:310) at org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:246) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:723) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1333) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was 
sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
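The {code} snippet in the report above lost its string quoting; with quotes restored, and assuming a Spark SQL shell where sql() is in scope and a sales table is registered, the same reproduction is roughly:
{code}
// Same reproduction as in the report, with string literals restored.
val series = sql("select distinct year from sales order by year asc limit 10")
val results = sql("select * from sales")

series.registerTempTable("series")
results.registerTempTable("results")

// Joining the LIMITed result against the full table triggers
// "Can't zip RDDs with unequal numbers of partitions".
sql("select * from results inner join series where results.year = series.year").count
{code}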
[jira] [Created] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
David Greco created SPARK-3350: -- Summary: Strange anomaly trying to write a SchemaRDD into an Avro file Key: SPARK-3350 URL: https://issues.apache.org/jira/browse/SPARK-3350 Project: Spark Issue Type: Bug Components: Input/Output Environment: jdk1.7, macosx Reporter: David Greco I found a way to automatically save a SchemaRDD in Avro format, similarly to what Spark does with Parquet files. I attached a test case to this issue. The code fails with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3347) Spark on yarn alpha doesn't compile due to SPARK-2889
[ https://issues.apache.org/jira/browse/SPARK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3347. -- Resolution: Fixed Fix Version/s: 1.2.0 Spark on yarn alpha doesn't compile due to SPARK-2889 - Key: SPARK-3347 URL: https://issues.apache.org/jira/browse/SPARK-3347 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Marcelo Vanzin Priority: Blocker Fix For: 1.2.0 Spark on yarn alpha doesn't compile after SPARK-2889 was merged in. [ERROR] /home/tgraves/tgravescs-spark/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:43: not found: value SparkHadoopUtil -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
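A "not found: value SparkHadoopUtil" error usually just means the symbol fell out of scope after the refactor; a plausible one-line fix in the yarn/alpha Client.scala would be an explicit import, sketched below as a guess rather than the merged patch:
{code}
// Hypothetical fix sketch for yarn/alpha Client.scala
import org.apache.spark.deploy.SparkHadoopUtil
{code}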
[jira] [Updated] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
[ https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Greco updated SPARK-3350: --- Attachment: AvroWriteTestCase.scala This is the failing test case. Hope it helps. Strange anomaly trying to write a SchemaRDD into an Avro file - Key: SPARK-3350 URL: https://issues.apache.org/jira/browse/SPARK-3350 Project: Spark Issue Type: Bug Components: Input/Output Environment: jdk1.7, macosx Reporter: David Greco Attachments: AvroWriteTestCase.scala I found a way to automatically save a SchemaRDD in Avro format, similarly to what Spark does with Parquet files. I attached a test case to this issue. The code fails with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
[ https://issues.apache.org/jira/browse/SPARK-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3052: - Assignee: Sandy Ryza Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop --- Key: SPARK-3052 URL: https://issues.apache.org/jira/browse/SPARK-3052 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118566#comment-14118566 ] Matthew Farrellee commented on SPARK-3181: -- Please excuse my changes to this issue; I'm not planning to work on it, but I can't seem to remove myself as the assignee. Add Robust Regression Algorithm with Huber Estimator Key: SPARK-3181 URL: https://issues.apache.org/jira/browse/SPARK-3181 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Assignee: Matthew Farrellee Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the error has a normal distribution and can behave badly when the errors are heavy-tailed. In practice we get various types of data. We need to include Robust Regression to employ a fitting criterion that is not as vulnerable as least squares. In 1973, Huber introduced M-estimation for regression, where M stands for maximum-likelihood type. The method is resistant to outliers in the response variable and has been widely used. The new feature for MLlib will contain 3 new files /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala and one new class HuberRobustGradient in /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
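For context on the proposal above, the Huber criterion is quadratic for small residuals and linear for large ones, which is what makes it robust to outliers. Below is an illustrative Scala sketch of the loss and its gradient, not the proposed MLlib code; k = 1.345 is a conventional tuning constant.
{code}
// Huber loss for residual e and threshold k:
//   0.5 * e^2            if |e| <= k
//   k * |e| - 0.5 * k^2  otherwise
def huberLoss(e: Double, k: Double = 1.345): Double =
  if (math.abs(e) <= k) 0.5 * e * e
  else k * math.abs(e) - 0.5 * k * k

// Its derivative, which a gradient-based optimizer would use.
def huberGradient(e: Double, k: Double = 1.345): Double =
  if (math.abs(e) <= k) e
  else k * math.signum(e)

println(huberLoss(0.5))   // quadratic region: 0.125
println(huberLoss(10.0))  // linear region: grows only linearly with the outlier
{code}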
[jira] [Created] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE
Thomas Graves created SPARK-3351: Summary: Yarn YarnRMClientImpl.shutdown can be called before register - NPE Key: SPARK-3351 URL: https://issues.apache.org/jira/browse/SPARK-3351 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If the SparkContext exits while its in the applicationmaster.waitForSparkContextInitialized then the YarnRMClientImpl.shutdown can be called before register and you get a null pointer exception on the uihistoryAddress. 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.) Exception in thread main java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121) at org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73) at org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
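The stack trace above shows setTrackingUrl being handed a null uihistoryAddress because shutdown ran before register. A small, self-contained sketch of the kind of guard that avoids this follows; the class and member names are illustrative, not the real YarnRMClientImpl API.
{code}
// Illustrative guard: never unregister if register() never ran.
class RmClientSketch {
  @volatile private var registered = false
  private var uiHistoryAddress: String = _  // stays null until register() runs

  def register(historyAddress: String): Unit = {
    uiHistoryAddress = historyAddress
    registered = true
  }

  def shutdown(): Unit = {
    if (!registered) {
      // Shutting down before register(): skip unregistration rather than
      // passing a null tracking URL to YARN (the NPE in the trace above).
      println("never registered; skipping unregister")
    } else {
      println("unregistering, tracking URL = " + uiHistoryAddress)
    }
  }
}

val client = new RmClientSketch()
client.shutdown()  // safe even though register() was never called
{code}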
[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE
[ https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118574#comment-14118574 ] Thomas Graves commented on SPARK-3351: -- Note I was running on yarn alpha but I think the same applies for yarn stable. Yarn YarnRMClientImpl.shutdown can be called before register - NPE -- Key: SPARK-3351 URL: https://issues.apache.org/jira/browse/SPARK-3351 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If the SparkContext exits while its in the applicationmaster.waitForSparkContextInitialized then the YarnRMClientImpl.shutdown can be called before register and you get a null pointer exception on the uihistoryAddress. 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.) Exception in thread main java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121) at org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73) at org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118593#comment-14118593 ] Patrick Wendell commented on SPARK-: Hey [~nchammas] one other thing. If you go back to the original job that was causing this issue, can you see any regression? This benchmark is pathological enough that it would be good to see if there is actually an issue in Spark. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) at
[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
[ https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3181: -- Assignee: (was: Matthew Farrellee) Add Robust Regression Algorithm with Huber Estimator Key: SPARK-3181 URL: https://issues.apache.org/jira/browse/SPARK-3181 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the error has a normal distribution and can behave badly when the errors are heavy-tailed. In practice we get various types of data. We need to include Robust Regression to employ a fitting criterion that is not as vulnerable as least squares. In 1973, Huber introduced M-estimation for regression, where M stands for maximum-likelihood type. The method is resistant to outliers in the response variable and has been widely used. The new feature for MLlib will contain 3 new files /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala and one new class HuberRobustGradient in /main/scala/org/apache/spark/mllib/optimization/Gradient.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3351) Yarn YarnRMClientImpl.shutdown can be called before register - NPE
[ https://issues.apache.org/jira/browse/SPARK-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118616#comment-14118616 ] Thomas Graves commented on SPARK-3351: -- Note that this failure scenario also can cause the staging directory to be cleaned up but yarn will retry the application master attempt and fail since its been deleted. Yarn YarnRMClientImpl.shutdown can be called before register - NPE -- Key: SPARK-3351 URL: https://issues.apache.org/jira/browse/SPARK-3351 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves If the SparkContext exits while its in the applicationmaster.waitForSparkContextInitialized then the YarnRMClientImpl.shutdown can be called before register and you get a null pointer exception on the uihistoryAddress. 14/09/02 18:59:21 INFO ApplicationMaster: Finishing ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.) Exception in thread main java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnServiceProtos$FinishApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:2312) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.FinishApplicationMasterRequestPBImpl.setTrackingUrl(FinishApplicationMasterRequestPBImpl.java:121) at org.apache.spark.deploy.yarn.YarnRMClientImpl.shutdown(YarnRMClientImpl.scala:73) at org.apache.spark.deploy.yarn.ApplicationMaster.finish(ApplicationMaster.scala:140) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:178) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:113) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
Hari Shreedharan created SPARK-3352: --- Summary: Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
[ https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118644#comment-14118644 ] Patrick Wendell commented on SPARK-3352: For my part, I actually found PollingStream to be nicer sounding than Pull Based... but maybe for others this is a worse option. Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
[ https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118650#comment-14118650 ] Hari Shreedharan commented on SPARK-3352: - Sorry, should have provided more context here. TD suggested the Pull based name since it was easier to understand, especially since the other was push based. Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
[ https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118649#comment-14118649 ] Apache Spark commented on SPARK-3352: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/2238 Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118669#comment-14118669 ] Clive Cox commented on SPARK-2885: -- Ah. Thanks. Sorry for the confusion. All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118695#comment-14118695 ] Josh Rosen commented on SPARK-: --- I was unable to reproduce this on a cluster with two *r3*.xlarge instances (trying on *m3* shortly) using the following script: {code} from pyspark import SparkContext sc = SparkContext(appName=test) a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) parallelism = sc.defaultParallelism a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y, parallelism).take(1) {code} In both 1.0.2 and 1.1.0-rc3, the job ran in ~1 minute 10 seconds (with essentially no timing difference between the two versions). I'm going to try this again on m3 instances and double-check both configurations. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at 
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Commented] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118701#comment-14118701 ] Patrick Wendell commented on SPARK-3330: I reviewed the patch and it seemed to me like `mvn clean` already correctly clears the assembly target directory regardless of the profile used (?) Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3330: --- Comment: was deleted (was: I reviewed the patch and it seemed to me like `mvn clean` already correctly clears the assembly target directory regardless of the profile used (?)) Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of mvn clean test, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118735#comment-14118735 ] Nicholas Chammas commented on SPARK-: - {quote} If we can't narrow it down in time, it's not worth holding a bunch of other features for this... we can fix issues in a patch release shortly if we find any. {quote} Sounds fine to me. I'm trying to retrace how I originally stumbled on this but am currently tied up with other stuff. I'll report back when I can. It looks like the issue is limited to some specific setup which Josh seems to be slowly honing in on, which is great. Just FYI Josh, if it matters, I've been running the little benchmarks reported here via in the shell, not using spark-submit. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at 
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118748#comment-14118748 ] Apache Spark commented on SPARK-1981: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/2239 Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Fix For: 1.1.0 Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)
Reynold Xin created SPARK-3353: -- Summary: Stage id monotonicity (parent stage should have lower stage id) Key: SPARK-3353 URL: https://issues.apache.org/jira/browse/SPARK-3353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Reynold Xin The way stage IDs are generated is that parent stages actually have higher stage ids. This is very confusing because parent stages get scheduled and executed first. We should reverse that order so the scheduling timeline of stages (absent failures) is monotonic, i.e. stages that are executed first have lower stage ids. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
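A toy Scala sketch of the numbering scheme being proposed: give parent stages their ids before the stage that depends on them, so ids increase in (failure-free) execution order. The names are illustrative, not the DAGScheduler code.
{code}
import java.util.concurrent.atomic.AtomicInteger

case class StageSketch(id: Int, name: String, parents: Seq[StageSketch])

val nextStageId = new AtomicInteger(0)

def newStage(name: String, parentNames: Seq[String]): StageSketch = {
  // Number the parent stages first...
  val parents = parentNames.map(p => StageSketch(nextStageId.getAndIncrement(), p, Nil))
  // ...then the stage itself, so the stage that runs last gets the highest id.
  StageSketch(nextStageId.getAndIncrement(), name, parents)
}

val result = newStage("resultStage", Seq("shuffleA", "shuffleB"))
println(result)  // parents get ids 0 and 1, the final stage gets id 2
{code}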
[jira] [Commented] (SPARK-2895) Support mapPartitionsWithContext in Spark Java API
[ https://issues.apache.org/jira/browse/SPARK-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118759#comment-14118759 ] Xuefu Zhang commented on SPARK-2895: Hi [~rxin], could you review [~chengxiang li]'s patch when you get a chance? This is needed for our hive-on-spark milestone #1. Thanks. Support mapPartitionsWithContext in Spark Java API -- Key: SPARK-2895 URL: https://issues.apache.org/jira/browse/SPARK-2895 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: hive This is a requirement from Hive on Spark: mapPartitionsWithContext only exists in the Spark Scala API, and we expect to access it from the Spark Java API. For HIVE-7627 and HIVE-7843, Hive operators which are invoked in the mapPartitions closure need to get the taskId. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
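For readers unfamiliar with the Scala-side API being requested for Java: mapPartitionsWithContext passes a TaskContext into the closure, which is how Hive operators would read task-level information such as the partition id. A rough usage sketch in Scala, assuming a SparkContext named sc:
{code}
// Scala-only today; SPARK-2895 asks for an equivalent in the Java API.
val rdd = sc.parallelize(1 to 10, 2)

val tagged = rdd.mapPartitionsWithContext { (ctx, iter) =>
  // TaskContext exposes task-level info to the closure.
  iter.map(x => (ctx.partitionId, x))
}

tagged.collect().foreach(println)
{code}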
[jira] [Created] (SPARK-3354) Add LENGTH and DATALENGTH functions to Spark SQL
Nicholas Chammas created SPARK-3354: --- Summary: Add LENGTH and DATALENGTH functions to Spark SQL Key: SPARK-3354 URL: https://issues.apache.org/jira/browse/SPARK-3354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nicholas Chammas It's very common when working on data sets of strings to want to know the lengths of the strings you are analyzing. Say I have some Tweets and want to find the average length of a Tweet by language. {code} SELECT language, AVG(LEN(tweet)) AS avg_length FROM tweets GROUP BY language ORDER BY avg_length DESC; {code} This is currently not possible because Spark SQL doesn't have a {{LEN()}} function. Another common function that would be useful is one that gives the size of the data item in bytes. This can be handy when moving data from Spark SQL to another system and you need to know how to size the receiving fields appropriately. *Proposal* * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This function returns the number of characters in a string expression. * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. This function returns the number of bytes in any expression. *Special care* must be given to the following cases: * multi-byte characters * {{NULL}} * trailing spaces *Examples* These are suggestions for how these 2 functions should work. {code} LEN('Hello') - 5 LEN('안녕') - 2 LEN('Hello 안녕') - 8 LEN(NULL) - NULL LEN('') - 0 LEN('Bob ') - 3 {code} In this last example with {{'Bob '}}, trailing spaces are ignored. This matches the [behavior of SQL Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt to include the spaces. {code} DATALEN('Hello') - 5 DATALEN('안녕') - 4 DATALEN('Hello 안녕') - 16 DATALEN(NULL) - NULL DATALEN('') - 0 DATALEN('Bob ') - 5 {code} Note here how mixing English and Korean characters causes every character to be interpreted as a 2 byte wide character. Dunno if this sane; this may depend on Scala or JVM details that I wouldn't know about at the moment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
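The character-vs-byte distinction above is easy to check on the JVM; the reporter's DATALEN numbers correspond to a 2-bytes-per-character encoding such as UTF-16, whereas UTF-8 uses 3 bytes per Hangul syllable. A quick Scala illustration:
{code}
val s = "Hello 안녕"

println(s.length)                       // 8 characters, i.e. what LEN would return
println(s.getBytes("UTF-8").length)     // 12 bytes: 6 ASCII chars plus 2 Hangul chars at 3 bytes each
println(s.getBytes("UTF-16LE").length)  // 16 bytes: 2 bytes per character, matching DATALEN('Hello 안녕') = 16 above
{code}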
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118775#comment-14118775 ] Josh Rosen commented on SPARK-: --- Tried this on a m3.xlarge cluster (1 master, 1 worker) using the same script from my previous comment. On this cluster, 1.1.0 was noticeably slower. In v1.0.2-rc1, it took ~220 seconds. In v1.1.0-rc3, it took ~360 seconds. Going to run again and keep the logs to see if I can spot where the extra time is coming from. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at 
org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014) java.nio.channels.ClosedChannelException at
[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118796#comment-14118796 ] Nicholas Chammas commented on SPARK-3176: - Reposting a [comment I made on the PR|https://github.com/apache/spark/pull/2099#discussion_r16691086] here, since it relates to {{POWER}}'s behavior. [~marmbrus] and [~xinyunh] - We should probably document that behavior in this JIRA issue for starters, and then also in the appropriate docs. {quote} Microsoft has some good documentation for how SQL Server handles these things. As an established and very popular product, SQL Server could provide y'all with a good reference implementation for this behavior. From their [documentation on {{POWER()}}|http://msdn.microsoft.com/en-us/library/ms174276.aspx]: Returns the same type as submitted in _float_expression_. For example, if a *decimal*(2,0) is submitted as _float_expression_, the result returned is *decimal*(2,0). There are a few good examples that follow. Microsoft also has some good documentation on [how precision, scale, and length are calculated for results of arithmetic operations|http://msdn.microsoft.com/en-us/library/ms190476.aspx]. {quote} Implement 'POWER', 'ABS and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical function POWER and ABS and the analytic function last to return a subset of the rows satisfying a query within spark sql. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-: -- Attachment: spark--logs.zip I've attached some logs from my most recent run, showing a ~100 second difference between the two releases. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Attachments: spark--logs.zip Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) at
[jira] [Updated] (SPARK-3355) Allow running maven tests in run-tests
[ https://issues.apache.org/jira/browse/SPARK-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3355: --- Summary: Allow running maven tests in run-tests (was: Allow running maven tests in run-tests and run-tests-jenkins) Allow running maven tests in run-tests -- Key: SPARK-3355 URL: https://issues.apache.org/jira/browse/SPARK-3355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides whether to run sbt or maven. This would allow us to simplify our build matrix in Jenkins... currently the maven builds run a totally different thing than the normal run-tests builds. The maven build currently does something like this: {code} mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package mvn test -Pprofile1 -Pprofile2 ... --fail-at-end {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS' and 'LAST' for sql
[ https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118822#comment-14118822 ] Xinyun Huang commented on SPARK-3176: - For now, the POWER function will implicitly convert the input to Double and return the result as a Double. Implement 'POWER', 'ABS' and 'LAST' for sql -- Key: SPARK-3176 URL: https://issues.apache.org/jira/browse/SPARK-3176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2, 1.1.0 Environment: All Reporter: Xinyun Huang Priority: Minor Fix For: 1.2.0 Original Estimate: 3h Remaining Estimate: 3h Add support for the mathematical functions POWER and ABS and the analytic function LAST to return a subset of the rows satisfying a query within Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
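To make the intended behaviour concrete, here is a minimal sketch of how the three functions could be exercised from the Scala shell once the patch lands. The table and column names are made up for illustration, and the example assumes the 1.x {{SQLContext}} API and a running {{sc}}:
{code}
import org.apache.spark.sql.SQLContext

case class Measurement(key: Int, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext._   // brings in the implicit RDD-to-SchemaRDD conversion

// register a small hypothetical table to query against
sc.parallelize((1 to 100).map(i => Measurement(i, i.toString)))
  .registerTempTable("measurements")

// POWER promotes its input to Double; ABS keeps the sign-stripped value
sqlContext.sql("SELECT key, POWER(key, 2), ABS(key - 50) FROM measurements").collect()

// LAST returns the last value of the column for the group (whole table here)
sqlContext.sql("SELECT LAST(value) FROM measurements").collect()
{code}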
[jira] [Created] (SPARK-3355) Allow running maven tests in run-tests and run-tests-jenkins
Patrick Wendell created SPARK-3355: -- Summary: Allow running maven tests in run-tests and run-tests-jenkins Key: SPARK-3355 URL: https://issues.apache.org/jira/browse/SPARK-3355 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell We should have a variable called AMPLAB_JENKINS_BUILD_TOOL that decides whether to run sbt or maven. This would allow us to simplify our build matrix in Jenkins... currently the maven builds run a totally different thing than the normal run-tests builds. The maven build currently does something like this: {code} mvn -DskipTests -Pprofile1 -Pprofile2 ... clean package mvn test -Pprofile1 -Pprofile2 ... --fail-at-end {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3330) Successive test runs with different profiles fail SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3330. -- Resolution: Won't Fix It would be more suitable to change Jenkins to run {{mvn clean}} followed by {{mvn ... package}} to address this, if anything. It's not yet clear this is the cause of the failure in the same test in Jenkins though. Successive test runs with different profiles fail SparkSubmitSuite -- Key: SPARK-3330 URL: https://issues.apache.org/jira/browse/SPARK-3330 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.2 Reporter: Sean Owen Maven-based Jenkins builds have been failing for a while: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/480/HADOOP_PROFILE=hadoop-2.4,label=centos/console One common cause is that on the second and subsequent runs of {{mvn clean test}}, at least two assembly JARs will exist in assembly/target. Because assembly is not a submodule of parent, mvn clean is not invoked for assembly. The presence of two assembly jars causes spark-submit to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118826#comment-14118826 ] Josh Rosen commented on SPARK-: --- 360/220 is approximately 1.6. Using Splunk, I computed the average task execution time from these logs. It looks like 1.1.0 took around 60ms per task, while 1.0.2 only took 37ms, and 60/37 is also about 1.6 (the standard deviations appear to be in the same ballpark, too).. It seems that the tasks themselves are running slightly slower on 1.1.0, at least according to these logs. Large number of partitions causes OOM - Key: SPARK- URL: https://issues.apache.org/jira/browse/SPARK- Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances Reporter: Nicholas Chammas Priority: Blocker Attachments: spark--logs.zip Here’s a repro for PySpark: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) {code} This code runs fine on 1.0.2. It returns the following result in just over a minute: {code} [(4, 'NickJohn')] {code} However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it runs for a very, very long time (at least 45 min) and then fails with {{java.lang.OutOfMemoryError: Java heap space}}. Here is a stack trace taken from a run on 1.1.0-rc2: {code} a = sc.parallelize([Nick, John, Bob]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent heart beats: 175143ms exceeds 45000ms 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 175359ms exceeds 45000ms 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent heart beats: 176816ms exceeds 45000ms 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 182241ms exceeds 45000ms 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3 java.lang.OutOfMemoryError: Java heap space at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Exception in thread Result resolver thread-3 14/08/29 21:56:26 ERROR SendingConnection: Exception while reading SendingConnection to
[jira] [Updated] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2686: Target Version/s: 1.2.0 (was: 1.1.1) Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 0.9.1, 0.9.2, 1.0.0, 1.1.0, 1.1.1 Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
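For readability, here is the description's SQL example again with the string quoting and {{=>}} arrows that the plain-text rendering above drops; this is a reconstruction of the example, not a copy of the patch's test code:
{code}
import org.apache.spark.sql._

case class TestData(key: Int, value: String)

val sqlc = new SQLContext(sc)
import sqlc._   // implicit conversion from RDD[TestData] to SchemaRDD

val testData: SchemaRDD =
  sqlc.sparkContext.parallelize((1 to 100).map(i => TestData(i, i.toString)))
testData.registerAsTable("testData")

sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect()
// Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
{code}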
[jira] [Updated] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2686: Affects Version/s: (was: 1.1.1) (was: 0.9.2) (was: 1.1.0) (was: 0.9.1) (was: 1.0.0) Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3354) Add LENGTH and DATALENGTH functions to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3354. - Resolution: Duplicate Closing this as a duplicate of [SPARK-2686]. Can you make your comments there or on the linked PR? Thanks! Add LENGTH and DATALENGTH functions to Spark SQL Key: SPARK-3354 URL: https://issues.apache.org/jira/browse/SPARK-3354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nicholas Chammas It's very common when working on data sets of strings to want to know the lengths of the strings you are analyzing. Say I have some Tweets and want to find the average length of a Tweet by language. {code} SELECT language, AVG(LEN(tweet)) AS avg_length FROM tweets GROUP BY language ORDER BY avg_length DESC; {code} This is currently not possible because Spark SQL doesn't have a {{LEN()}} function. Another common function that would be useful is one that gives the size of the data item in bytes. This can be handy when moving data from Spark SQL to another system and you need to know how to size the receiving fields appropriately. *Proposal* * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This function returns the number of characters in a string expression. * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}. This function returns the number of bytes in any expression. *Special care* must be given to the following cases: * multi-byte characters * {{NULL}} * trailing spaces *Examples* These are suggestions for how these 2 functions should work. {code} LEN('Hello') -> 5 LEN('안녕') -> 2 LEN('Hello 안녕') -> 8 LEN(NULL) -> NULL LEN('') -> 0 LEN('Bob ') -> 3 {code} In this last example with {{'Bob '}}, trailing spaces are ignored. This matches the [behavior of SQL Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt to include the spaces. {code} DATALEN('Hello') -> 5 DATALEN('안녕') -> 4 DATALEN('Hello 안녕') -> 16 DATALEN(NULL) -> NULL DATALEN('') -> 0 DATALEN('Bob ') -> 5 {code} Note here how mixing English and Korean characters causes every character to be interpreted as a 2-byte-wide character. Dunno if this is sane; this may depend on Scala or JVM details that I wouldn't know about at the moment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
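One point worth pinning down before settling DATALENGTH semantics: the byte count depends entirely on which encoding you measure in. The description's 4 for '안녕' and 16 for 'Hello 안녕' line up with UTF-16 (2 bytes per character), while its 5 for 'Hello' lines up with UTF-8, which is exactly the ambiguity the "special care" note is about. A plain-Scala check, outside Spark SQL, of what each convention would give:
{code}
// character count is encoding-independent
"안녕".length                              // 2
"Hello 안녕".length                        // 8

// byte count depends on the encoding chosen
"Hello".getBytes("UTF-8").length           // 5
"안녕".getBytes("UTF-8").length            // 6 (3 bytes per Hangul syllable)
"안녕".getBytes("UTF-16LE").length         // 4
"Hello 안녕".getBytes("UTF-16LE").length   // 16
{code}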
[jira] [Updated] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format
[ https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3343: Summary: Support for CREATE TABLE AS SELECT that specifies the format (was: Unsupported language features in query) Support for CREATE TABLE AS SELECT that specifies the format Key: SPARK-3343 URL: https://issues.apache.org/jira/browse/SPARK-3343 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: HuQizhong hql(CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' as SELECT SUM(uv) as uv, round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat ) 14/09/02 15:32:28 INFO ParseDriver: Parse Completed java.lang.RuntimeException: Unsupported language features in query: CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY 'abc' as SELECT SUM(uv) as uv, round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format
[ https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3343: Target Version/s: 1.2.0 Affects Version/s: (was: 1.0.2) Issue Type: New Feature (was: Bug) Support for CREATE TABLE AS SELECT that specifies the format Key: SPARK-3343 URL: https://issues.apache.org/jira/browse/SPARK-3343 Project: Spark Issue Type: New Feature Components: SQL Reporter: HuQizhong hql(CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' as SELECT SUM(uv) as uv, round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat ) 14/09/02 15:32:28 INFO ParseDriver: Parse Completed java.lang.RuntimeException: Unsupported language features in query: CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY 'abc' as SELECT SUM(uv) as uv, round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface
[ https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118849#comment-14118849 ] Apache Spark commented on SPARK-3019: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2240 Pluggable block transfer (data plane communication) interface - Key: SPARK-3019 URL: https://issues.apache.org/jira/browse/SPARK-3019 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Attachments: PluggableBlockTransferServiceProposalforSpark - draft 1.pdf The attached design doc proposes a standard interface for block transferring, which will make future engineering of this functionality easier, allowing the Spark community to provide alternative implementations. Block transferring is a critical function in Spark. All of the following depend on it: * shuffle * torrent broadcast * block replication in BlockManager * remote block reads for tasks scheduled without locality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
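Without having read the attached design doc, a rough idea of the shape such a pluggable interface might take is sketched below; the trait and method names here are purely illustrative, not the proposal's actual API:
{code}
import java.nio.ByteBuffer
import scala.concurrent.Future

// Hypothetical sketch only: a minimal contract that shuffle, torrent broadcast,
// block replication and remote block reads could all be written against, so the
// transport implementation behind it becomes swappable.
trait BlockTransferService {
  def init(hostName: String, port: Int): Unit
  def fetchBlock(host: String, port: Int, blockId: String): Future[ByteBuffer]
  def uploadBlock(host: String, port: Int, blockId: String, data: ByteBuffer): Future[Unit]
  def close(): Unit
}
{code}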
[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3336: Assignee: Davies Liu [Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176) 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54)
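Since the description notes that grouping by a UDF does work from Scala, here is a sketch of that working Scala-side equivalent for comparison. It assumes the 1.1-era {{registerFunction}} API and that a table {{bar}} with a string column {{foo}} (the names from the failing query above) is already registered:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Scala UDF registered under the same name the failing PySpark query uses
sqlContext.registerFunction("MYUDF", (s: String) => s.length)

// grouping by the UDF succeeds when the UDF lives on the JVM side
sqlContext.sql("SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)").collect()
{code}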
[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF
[ https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3336: Target Version/s: 1.2.0 [Spark SQL] In pyspark, cannot group by field on UDF Key: SPARK-3336 URL: https://issues.apache.org/jira/browse/SPARK-3336 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master. Cannot group by field on a UDF. But we can group by UDF in Scala. For example: q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo) FROM bar GROUP BY MYUDF(foo)') out = q.collect() I got this exception: Py4JJavaError: An error occurred while calling o183.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#1278 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.immutable.List.foreach(List.scala:318) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176) 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54)
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3335: Assignee: Davies Liu [Spark SQL] In pyspark, cannot use broadcast variables in UDF -- Key: SPARK-3335 URL: https://issues.apache.org/jira/browse/SPARK-3335 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master, spark sql cannot use broadcast variables in UDF. But we can use broadcast variable in spark in scala. For example, bar={a:aa, b:bb, c:abc} foo=sc.broadcast(bar) sqlContext.registerFunction(MYUDF, lambda x: foo.value[x] if x else ''). q= sqlContext.sql('SELECT MYUDF(c) FROM foobar') out = q.collect() Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /root/spark/python/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /root/spark/python/pyspark/serializers.py, line 150, in _read_with_length return self.loads(obj) File /root/spark/python/pyspark/broadcast.py, line 41, in _from_id raise Exception(Broadcast variable '%s' not loaded! % bid) Exception: (Exception(Broadcast variable '21' not loaded!,), function _from_id at 0x35042a8, (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at
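For contrast, a sketch of the Scala-side pattern the description says does work, with the broadcast value captured in the UDF closure; it assumes the 1.1-era {{registerFunction}} API and an already-registered table {{foobar}} with a string column {{c}}, mirroring the Python example above:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val bar = Map("a" -> "aa", "b" -> "bb", "c" -> "abc")
val foo = sc.broadcast(bar)

// the broadcast value is read inside the UDF closure on the executors
sqlContext.registerFunction("MYUDF", (x: String) => foo.value.getOrElse(x, ""))

sqlContext.sql("SELECT MYUDF(c) FROM foobar").collect()
{code}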
[jira] [Updated] (SPARK-3335) [Spark SQL] In pyspark, cannot use broadcast variables in UDF
[ https://issues.apache.org/jira/browse/SPARK-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3335: Target Version/s: 1.2.0 [Spark SQL] In pyspark, cannot use broadcast variables in UDF -- Key: SPARK-3335 URL: https://issues.apache.org/jira/browse/SPARK-3335 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: kay feng Assignee: Davies Liu Running pyspark on a spark cluster with standalone master, spark sql cannot use broadcast variables in UDF. But we can use broadcast variable in spark in scala. For example, bar={a:aa, b:bb, c:abc} foo=sc.broadcast(bar) sqlContext.registerFunction(MYUDF, lambda x: foo.value[x] if x else ''). q= sqlContext.sql('SELECT MYUDF(c) FROM foobar') out = q.collect() Got the following exception: Py4JJavaError: An error occurred while calling o169.collect. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 51.0 failed 4 times, most recent failure: Lost task 4.3 in stage 51.0 (TID 13040, ip-10-33-9-144.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /root/spark/python/pyspark/worker.py, line 75, in main command = pickleSer._read_with_length(infile) File /root/spark/python/pyspark/serializers.py, line 150, in _read_with_length return self.loads(obj) File /root/spark/python/pyspark/broadcast.py, line 41, in _from_id raise Exception(Broadcast variable '%s' not loaded! % bid) Exception: (Exception(Broadcast variable '21' not loaded!,), function _from_id at 0x35042a8, (21L,)) org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154) org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236)
[jira] [Updated] (SPARK-3329) HiveQuerySuite SET tests depend on map orderings
[ https://issues.apache.org/jira/browse/SPARK-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3329: Target Version/s: 1.2.0 HiveQuerySuite SET tests depend on map orderings Key: SPARK-3329 URL: https://issues.apache.org/jira/browse/SPARK-3329 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: William Benton Priority: Trivial The SET tests in HiveQuerySuite that return multiple values depend on the ordering in which map pairs are returned from Hive and can fail spuriously if this changes due to environment or library changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
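One way to make such assertions robust is to compare the SET output as an unordered collection of key=value pairs rather than as a positional sequence. A sketch, assuming the suite's {{hql}} helper and that each SET result row comes back as a single key=value string (the keys below are hypothetical):
{code}
// Compare as a Set so the assertion no longer depends on the order in which
// Hive hands back its map entries.
val expected = Set("testkey1=testval1", "testkey2=testval2")
val actual   = hql("SET").collect().map(_.getString(0)).toSet
assert(expected.subsetOf(actual))
{code}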
[jira] [Resolved] (SPARK-3109) Sql query with OR condition should be handled above PhysicalOperation layer
[ https://issues.apache.org/jira/browse/SPARK-3109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3109. - Resolution: Won't Fix Hey Alex, I'm going to close this as I'm not sure it's actually possible to do this optimization in the general case. If I'm wrong or if you have a more specific instance where you think this optimization could work, please feel free to reopen. Sql query with OR condition should be handled above PhysicalOperation layer --- Key: SPARK-3109 URL: https://issues.apache.org/jira/browse/SPARK-3109 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.2 Reporter: Alex Liu For a query like {code} select d, e from test where a = 1 and b = 1 and c = 1 and d 20 or d 0 {code} Spark SQL pushes the whole query to PhysicalOperation. I haven't checked how Spark SQL's internal query planning works, but I think the OR condition in the above query should be handled above physical operation. Physical operation should have the following query {code} select d, e from test where a = 1 and b = 1 and c = 1 and d 20 {code} OR {code}select d, e from test where d 0 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118886#comment-14118886 ] Michael Armbrust commented on SPARK-3298: - Throwing an error here would be a breaking API change, so we'd want to do it with caution. Regarding the concern about caching, once all references to the RDD are lost the context cleaner will actually remove the blocks from the cache. [SQL] registerAsTable / registerTempTable overwrites old tables --- Key: SPARK-3298 URL: https://issues.apache.org/jira/browse/SPARK-3298 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Priority: Minor Labels: newbie At least in Spark 1.0.2, calling registerAsTable(a) when a had been registered before does not cause an error. However, there is no way to access the old table, even though it may be cached and taking up space. How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2883: Target Version/s: (was: 1.2.0) Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2883: Target Version/s: 1.2.0 Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, the zipWithIndex operation gets wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118913#comment-14118913 ] Matei Zaharia commented on SPARK-3098: -- Yup, let's maybe document this for now. I'll create a JIRA for it. In some cases, the zipWithIndex operation gets wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduction code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118917#comment-14118917 ] Reynold Xin commented on SPARK-2978: I talked to [~pwendell] about this. How about this? We can add two APIs to OrderedRDDFunctions (and whatever the Java equivalent is): - sortWithinPartition - repartitionAndSortWithinPartition The first one is obvious, while the 2nd one is functionally equivalent to repartition followed by sortWithinPartition. The 2nd one is an optimization because it can push the sorting code into ShuffledRDD. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
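For reference, the second proposed operator would fold into a single shuffle what currently takes a repartition plus an explicit per-partition sort; with today's API the equivalent (less efficient) two-step version looks roughly like this:
{code}
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// what repartitionAndSortWithinPartition would do during the shuffle itself:
// today, partition first, then sort each partition's records by key
val shuffledThenSorted = pairs
  .partitionBy(new HashPartitioner(2))
  .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator, preservesPartitioning = true)
{code}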
[jira] [Commented] (SPARK-3098) In some cases, the zipWithIndex operation gets wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118918#comment-14118918 ] Matei Zaharia commented on SPARK-3098: -- Created SPARK-3356 to track this. In some cases, the zipWithIndex operation gets wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduction code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic
Matei Zaharia created SPARK-3356: Summary: Document when RDD elements' ordering within partitions is nondeterministic Key: SPARK-3356 URL: https://issues.apache.org/jira/browse/SPARK-3356 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Matei Zaharia As reported in SPARK-3098 for example, for users using zipWithIndex, zipWithUniqueId, etc, (and maybe even things like mapPartitions) it's confusing that the order of elements in each partition after a shuffle operation is nondeterministic (unless the operation was sortByKey). We should explain this in the docs for the zip and partition-wise operations. Another subtle issue is that the order of values for each key in groupBy / join / etc can be nondeterministic -- we need to explain that too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
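Until the documentation lands, the practical guidance implied here is to impose an explicit ordering before depending on positions; for example, sorting before zipWithIndex makes the assigned indices stable across evaluations. A sketch (the dummy pair values exist only to enable sortByKey):
{code}
import org.apache.spark.SparkContext._

// distinct() shuffles, so without a sort the within-partition order (and hence
// the indices assigned by zipWithIndex) can differ between evaluations of the RDD
val data = sc.parallelize(1 to 10000).map(_ * 6000).distinct()

val stableIds = data.map(x => (x, ())).sortByKey().keys.zipWithIndex()
{code}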
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118921#comment-14118921 ] Michael Armbrust commented on SPARK-2883: - Can you elaborate on what you mean here? If you create a table that uses the ORC SerDe you can do table(tableName) to get an RDD for it. Would that satisfy your use case? I think it is unlikely that we will have the resources to write a native ORC decoder that does not use the Hive SerDes anytime in the near future. Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
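Concretely, the SerDe-based path described above would look something like the sketch below. The table name is hypothetical, data loading is omitted, and it assumes a HiveContext backed by a metastore whose Hive version understands {{STORED AS ORC}}:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// the table's ORC storage is handled entirely by the Hive SerDe / input format
hiveContext.hql("CREATE TABLE IF NOT EXISTS orc_events (id INT, name STRING) STORED AS ORC")

// table(...) returns a SchemaRDD over the ORC-backed table, no native reader needed
val events = hiveContext.table("orc_events")
events.count()
{code}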
[jira] [Resolved] (SPARK-3098) In some cases, the zipWithIndex operation gets wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3098. -- Resolution: Won't Fix In some cases, the zipWithIndex operation gets wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduction code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118960#comment-14118960 ] Marcelo Vanzin commented on SPARK-3215: --- For those who'd prefer to see some code, here's a proof-of-concept: https://github.com/vanzin/spark/tree/SPARK-3215/remote Please ignore the fact that it's a module inside Spark; I picked a different package name so that I didn't end up using any internal Spark APIs. I just wanted to avoid having to write build code. In particular, focus on this package (and *not* what's inside impl): https://github.com/vanzin/spark/tree/SPARK-3215/remote/src/main/scala/org/apache/spark_remote That's all a user would see; what happens inside impl does not matter to the user. If you really want to look at the implementation code, it's currently using akka and has very little error handling. Add remote interface for SparkContext - Key: SPARK-3215 URL: https://issues.apache.org/jira/browse/SPARK-3215 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Marcelo Vanzin Labels: hive Attachments: RemoteSparkContext.pdf A quick description of the issue: as part of running Hive jobs on top of Spark, it's desirable to have a SparkContext that is running in the background and listening for job requests for a particular user session. Running multiple contexts in the same JVM is not a very good solution. Not only SparkContext currently has issues sharing the same JVM among multiple instances, but that turns the JVM running the contexts into a huge bottleneck in the system. So I'm proposing a solution where we have a SparkContext that is running in a separate process, and listening for requests from the client application via some RPC interface (most probably Akka). I'll attach a document shortly with the current proposal. Let's use this bug to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118979#comment-14118979 ] Zhan Zhang commented on SPARK-2706: --- send out pull request https://github.com/apache/spark/pull/2241 Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Assignee: Zhan Zhang Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal = new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.579s] [INFO] Spark Project Core SUCCESS [2:39.805s] [INFO] Spark Project Bagel ... SUCCESS [21.148s] [INFO] Spark Project GraphX .. SUCCESS [59.950s] [INFO] Spark Project ML Library .. 
SUCCESS [1:08.771s] [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] [INFO] Spark Project Tools ... SUCCESS [15.405s] [INFO] Spark Project Catalyst SUCCESS [1:17.405s] [INFO] Spark Project SQL . SUCCESS [1:11.094s] [INFO] Spark Project Hive FAILURE [11.121s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . SKIPPED [INFO] Spark Project YARN Stable API .
[jira] [Created] (SPARK-3357) Internal log messages should be set at DEBUG level instead of INFO
Xiangrui Meng created SPARK-3357: Summary: Internal log messages should be set at DEBUG level instead of INFO Key: SPARK-3357 URL: https://issues.apache.org/jira/browse/SPARK-3357 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor spark-shell shows INFO by default, so we should carefully choose what to show at INFO level. For example, if I run {code} sc.parallelize(0 until 100).count() {code} and wait for one minute or so. I will see messages that mixed with the current input box, which is annoying: {code} scala 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0 14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0 14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from memory (free 278019440) 14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0 {code} Does a user need to know when a broadcast variable is removed? Maybe not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3357) Internal log messages should be set at DEBUG level instead of INFO
[ https://issues.apache.org/jira/browse/SPARK-3357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3357: - Description: spark-shell shows INFO by default, so we should carefully choose what to show at INFO level. For example, if I run {code} sc.parallelize(0 until 100).count() {code} and wait for one minute or so. I will see messages that mix with the current input box, which is annoying: {code} scala 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0 14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0 14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from memory (free 278019440) 14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0 {code} Does a user need to know when a broadcast variable is removed? Maybe not. was: spark-shell shows INFO by default, so we should carefully choose what to show at INFO level. For example, if I run {code} sc.parallelize(0 until 100).count() {code} and wait for one minute or so. I will see messages that mixed with the current input box, which is annoying: {code} scala 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0 14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0 14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from memory (free 278019440) 14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0 {code} Does a user need to know when a broadcast variable is removed? Maybe not. Internal log messages should be set at DEBUG level instead of INFO -- Key: SPARK-3357 URL: https://issues.apache.org/jira/browse/SPARK-3357 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor spark-shell shows INFO by default, so we should carefully choose what to show at INFO level. For example, if I run {code} sc.parallelize(0 until 100).count() {code} and wait for one minute or so. I will see messages that mix with the current input box, which is annoying: {code} scala 14/09/02 17:09:00 INFO BlockManager: Removing broadcast 0 14/09/02 17:09:00 INFO BlockManager: Removing block broadcast_0 14/09/02 17:09:00 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from memory (free 278019440) 14/09/02 17:09:00 INFO ContextCleaner: Cleaned broadcast 0 {code} Does a user need to know when a broadcast variable is removed? Maybe not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
[ https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119057#comment-14119057 ] Patrick Wendell commented on SPARK-3350: Can you provide the stacktrace? Strange anomaly trying to write a SchemaRDD into an Avro file - Key: SPARK-3350 URL: https://issues.apache.org/jira/browse/SPARK-3350 Project: Spark Issue Type: Bug Components: SQL Environment: jdk1.7, macosx Reporter: David Greco Attachments: AvroWriteTestCase.scala I found a way to automatically save a SchemaRDD in Avro format, similar to what Spark does with Parquet files. I attached a test case to this issue. The code fails with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file
[ https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3350: --- Component/s: (was: Input/Output) SQL Strange anomaly trying to write a SchemaRDD into an Avro file - Key: SPARK-3350 URL: https://issues.apache.org/jira/browse/SPARK-3350 Project: Spark Issue Type: Bug Components: SQL Environment: jdk1.7, macosx Reporter: David Greco Attachments: AvroWriteTestCase.scala I found a way to automatically save a SchemaRDD in Avro format, similar to what Spark does with Parquet files. I attached a test case to this issue. The code fails with an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3346) Error happened in using memcached
[ https://issues.apache.org/jira/browse/SPARK-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3346. Resolution: Invalid Would you mind reporting this to the user list? We use JIRA only for issues that have been diagnosed in more detail. Also, please include the source code on the user list; that will make it easier to debug. Error happened in using memcached - Key: SPARK-3346 URL: https://issues.apache.org/jira/browse/SPARK-3346 Project: Spark Issue Type: Bug Environment: CDH5.1.0 Spark1.0.0 Reporter: Gavin Zhang I finished a distributed project in Hadoop Streaming and it worked fine using memcached storage during mapping. Actually, it's a Python project. Now I want to move it to Spark. But when I called the memcached library, two errors were found during computation (both shown below): 1. File memcache.py, line 414, in get rkey, rlen = self._expectvalue(server) ValueError: too many values to unpack 2. File memcache.py, line 714, in check_key return key.translate(ill_map) TypeError: character mapping must return integer, None or unicode After adding exception handling, no cache value was retrieved successfully at all. However, it works in Hadoop Streaming without any error. Why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3322) ConnectionManager logs an error when the application ends
[ https://issues.apache.org/jira/browse/SPARK-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3322: --- Summary: ConnectionManager logs an error when the application ends (was: Log a ConnectionManager error when the application ends) ConnectionManager logs an error when the application ends - Key: SPARK-3322 URL: https://issues.apache.org/jira/browse/SPARK-3322 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: wangfei Although it does not influence the result, it always logs an error from ConnectionManager. Sometimes it only logs "ConnectionManagerId(vm2,40992) not found", and sometimes it also logs a CancelledKeyException. The log output is as follows: 14/08/29 16:54:53 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(vm2,40992) not found 14/08/29 16:54:53 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@457245f9 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
Josh Rosen created SPARK-3358: - Summary: PySpark worker fork()ing performance regression in m3.* / PVM instances Key: SPARK-3358 URL: https://issues.apache.org/jira/browse/SPARK-3358 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: m3.* instances on EC2 Reporter: Josh Rosen SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py. Unfortunately, this seems to have increased PySpark task launch latency when running on m3* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances: {code} import os for x in range(1000): if os.fork() == 0: exit() {code} On the r3.xlarge instance: {code} real0m0.761s user0m0.008s sys 0m0.144s {code} And on m3.xlarge: {code} real0m1.699s user0m0.012s sys 0m1.008s {code} I think this is due to HVM vs PVM EC2 instances using different virtualization technologies with different fork costs. It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3328. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 2228 [https://github.com/apache/spark/pull/2228] ./make-distribution.sh --with-tachyon build is broken - Key: SPARK-3328 URL: https://issues.apache.org/jira/browse/SPARK-3328 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Elijah Epifanov Fix For: 1.1.0 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3328) ./make-distribution.sh --with-tachyon build is broken
[ https://issues.apache.org/jira/browse/SPARK-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3328: --- Assignee: prudhvi krishna ./make-distribution.sh --with-tachyon build is broken - Key: SPARK-3328 URL: https://issues.apache.org/jira/browse/SPARK-3328 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0 Reporter: Elijah Epifanov Assignee: prudhvi krishna Fix For: 1.1.0 cp: tachyon-0.5.0/target/tachyon-0.5.0-jar-with-dependencies.jar: No such file or directory -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2978: --- Target Version/s: 1.2.0 Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
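For illustration only, the proposed semantics can be approximated with existing primitives: shuffle by key, then sort the grouped keys inside each partition. The sketch below assumes a SparkContext named sc and placeholder data, buffers each partition in memory, and uses none of the operator names suggested above (they are not existing APIs). {code} import org.apache.spark.SparkContext._
import org.apache.spark.HashPartitioner

// Approximation of the requested semantics: (Key, Iterable[Value]) groups,
// with keys in sorted order within each partition. Not an efficient
// implementation; a real operator would sort during the shuffle.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4)))
val grouped = pairs
  .groupByKey(new HashPartitioner(2))
  .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)
{code}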
[jira] [Commented] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119079#comment-14119079 ] Apache Spark commented on SPARK-2706: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/2241 Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Assignee: Zhan Zhang Attachments: hive.diff, spark-2706-v1.txt, spark-2706-v2.txt, spark-hive.err, v1.0.2.diff It seems Spark cannot work with Hive 0.13 well. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compiling Error: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ : org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal = new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.579s] [INFO] Spark Project Core SUCCESS [2:39.805s] [INFO] Spark Project Bagel ... SUCCESS [21.148s] [INFO] Spark Project GraphX .. 
SUCCESS [59.950s] [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] [INFO] Spark Project Tools ... SUCCESS [15.405s] [INFO] Spark Project Catalyst SUCCESS [1:17.405s] [INFO] Spark Project SQL . SUCCESS [1:11.094s] [INFO] Spark Project Hive FAILURE [11.121s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . SKIPPED [INFO] Spark Project
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119080#comment-14119080 ] Sandy Ryza commented on SPARK-2978: --- What's the thinking behind adding sortWithinPartition? It shouldn't be difficult to add, but I can't think of a situation where it would be useful without a repartition before. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119084#comment-14119084 ] Reynold Xin commented on SPARK-2978: It was just asked multiple times by various users. I think the use case is to provide a robust external sort implementation without exposing the ExternalSorter API. It doesn't need to be part of this change. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
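For reference, a minimal in-memory sketch of the sortWithinPartition idea mentioned here; the name comes from the discussion and is not an existing API, and a robust version would spill to disk via an external sorter rather than buffering each partition. {code} import org.apache.spark.rdd.RDD

// Illustrative only: sort key-value pairs inside each existing partition
// without triggering a shuffle.
def sortWithinPartition[K: Ordering, V](rdd: RDD[(K, V)]): RDD[(K, V)] =
  rdd.mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator, preservesPartitioning = true)
{code}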
[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
[ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119089#comment-14119089 ] Josh Rosen commented on SPARK-3358: --- Credit where it's due: Davies pointed out the potential for this problem in the original PR: https://github.com/apache/spark/pull/1680#issuecomment-50721351 The Redis team did their own benchmarking on this (http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure (or https://web.archive.org/web/20140529181436/http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure, since their site may be down / slow right now)). Based on those results, and updated numbers at http://redislabs.com/blog/benchmarking-the-new-aws-m3-instances-with-redis, it looks like HVM AMIs don't have this problem. I'm going to try running a similar microbenchmark on m3.xlarge with the spark-ec2 HVM AMI to see if that improves performance. If so, we should consider changing from PVM to HVM for those instance types. PySpark worker fork()ing performance regression in m3.* / PVM instances --- Key: SPARK-3358 URL: https://issues.apache.org/jira/browse/SPARK-3358 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: m3.* instances on EC2 Reporter: Josh Rosen SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py. Unfortunately, this seems to have increased PySpark task launch latency when running on m3* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances: {code} import os for x in range(1000): if os.fork() == 0: exit() {code} On the r3.xlarge instance: {code} real 0m0.761s user 0m0.008s sys 0m0.144s {code} And on m3.xlarge: {code} real0m1.699s user0m0.012s sys 0m1.008s {code} I think this is due to HVM vs PVM EC2 instances using different virtualization technologies with different fork costs. It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation
[ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119091#comment-14119091 ] Sandy Ryza commented on SPARK-2978: --- Ah ok, sounds good. Provide an MR-style shuffle transformation -- Key: SPARK-2978 URL: https://issues.apache.org/jira/browse/SPARK-2978 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Sandy Ryza For Hive on Spark joins in particular, and for running legacy MR code in general, I think it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle, i.e. one that * groups by key: provides (Key, Iterator[Value]) * within each partition, provides keys in sorted order A couple ways that could make sense to expose this: * Add a new operator. groupAndSortByKey, groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe? * Allow groupByKey to take an ordering param for keys within a partition -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org