[jira] [Created] (SPARK-21894) Some Netty errors do not propagate to the top level driver

2017-09-01 Thread Charles Allen (JIRA)
Charles Allen created SPARK-21894:
-

 Summary: Some Netty errors do not propagate to the top level driver
 Key: SPARK-21894
 URL: https://issues.apache.org/jira/browse/SPARK-21894
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Charles Allen


We have an environment with Netty 4.1 (see 
https://issues.apache.org/jira/browse/SPARK-19552 for context) and the 
following error occurs. This separate issue is being filed because the error 
leaves the Spark workload in a bad state where it makes no progress and does 
not shut down.

The expected behavior is that the Spark job would throw an exception that can 
be caught by the driving application.
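
For context on the {{AbstractMethodError}} in the trace below: Netty 4.1 added {{ReferenceCounted.touch(Object)}}, which {{AbstractReferenceCounted}} leaves to subclasses. A minimal, hypothetical sketch of the failure mode (class and field names are assumptions, not Spark code) follows; a message class built against the Netty 4.0 contract has no {{touch(Object)}}, so the 4.1 write path fails with an error the driving application never sees.

{code}
// Hypothetical illustration (not from the Spark code base): a message class
// written against Netty 4.0's AbstractReferenceCounted has no touch(Object)
// implementation, so ReferenceCountUtil.touch() in a Netty 4.1 pipeline throws
// AbstractMethodError inside the event loop instead of a catchable exception.
import io.netty.buffer.ByteBuf;
import io.netty.util.AbstractReferenceCounted;
import io.netty.util.ReferenceCounted;

class LegacyOutboundMessage extends AbstractReferenceCounted {
  private final ByteBuf body;

  LegacyOutboundMessage(ByteBuf body) {
    this.body = body;
  }

  @Override
  protected void deallocate() {
    body.release();
  }

  // This override is what a 4.0-era class is missing; adding it (or recompiling
  // against Netty 4.1) removes the AbstractMethodError on the write path.
  @Override
  public ReferenceCounted touch(Object hint) {
    body.touch(hint);
    return this;
  }
}
{code}

Because the error surfaces inside the Netty event loop's write path rather than on the task or driver call path, nothing rethrows it to the application, which matches the hang described above.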

{code}
2017-09-01T16:13:32,175 ERROR [shuffle-server-3-2] org.apache.spark.network.server.TransportRequestHandler - Error sending result StreamResponse{streamId=/jars/lz4-1.3.0.jar, byteCount=236880, body=FileSegmentManagedBuffer{file=/Users/charlesallen/.m2/repository/net/jpountz/lz4/lz4/1.3.0/lz4-1.3.0.jar, offset=0, length=236880}} to /192.168.59.3:56703; closing connection
java.lang.AbstractMethodError
at io.netty.util.ReferenceCountUtil.touch(ReferenceCountUtil.java:73) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:810) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:111) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:305) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:801) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:814) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:794) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:831) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1032) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:296) ~[netty-all-4.1.11.Final.jar:4.1.11.Final]
at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) [spark-network-common_2.11-2.1.0-mmx9.jar:2.1.0-mmx9]
at org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:150) [spark-network-common_2.11-2.1.0-mmx9.jar:2.1.0-mmx9]
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) [spark-network-common_2.11-2.1.0-mmx9.jar:2.1.0-mmx9]
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) [spark-network-common_2.11-2.1.0-mmx9.jar:2.1.0-mmx9]
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) [spark-network-common_2.11-2.1.0-mmx9.jar:2.1.0-mmx9]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-all-4.1.11.Final.jar:4.1.11.Final]
at 

[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-08-21 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135819#comment-16135819
 ] 

Charles Allen commented on SPARK-19552:
---

[~aash] Do you have a link to an Apache Arrow issue on this?

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other big fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}
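
As a rough illustration of the "touch, retain and transferred" requirement listed above, here is a hedged sketch (assumed class and field names, not the actual Spark patch) of the overrides a Netty 4.1 {{FileRegion}}-style wrapper needs:

{code}
// Hypothetical wrapper showing the methods Netty 4.1 requires: transferred()
// replaces the deprecated transfered(), and retain()/touch() must be overridden
// with covariant return types. Remaining FileRegion methods are left abstract.
import io.netty.channel.FileRegion;
import io.netty.util.AbstractReferenceCounted;

abstract class DelegatingFileRegion extends AbstractReferenceCounted implements FileRegion {
  private final FileRegion delegate;   // assumed underlying region

  DelegatingFileRegion(FileRegion delegate) {
    this.delegate = delegate;
  }

  @Override
  public long transferred() {
    // New accessor in Netty 4.1; 4.0-era code only implemented transfered().
    return delegate.transferred();
  }

  @Override
  public DelegatingFileRegion retain() {
    super.retain();
    return this;
  }

  @Override
  public DelegatingFileRegion retain(int increment) {
    super.retain(increment);
    return this;
  }

  @Override
  public DelegatingFileRegion touch() {
    return touch(null);
  }

  @Override
  public DelegatingFileRegion touch(Object hint) {
    delegate.touch(hint);
    return this;
  }

  @Override
  protected void deallocate() {
    delegate.release();
  }
}
{code}

The actual change in Spark's network-common module may differ; this only shows the shape of the 4.0-to-4.1 API break described above.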






[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-06-20 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056601#comment-16056601
 ] 

Charles Allen commented on SPARK-19111:
---

[~ste...@apache.org] It makes it unstable because the history server chokes on 
such large files. The task finishes, though. We didn't debug what it is about 
the large files that makes the history server choke, but if I recall correctly 
the history files were on the order of tens of GB, so it wouldn't shock me if 
the history server didn't handle them efficiently.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO 

[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-06-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044925#comment-16044925
 ] 

Charles Allen commented on SPARK-19552:
---

This is starting to show problems on our side due to library issues: 
https://github.com/druid-io/druid/issues/4390

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other big fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}






[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-04-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954212#comment-15954212
 ] 

Charles Allen commented on SPARK-4899:
--

It was discussed on the mailing list with [~timchen] that checkpointing might 
just need a timeout setting available to the other schedulers.

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1
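
For reference, both features are set on the Mesos {{FrameworkInfo}} that a framework registers with. A hedged sketch using the Mesos Java bindings follows; the framework name, role, and timeout value are made-up examples, not Spark's configuration:

{code}
// Hedged sketch with the Mesos protobuf Java API: the "role" field covers the
// ACL/priority feature, while "checkpoint" plus "failover_timeout" cover
// recovering work across a slave restart, as described in the issue above.
import org.apache.mesos.Protos;

final class SparkFrameworkInfoSketch {
  private SparkFrameworkInfoSketch() {}

  static Protos.FrameworkInfo build() {
    return Protos.FrameworkInfo.newBuilder()
        .setUser("")                   // empty lets Mesos pick the current user
        .setName("spark-on-mesos")     // hypothetical framework name
        .setRole("spark")              // resource role used for ACLs/priorities
        .setCheckpoint(true)           // allow tasks to survive a slave restart
        .setFailoverTimeout(120.0)     // seconds the master keeps the framework on failover
        .build();
  }
}
{code}

The "timeout setting" mentioned in the comment above presumably corresponds to the failover/recovery timeout that accompanies checkpointing.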






[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints

2017-04-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954170#comment-15954170
 ] 

Charles Allen commented on SPARK-4899:
--

{{org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver}}
 seems to allow checkpointing, but only 
{{org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler}} uses it. 
Neither the fine-grained nor the coarse-grained scheduler uses it; is there a 
reason for that?

> Support Mesos features: roles and checkpoints
> -
>
> Key: SPARK-4899
> URL: https://issues.apache.org/jira/browse/SPARK-4899
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Andrew Ash
>
> Inspired by https://github.com/apache/spark/pull/60
> Mesos has two features that would be nice for Spark to take advantage of:
> 1. Roles -- a way to specify ACLs and priorities for users
> 2. Checkpoints -- a way to restart a failed Mesos slave without losing all 
> the work that was happening on the box
> Some of these may require a Mesos upgrade past our current 0.18.1






[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-24 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883331#comment-15883331
 ] 

Charles Allen commented on SPARK-19698:
---

[~mridulm80] is there documentation somewhere that describes output-commit 
best practices? I see a number of things that seem to use either the Hadoop MR 
output committer or some Spark-specific output-committing code, but it is not 
clear when each should be used.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959
 ] 

Charles Allen edited comment on SPARK-19698 at 2/22/17 6:46 PM:


I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-atomic or critical command region, please 
let me finish" 


was (Author: drcrallen):
I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-idempotent command region, please let me 
finish" 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959
 ] 

Charles Allen commented on SPARK-19698:
---

I *think* this is due to the driver not having the concept of a "critical 
section" for code being executed, meaning that you can't declare a portion of 
the code being run as "I'm in a non-idempotent command region, please let me 
finish" 

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Updated] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-22 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19698:
--
Summary: Race condition in stale attempt task completion vs current attempt 
task completion when task is doing persistent state changes  (was: Race 
condition in stale attempt task completion vs current attempt task completion)

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion

2017-02-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878927#comment-15878927
 ] 

Charles Allen commented on SPARK-19698:
---

[~jisookim0...@gmail.com] has been investigating this on our side.

> Race condition in stale attempt task completion vs current attempt task 
> completion
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: a upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Created] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion

2017-02-22 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19698:
-

 Summary: Race condition in stale attempt task completion vs 
current attempt task completion
 Key: SPARK-19698
 URL: https://issues.apache.org/jira/browse/SPARK-19698
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen


We have encountered a strange scenario in our production environment. Below is 
the best guess we have right now as to what's going on.

Potentially, the final stage of a job has a failure in one of the tasks (such 
as OOME on the executor) which can cause tasks for that stage to be relaunched 
in a second attempt.

https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155

keeps track of which tasks have been completed, but does NOT keep track of 
which attempt those tasks were completed in. As such, we have encountered a 
scenario where a particular task gets executed twice in different stage 
attempts, and the DAGScheduler does not consider if the second attempt is still 
running. This means if the first task attempt succeeded, the second attempt can 
be cancelled part-way through its run cycle if all other tasks (including the 
prior failed) are completed successfully.

What this means is that if a task is manipulating some state somewhere (for 
example: an upload-to-temporary-file-location, then delete-then-move on an 
underlying s3n storage implementation) the driver can improperly shut down the 
running (2nd attempt) task between state manipulations, leaving the persistent 
state in a bad state, since the 2nd attempt never got to complete its 
manipulations and was terminated prematurely at some arbitrary point in its 
state change logic (ex: finished the delete but not the move).

This is using the Mesos coarse-grained executor. It is unclear if this behavior 
is limited to the Mesos coarse-grained executor or not.
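
To make the race concrete, here is a simplified, hypothetical sketch (names and structure are assumptions, not the actual DAGScheduler code) of attempt-agnostic completion bookkeeping, where a late success from a stale attempt can finish the stage while the current attempt's task is still mid-write:

{code}
// Hypothetical illustration of the bookkeeping gap described above: completed
// partitions are tracked without the stage attempt that produced them, so a late
// success from a stale attempt can mark the stage done and trigger cancellation
// of a still-running task from the current attempt.
import java.util.HashSet;
import java.util.Set;

public class StaleAttemptRaceSketch {
    private final Set<Integer> pendingPartitions = new HashSet<>();
    private final int currentStageAttempt;

    StaleAttemptRaceSketch(int currentStageAttempt, int numPartitions) {
        this.currentStageAttempt = currentStageAttempt;
        for (int p = 0; p < numPartitions; p++) {
            pendingPartitions.add(p);
        }
    }

    // Note: taskStageAttempt is never compared to currentStageAttempt, so
    // completions from older attempts count toward finishing the stage.
    void onTaskSuccess(int partition, int taskStageAttempt) {
        pendingPartitions.remove(partition);
        if (pendingPartitions.isEmpty()) {
            System.out.println("stage complete; cancelling remaining attempt-"
                + currentStageAttempt + " tasks, possibly mid state change");
        }
    }

    public static void main(String[] args) {
        StaleAttemptRaceSketch stage = new StaleAttemptRaceSketch(1, 3);
        stage.onTaskSuccess(0, 1);
        stage.onTaskSuccess(1, 1);
        stage.onTaskSuccess(2, 0); // late success arriving from the stale attempt 0
    }
}
{code}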






[jira] [Commented] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855102#comment-15855102
 ] 

Charles Allen commented on SPARK-19479:
---

[~mgummelt] that's actually a really good suggestion. Somehow I never got 
subscribed to the dev list.

> Spark Mesos artifact split causes spark-core dependency to not pull in mesos 
> impl
> -
>
> Key: SPARK-19479
> URL: https://issues.apache.org/jira/browse/SPARK-19479
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Allen
>
> https://github.com/apache/spark/pull/14637 ( 
> https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
> into its own artifact, but the release notes do not call this out. This broke 
> our deployments because we depend on packaging with spark-core, which no 
> longer had any mesos awareness. 






[jira] [Created] (SPARK-19479) Spark Mesos artifact split causes spark-core dependency to not pull in mesos impl

2017-02-06 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19479:
-

 Summary: Spark Mesos artifact split causes spark-core dependency 
to not pull in mesos impl
 Key: SPARK-19479
 URL: https://issues.apache.org/jira/browse/SPARK-19479
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core
Affects Versions: 2.1.0
Reporter: Charles Allen


https://github.com/apache/spark/pull/14637 ( 
https://issues.apache.org/jira/browse/SPARK-16967 ) forked off the mesos impl 
into its own artifact, but the release notes do not call this out. This broke 
our deployments because we depend on packaging with spark-core, which no longer 
had any Mesos awareness.






[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)

2017-02-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849433#comment-15849433
 ] 

Charles Allen commented on SPARK-16333:
---

We put in a fix for this in our local branch by (optionally) disabling a whole 
bunch of extra metrics that were added recently.

> Excessive Spark history event/json data size (5GB each)
> ---
>
> Key: SPARK-16333
> URL: https://issues.apache.org/jira/browse/SPARK-16333
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) 
> and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server 
> release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build)
>Reporter: Peter Liu
>  Labels: performance, spark2.0.0
>
> With Spark2.0.0-preview (May-24 build), the history event data (the json 
> file), that is generated for each Spark application run (see below), can be 
> as big as 5GB (instead of 14 MB for exactly the same application run and the 
> same input data of 1TB under Spark1.6.1)
> -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959-
> -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856-
> -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556-
> The test is done with Sparkbench V2, SQL RDD (see github: 
> https://github.com/SparkTC/spark-bench)






[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-26 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840419#comment-15840419
 ] 

Charles Allen commented on SPARK-19111:
---

While switching to s3a helped the logs upload, it made the spark history server 
unusable, which is probably another bug.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> 

[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-26 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840418#comment-15840418
 ] 

Charles Allen commented on SPARK-19111:
---

We have a patch https://github.com/apache/spark/pull/16714 for SPARK-16333 
which fixes the problem on our side by disabling the verbose new metrics.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 

[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812304#comment-15812304
 ] 

Charles Allen commented on SPARK-19111:
---

That's great information, thank you [~ste...@apache.org]

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for 

[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812301#comment-15812301
 ] 

Charles Allen commented on SPARK-19111:
---

I was also going to have the folks here look at the closing sequence to see why 
the Spark executor lifecycle wasn't waiting for the close to complete, or why it 
wasn't reporting the error. https://issues.apache.org/jira/browse/SPARK-12330 was 
filed previously as an "exits too fast" bug.
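
A hedged sketch of the closing-sequence concern (hypothetical helper, not Spark's shutdown code): with {{NativeS3FileSystem}} the actual S3 upload happens inside {{close()}}, so a shutdown path that swallows close errors, or does not wait for close to return, loses the history file silently.

{code}
// Hypothetical illustration of the failure mode under discussion.
import java.io.IOException;
import java.io.OutputStream;

final class EventLogShutdown {
    private EventLogShutdown() {}

    // Problematic pattern: the exception from the upload-on-close is dropped
    // and the process exits as if the upload succeeded.
    static void closeQuietly(OutputStream eventLog) {
        try {
            eventLog.close();
        } catch (IOException ignored) {
            // silently dropped
        }
    }

    // The behavior the comment above asks for: block until close() returns and
    // surface any failure to the caller instead of exiting silently.
    static void closeAndReport(OutputStream eventLog) throws IOException {
        eventLog.close();
    }
}
{code}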


[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15812286#comment-15812286
 ] 

Charles Allen commented on SPARK-19111:
---

If you prefer "re-open on more data" then I can definitely accommodate that.


[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15809837#comment-15809837
 ] 

Charles Allen commented on SPARK-19111:
---

So to clarify: I agree this ticket is not actionable as is, but I have not seen 
this effect recorded in another ticket (is there one?), so I propose leaving this 
open until more information is available, either from our side or from others who 
report a similar problem.


[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-07 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807603#comment-15807603
 ] 

Charles Allen commented on SPARK-19111:
---

I have not been able to finish the root-cause analysis, but I know the upload 
works for every job except our largest Spark job, and it consistently fails for 
that one.


[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails if too large

2017-01-06 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19111:
--
Summary: S3 Mesos history upload fails if too large  (was: S3 Mesos history 
upload fails if too large or if distributed datastore is misbehaving)


[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-06 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19111:
--
Summary: S3 Mesos history upload fails silently if too large  (was: S3 
Mesos history upload fails if too large)


[jira] [Created] (SPARK-19111) S3 Mesos history upload fails if too large or if distributed datastore is misbehaving

2017-01-06 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19111:
-

 Summary: S3 Mesos history upload fails if too large or if 
distributed datastore is misbehaving
 Key: SPARK-19111
 URL: https://issues.apache.org/jira/browse/SPARK-19111
 Project: Spark
  Issue Type: Bug
  Components: EC2, Mesos, Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen


{code}
2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped Spark 
web UI at http://REDACTED:4041
2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.jvmGCTime
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.localBlocksFetched
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.resultSerializationTime
2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(
364,WrappedArray())
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.resultSize
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.peakExecutionMemory
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.fetchWaitTime
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.memoryBytesSpilled
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.remoteBytesRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.diskBytesSpilled
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.localBytesRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.recordsRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.executorDeserializeTime
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.executorRunTime
2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.remoteBlocksFetched
2017-01-06T21:32:32,943 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
closed. Now beginning upload
2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
{code}

Running Spark on Mesos, some large jobs fail to upload to the history server 
storage!

A successful sequence of events in the log that yields an upload is as follows:

{code}
2017-01-06T19:14:32,925 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
2017-01-06T21:59:14,789 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
closed. Now beginning upload
2017-01-06T21:59:44,679 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' upload 
complete
{code}

But large jobs never reach the {{upload complete}} log message, and instead exit 
before the upload completes.
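For reference, a minimal sketch of the configuration involved; the bucket name and credentials are placeholders, not values from this job, and in the real deployment the master is supplied by spark-submit/Mesos. The upload described above only begins when the event-log stream is closed inside {{sc.stop()}}.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("history-upload-example")
  .setIfMissing("spark.master", "local[2]") // master normally comes from spark-submit
  .set("spark.eventLog.enabled", "true")
  // Event logs written through NativeS3FileSystem (s3n); placeholder bucket.
  .set("spark.eventLog.dir", "s3n://example-bucket/eventLogs/remnant")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", "PLACEHOLDER")
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", "PLACEHOLDER")

val sc = new SparkContext(conf)
try {
  sc.parallelize(1 to 1000000).map(_ * 2).count()
} finally {
  // The event log is closed here; the S3 upload only begins at close time,
  // which is the window where large histories were observed to go missing.
  sc.stop()
}
{code}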



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Created] (SPARK-18600) BZ2 CRC read error needs better reporting

2016-11-27 Thread Charles Allen (JIRA)
Charles Allen created SPARK-18600:
-

 Summary: BZ2 CRC read error needs better reporting
 Key: SPARK-18600
 URL: https://issues.apache.org/jira/browse/SPARK-18600
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Charles Allen




{code}
16/11/25 20:05:03 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 148 in 
stage 5.0 failed 1 times, most recent failure: Lost task 148.0 in stage 5.0 
(TID 5945, localhost): org.apache.spark.SparkException: Task failed while 
writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
Header extraction enabled=null
Headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, OPR_HR, 
NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, 
PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP]
Ignore leading whitespaces=false
Ignore trailing whitespaces=false
Input buffer size=128
Input reading on separate thread=false
Keep escape sequences=false
Line separator detection enabled=false
Maximum number of characters per column=100
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Row processor=none
RowProcessor error handler=null
Selected fields=none
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=\0
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=27089, column=13, record=27089, 
charIndex=4451456, headers=[INTERVALSTARTTIME_GMT, INTERVALENDTIME_GMT, OPR_DT, 
OPR_HR, NODE_ID_XML, NODE_ID, NODE, MARKET_RUN_ID, LMP_TYPE, XML_DATA_ITEM, 
PNODE_RESMRID, GRP_TYPE, POS, VALUE, OPR_INTERVAL, GROUP]
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
at 
com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
at 
org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 

[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-09-21 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511767#comment-15511767
 ] 

Charles Allen commented on SPARK-6305:
--

Just FYI, as I found out recently, Kafka (at least 0.8.x) requires log4j on the 
classpath 
(http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccaa7ooca0+3sltognxaxwofysedkysfyqt0hs_a6r3jy...@mail.gmail.com%3E
 is the only other reference to this problem I could find). But the slf4j-log4j12 
bridge can at least be removed.
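For example, in sbt the bridge can be excluded while keeping log4j itself on the classpath; the coordinates and versions here are illustrative, not taken from this ticket.

{code:title=build.sbt}
// Keep log4j for Kafka 0.8.x, but drop the slf4j-log4j12 bridge so a
// different slf4j backend can be used. Versions are examples only.
libraryDependencies ++= Seq(
  "org.apache.kafka" %% "kafka" % "0.8.2.2" exclude("org.slf4j", "slf4j-log4j12"),
  "log4j" % "log4j" % "1.2.17"
)
{code}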

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-09-17 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499471#comment-15499471
 ] 

Charles Allen commented on SPARK-13640:
---

These failed on first attempt:
{code}
org.apache.spark.sql.catalyst.ScalaReflectionSuite.SPARK-13640: thread safety 
of constructorFor
org.apache.spark.sql.catalyst.ScalaReflectionSuite.SPARK-13640: thread safety 
of extractorsFor
org.apache.spark.sql.catalyst.ScalaReflectionSuite.SPARK-13640: thread safety 
of schemaFor
{code}


Second attempt:
{code}
org.apache.spark.sql.catalyst.ScalaReflectionSuite.SPARK-13640: thread safety 
of dataTypeFor
org.apache.spark.sql.catalyst.ScalaReflectionSuite.SPARK-13640: thread safety 
of extractorsFor
{code}

> Synchronize ScalaReflection.mirror method.
> --
>
> Key: SPARK-13640
> URL: https://issues.apache.org/jira/browse/SPARK-13640
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.0.0
>
>
> {{ScalaReflection.mirror}} method should be synchronized when scala version 
> is 2.10 because {{universe.runtimeMirror}} is not thread safe.
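
A rough sketch of the kind of guard the description above calls for (not the actual Spark patch): serialize access to {{runtimeMirror}} behind a lock.

{code}
import scala.reflect.runtime.{universe => ru}

object MirrorHolder {
  private val lock = new Object

  // universe.runtimeMirror is not thread safe on Scala 2.10, so all callers
  // go through this synchronized accessor.
  def mirror: ru.Mirror = lock.synchronized {
    ru.runtimeMirror(Thread.currentThread().getContextClassLoader)
  }
}
{code}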



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13640) Synchronize ScalaReflection.mirror method.

2016-09-17 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15499457#comment-15499457
 ] 

Charles Allen commented on SPARK-13640:
---

I keep getting Scala 2.10 test failures when running the tests for this patch.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-08-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421478#comment-15421478
 ] 

Charles Allen commented on SPARK-11714:
---

Awesome! Thanks guys!

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>Assignee: Stavros Kontopoulos
> Fix For: 2.1.0
>
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource in Mesos offers. The request is that the ports the executor 
> binds to stay within the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16952) [MESOS] MesosCoarseGrainedSchedulerBackend requires spark.mesos.executor.home even if spark.executor.uri is set

2016-08-08 Thread Charles Allen (JIRA)
Charles Allen created SPARK-16952:
-

 Summary: [MESOS] MesosCoarseGrainedSchedulerBackend requires 
spark.mesos.executor.home even if spark.executor.uri is set
 Key: SPARK-16952
 URL: https://issues.apache.org/jira/browse/SPARK-16952
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Scheduler
Affects Versions: 2.0.0, 1.6.1, 1.6.0, 1.5.2
Reporter: Charles Allen
Priority: Minor


In the Mesos coarse-grained scheduler, setting `spark.executor.uri` bypasses the 
code path that requires `spark.mesos.executor.home`, since the URI effectively 
provides the executor home.

But 
`org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend#createCommand`
 requires `spark.mesos.executor.home` to be set regardless.

Our workaround is to set `spark.mesos.executor.home=/dev/null` when using an 
executor URI.
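
In SparkConf form, the workaround looks roughly like this; the master URL and executor tarball location are placeholders, the two property names come from the description above.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mesos-executor-uri-example")
  .setMaster("mesos://zk://zk.example.com:2181/mesos") // placeholder master
  // The executor tarball effectively provides the executor home...
  .set("spark.executor.uri", "https://example.com/spark-2.0.0-bin-hadoop2.7.tgz")
  // ...but createCommand still insists on spark.mesos.executor.home, so point
  // it somewhere harmless.
  .set("spark.mesos.executor.home", "/dev/null")
{code}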



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412046#comment-15412046
 ] 

Charles Allen commented on SPARK-16798:
---

I have a much better automated packaging and deployment system up now, which 
gives much stricter guarantees over binary delivery (i.e., no possibility for 
Charles to fat-finger something), and this error is no longer showing up on stock 
Spark. So I consider this closed: it was a side effect of some oddity in the 
packaging, or an unexpected second-order effect of patching 
org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-04 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407877#comment-15407877
 ] 

Charles Allen commented on SPARK-16798:
---

I reproduced it with stock spark. I'm working on getting a tarball attached to 
this ticket which reproduces the error reliably.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-04 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407877#comment-15407877
 ] 

Charles Allen edited comment on SPARK-16798 at 8/4/16 2:49 PM:
---

I reproduced it with stock spark. I'm working on getting a tarball attached to 
this ticket which reproduces the error reliably using stock spark.


was (Author: drcrallen):
I reproduced it with stock spark. I'm working on getting a tarball attached to 
this ticket which reproduces the error reliably.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967
 ] 

Charles Allen edited comment on SPARK-16798 at 8/4/16 1:41 AM:
---

[~srowen] I manually went in with IntelliJ debugging and can confirm that the 
driver DOES have a valid positive integer value for numPartitions when in the 
DRIVER. But when running in the mesos executor or in local[4], the task has a 
value of 0 consistently.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments added

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}


was (Author: drcrallen):
[~srowen] I manually went in with IntelliJ debugging and can confirm that the 
driver DOES have a valid positive integer value for numPartitions when in the 
DRIVER. But when running in the mesos executor or in local[4], the task has a 
value of 0 consistently.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments addedl

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}


[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-03 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406967#comment-15406967
 ] 

Charles Allen commented on SPARK-16798:
---

[~srowen] I manually went in with IntelliJ debugging and can confirm that the 
driver DOES have a valid positive integer value for numPartitions when in the 
DRIVER. But when running in the mesos executor or in local[4], the task has a 
value of 0 consistently.

I have been able to reproduce this with TPCH data, so I can share it around if 
you can point me to someone who can help debug what might have changed.

The code snippet below is from RDD.scala, with my own comments addedl

{code:title=RDD.scala}
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") // Correct on DRIVER
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions) // numPartitions == 0 in TASK
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
{code}
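
A minimal sketch of the call pattern that exercises this path; whether it actually fails depends on the environment (the ticket was later attributed to a packaging oddity), and the data sizes and partition counts here are arbitrary.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-check").setMaster("local[4]"))

// coalesce(shuffle = true) builds the distributePartition closure shown above;
// in the failing environments numPartitions arrived in the task as 0, so
// Random.nextInt(0) threw "bound must be positive".
val n = sc.parallelize(1 to 1000000, 200)
  .coalesce(10, shuffle = true)
  .count()

println(s"counted $n records")
sc.stop()
{code}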




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403116#comment-15403116
 ] 

Charles Allen commented on SPARK-16798:
---

Yep, still happens:

{code}
16/08/02 00:41:17 INFO HadoopRDD: Input split: REDACTED.gz:0+7389144
16/08/02 00:41:17 INFO TorrentBroadcast: Started reading broadcast variable 0
16/08/02 00:41:17 INFO TransportClientFactory: Successfully created connection 
to /<> after 1 ms (0 ms spent in bootstraps)
16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 18.2 KB, free 3.6 GB)
16/08/02 00:41:17 INFO TorrentBroadcast: Reading broadcast variable 0 took 34 ms
16/08/02 00:41:17 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 209.2 KB, free 3.6 GB)
16/08/02 00:41:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
16/08/02 00:41:18 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
16/08/02 00:41:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
16/08/02 00:41:18 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
16/08/02 00:41:18 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
16/08/02 00:41:18 INFO NativeS3FileSystem: Opening 'REDACTED.gz' for reading
16/08/02 00:41:18 INFO CodecPool: Got brand-new decompressor [.gz]
16/08/02 00:41:19 ERROR Executor: Exception in task 11.0 in stage 0.0 (TID 11)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676
 ] 

Charles Allen edited comment on SPARK-16798 at 8/1/16 7:30 PM:
---

Minor update. Due to library collisions I have to change around how some of the 
tagging works internally. I'm cutting an internal-only (MMX) release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. Didn't want to mess with it over the weekend so new build is 
making its way through now.


was (Author: drcrallen):
Minor update. Due to library collisions I have to change around how some of the 
tagging works internally. I'm cutting an internal-only release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. Didn't want to mess with it over the weekend so new build is 
making its way through now.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-08-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402676#comment-15402676
 ] 

Charles Allen commented on SPARK-16798:
---

Minor update. Due to library collisions I have to change around how some of the 
tagging works internally. I'm cutting an internal-only release of 
https://github.com/metamx/spark/commit/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f 
with a new tag/version to prevent potential version conflicts in our 
infrastructure. Didn't want to mess with it over the weekend so new build is 
making its way through now.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400089#comment-15400089
 ] 

Charles Allen edited comment on SPARK-16798 at 7/29/16 10:06 PM:
-

Adding some more flavor: this is running in Mesos coarse mode against Mesos 0.28.2.

If I take a subset of the data that failed and run it locally (local[4] or 
local[1]), it succeeds, which is annoying.

Here are the INFO logs from the failing tasks:

{code}
16/07/29 18:19:20 INFO HadoopRDD: Input split: REDACTED1:0+163064
16/07/29 18:19:20 INFO TorrentBroadcast: Started reading broadcast variable 0
16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 18.2 KB, free 3.6 GB)
16/07/29 18:19:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 10 ms
16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 209.2 KB, free 3.6 GB)
16/07/29 18:19:20 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
16/07/29 18:19:20 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
16/07/29 18:19:20 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
16/07/29 18:19:20 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
16/07/29 18:19:20 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED1' for reading
16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz]
16/07/29 18:19:21 ERROR Executor: Exception in task 9.0 in stage 0.0 (TID 9)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 14
16/07/29 18:19:21 INFO Executor: Running task 14.0 in stage 0.0 (TID 14)
16/07/29 18:19:21 INFO HadoopRDD: Input split: REDACTED2:0+157816
16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED2' for reading
16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz]
16/07/29 18:19:21 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 15
16/07/29 18:19:21 INFO Executor: Running task 9.1 in stage 0.0 (TID 15)
{code}


was (Author: drcrallen):
Adding some more flavor, this is running in Mesos coarse mode against 

[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400089#comment-15400089
 ] 

Charles Allen commented on SPARK-16798:
---

Adding some more flavor: this is running in Mesos coarse mode against Mesos 0.28.2.

If I take a subset of the data that failed and run it locally (local[4] or 
local[1]), it succeeds, which is annoying.

Here are the INFO logs from the failing tasks:

{code}
16/07/29 18:19:20 INFO HadoopRDD: Input split: REDACTED1.gz:0+163064
16/07/29 18:19:20 INFO TorrentBroadcast: Started reading broadcast variable 0
16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 18.2 KB, free 3.6 GB)
16/07/29 18:19:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 10 ms
16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 209.2 KB, free 3.6 GB)
16/07/29 18:19:20 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
16/07/29 18:19:20 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
16/07/29 18:19:20 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
16/07/29 18:19:20 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
16/07/29 18:19:20 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED1' for reading
16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz]
16/07/29 18:19:21 ERROR Executor: Exception in task 9.0 in stage 0.0 (TID 9)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 14
16/07/29 18:19:21 INFO Executor: Running task 14.0 in stage 0.0 (TID 14)
16/07/29 18:19:21 INFO HadoopRDD: Input split: REDACTED2:0+157816
16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED2' for reading
16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz]
16/07/29 18:19:21 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14)
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 15
16/07/29 18:19:21 INFO Executor: Running task 9.1 in stage 0.0 (TID 15)
{code}

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 

[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400070#comment-15400070
 ] 

Charles Allen commented on SPARK-16798:
---

I am definitely running a *modified* 2.0.0, but the modifications are in the 
scheduler, not the RDD paths.

Right now I'm running 1.5.2_2.11 through the deployment system to get as close 
to apples-to-apples as I can (and so that workflows can be swapped between the 
two ad hoc).

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400071#comment-15400071
 ] 

Charles Allen commented on SPARK-16798:
---

I'll run 2.0.0 stock as another test that will go out during this push.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400014#comment-15400014
 ] 

Charles Allen commented on SPARK-16798:
---

The super odd thing here is that RDD.scala:445 *SHOULD* be protected by the 
check introduced in https://github.com/apache/spark/pull/13282 , but for some 
reason it does not seem to be.
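
For comparison only (purely illustrative, and NOT the actual check introduced in 
that PR), a defensive shape of the distribute closure that cannot hit a zero 
bound would look something like:

{code}
import java.util.Random

// Illustrative only -- not the change from https://github.com/apache/spark/pull/13282.
// A closure of this shape cannot hit Random.nextInt(0) even if the captured
// partition count deserializes badly on the executor.
def safeDistribute(numPartitions: Int): (Int, Iterator[Int]) => Iterator[(Int, Int)] =
  (index, items) => {
    val bound = math.max(numPartitions, 1) // guard the bound on the task side as well
    var position = new Random(index).nextInt(bound)
    items.map { t =>
      position += 1
      (position, t)
    }
  }
{code}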

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399684#comment-15399684
 ] 

Charles Allen commented on SPARK-16798:
---

I guess it doesn't need to be open; it can be closed. I'll see if I can get 
better testing around it regardless.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399682#comment-15399682
 ] 

Charles Allen commented on SPARK-16798:
---

[~srowen] sorry about that, fixed the priority. It's a blocker on my side and 
I'll work on making it reproducible here. If it's OK I'd like to keep this ticket 
open for a few days while I get a reproducible test to show the behavior.

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-16798:
--
Priority: Major  (was: Blocker)

> java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
> 
>
> Key: SPARK-16798
> URL: https://issues.apache.org/jira/browse/SPARK-16798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Code at https://github.com/metamx/druid-spark-batch which was working under 
> 1.5.2 has ceased to function under 2.0.0 with the below stacktrace.
> {code}
> java.lang.IllegalArgumentException: bound must be positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2

2016-07-29 Thread Charles Allen (JIRA)
Charles Allen created SPARK-16798:
-

 Summary: java.lang.IllegalArgumentException: bound must be 
positive : Worked in 1.5.2
 Key: SPARK-16798
 URL: https://issues.apache.org/jira/browse/SPARK-16798
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen
Priority: Blocker


Code at https://github.com/metamx/druid-spark-batch which was working under 
1.5.2 has ceased to function under 2.0.0 with the below stacktrace.

{code}
java.lang.IllegalArgumentException: bound must be positive
at java.util.Random.nextInt(Random.java:388)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445)
at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365117#comment-15365117
 ] 

Charles Allen commented on SPARK-16379:
---

That's great, thanks a ton!

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that everything works fine.
> I spotted that when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing mesos on your machine and try to 
> connect with spark shell from bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.
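
For anyone skimming the quoted description, a minimal illustration of the two 
logger shapes it compares (assumed names and slf4j on the classpath; this is not 
Spark's actual Logging trait):

{code}
import org.slf4j.{Logger, LoggerFactory}

// Illustration only -- not Spark's Logging trait. A @transient lazy val is
// re-initialized on first use after deserialization, and lazy-val initialization
// takes a lock, which is where the reported hang can bite; a def caches nothing
// and takes no such lock.
trait LazyValLogging extends Serializable {
  @transient lazy val log: Logger = LoggerFactory.getLogger(getClass)
}

trait DefLogging extends Serializable {
  def log: Logger = LoggerFactory.getLogger(getClass)
}
{code}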



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16379) Spark on mesos is broken due to race condition in Logging

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365109#comment-15365109
 ] 

Charles Allen commented on SPARK-16379:
---

[~srowen] is there a list of blockers somewhere? I also want to get branch-2.0 
tested from our side but would like to know what sort of caveats to expect.

> Spark on mesos is broken due to race condition in Logging
> -
>
> Key: SPARK-16379
> URL: https://issues.apache.org/jira/browse/SPARK-16379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.0.0
>
> Attachments: out.txt
>
>
> This commit introduced a transient lazy log val: 
> https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec
> This has caused problems in the past:
> https://github.com/apache/spark/pull/1004
> One commit before that everything works fine.
> I spotted that when my CI started to fail:
> https://ci.typesafe.com/job/mit-docker-test-ref/191/
> You can easily verify it by installing mesos on your machine and try to 
> connect with spark shell from bin dir:
> ./spark-shell --master mesos://zk://localhost:2181/mesos --conf 
> spark.executor.url=$(pwd)/../spark-2.0.0-SNAPSHOT-bin-test.tgz
> It gets stuck at the point where it tries to create the SparkContext.
> Logging gets stuck here:
> I0705 12:10:10.076617  9303 group.cpp:700] Trying to get 
> '/mesos/json.info_000152' in ZooKeeper
> I0705 12:10:10.076920  9304 detector.cpp:479] A new leading master 
> (UPID=master@127.0.1.1:5050) is detected
> I0705 12:10:10.076956  9303 sched.cpp:326] New master detected at 
> master@127.0.1.1:5050
> I0705 12:10:10.077057  9303 sched.cpp:336] No credentials provided. 
> Attempting to register without authentication
> I0705 12:10:10.090709  9301 sched.cpp:703] Framework registered with 
> 13553f8b-f42c-4f20-88cd-16f1cc153ede-0001
> I verified it also by changing @transient lazy val log to def and it works as 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364762#comment-15364762
 ] 

Charles Allen commented on SPARK-6028:
--

ClassLoader problem on my side. The loader was pulling in 1.5.2 classes for the 
driver but 1.6.1 classes in the tasks.

Ideally the default behavior would not have changed: the tasks would have 
launched and the class-version conflict would have shown up in the logs, rather 
than surfacing as a URI naming conflict.
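
For what it's worth, a quick generic probe (a sketch, not something from this 
thread) to confirm which jar a class really came from on the driver versus in 
the tasks:

{code}
// Generic probe, not from this thread: print the jar a class was actually loaded
// from. Mixed 1.5.2 / 1.6.1 classpaths show up as different locations on the
// driver versus the executors.
def whereIs(className: String): String = {
  val src = Class.forName(className).getProtectionDomain.getCodeSource
  if (src == null) "<bootstrap>" else src.getLocation.toString
}

println(whereIs("org.apache.spark.SparkContext"))
{code}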

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364757#comment-15364757
 ] 

Charles Allen commented on SPARK-6028:
--

Was semi-related. The patch changed the default from Akka to Netty, and an 
improper classloader setup in my app was loading the 1.5.2 classes instead of 
the 1.6.1 classes.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364589#comment-15364589
 ] 

Charles Allen commented on SPARK-6028:
--

Not sure if this change is the cause, but it seems related.

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module

2016-07-06 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364587#comment-15364587
 ] 

Charles Allen commented on SPARK-6028:
--

I just tried upgrading from 1.5.2 to 1.6.1 running Spark on Mesos. Everything 
was fine in 1.5.2.

None of the Mesos backend executors launch anymore, due to the following error 
(reported in failed tasks on the Mesos slaves):

{code}
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1563)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:157)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:259)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Invalid Spark URL: 
akka.tcp://sparkDriver@HOST_REDACTED:43709/user/CoarseGrainedScheduler
at 
org.apache.spark.rpc.netty.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:62)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:140)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:171)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
... 4 more
{code}

> Provide an alternative RPC implementation based on the network transport 
> module
> ---
>
> Key: SPARK-6028
> URL: https://issues.apache.org/jira/browse/SPARK-6028
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0
>
>
> Network transport module implements a low level RPC interface. We can build a 
> new RPC implementation on top of that to replace Akka's.
> Design document: 
> https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-07-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359430#comment-15359430
 ] 

Charles Allen commented on SPARK-11714:
---

Each entry would then be evaluated with {code}.format(port_num){code}

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-07-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359427#comment-15359427
 ] 

Charles Allen commented on SPARK-11714:
---

A real config might look something like

{code}-Dspark.blockManager.port=%s,-Dspark.executor.port=%s,-Dspark.shuffle.service.port=%s,-Dcom.sun.management.jmxremote.port=%s{code}
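
Expanded, the idea is just string formatting over ports pulled from the offer. A 
rough sketch of that expansion (assumed values, not the actual patch):

{code}
// Rough sketch of the proposed expansion, not the actual patch: each format-string
// entry gets one port from the Mesos "ports" resource substituted via format().
val template =
  "-Dspark.blockManager.port=%s,-Dspark.executor.port=%s,-Dspark.shuffle.service.port=%s,-Dcom.sun.management.jmxremote.port=%s"
val offeredPorts = Seq(31000, 31001, 31002, 31003) // ports taken from the offer (example values)

val extraJavaOpts = template.split(",").zip(offeredPorts)
  .map { case (entry, port) => entry.format(port) }
  .mkString(" ")
// => "-Dspark.blockManager.port=31000 -Dspark.executor.port=31001 ..."
{code}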

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11714) Make Spark on Mesos honor port restrictions

2016-07-01 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359413#comment-15359413
 ] 

Charles Allen commented on SPARK-11714:
---

My proposed solution to this is to have a new property for coarse mode which 
takes a comma-separated list of format strings that will be passed as extra 
Java options to the executor.

For example, if 
{code}-Dcom.sun.management.jmxremote.port=%s,-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=%s{code}
 is passed in the property, then both a JMX port and a remote debugging port 
will be acquired.

> Make Spark on Mesos honor port restrictions
> ---
>
> Key: SPARK-11714
> URL: https://issues.apache.org/jira/browse/SPARK-11714
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
>
> Currently the MesosSchedulerBackend does not make any effort to honor "ports" 
> as a resource offer in Mesos. This ask is to have the ports which the 
> executor binds to honor the limits of the "ports" resource of an offer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen closed SPARK-12248.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
> --
>
> Key: SPARK-12248
> URL: https://issues.apache.org/jira/browse/SPARK-12248
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
> Fix For: 2.0.0
>
>
> It is possible to have spark apps that work best with either more memory or 
> more CPU.
> In a multi-tenant environment (such as Mesos) it can be very beneficial to be 
> able to limit the Coarse scheduler to guarantee an executor doesn't subscribe 
> to too many cpus or too much memory.
> This ask is to add functionality to the Coarse Mesos Scheduler to have basic 
> limits to the ratio of memory to cpu, which default to the current behavior 
> of soaking up whatever resources it can.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios

2016-06-16 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334376#comment-15334376
 ] 

Charles Allen commented on SPARK-12248:
---

The limit of one task per slave seems to have been removed. That solves at 
least my use case in this matter. 

> Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios
> --
>
> Key: SPARK-12248
> URL: https://issues.apache.org/jira/browse/SPARK-12248
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Charles Allen
> Fix For: 2.0.0
>
>
> It is possible to have spark apps that work best with either more memory or 
> more CPU.
> In a multi-tenant environment (such as Mesos) it can be very beneficial to be 
> able to limit the Coarse scheduler to guarantee an executor doesn't subscribe 
> to too many cpus or too much memory.
> This ask is to add functionality to the Coarse Mesos Scheduler to have basic 
> limits to the ratio of memory to cpu, which default to the current behavior 
> of soaking up whatever resources it can.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-15992:
--
Attachment: (was: 
0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch)

> Code cleanup mesos coarse backend offer evaluation workflow
> ---
>
> Key: SPARK-15992
> URL: https://issues.apache.org/jira/browse/SPARK-15992
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>  Labels: code-cleanup
>
> The offer acceptance workflow is a little hard to follow and not very 
> extensible for future considerations for offers. This is a patch that makes 
> the workflow a little more explicit in its handling of offer resources.
> Patch incoming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-15994:
--
Attachment: (was: 0001-Add-ability-to-enable-mesos-fetch-cache.patch)

> Allow enabling Mesos fetch cache in coarse executor backend 
> 
>
> Key: SPARK-15994
> URL: https://issues.apache.org/jira/browse/SPARK-15994
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> Mesos 0.23.0 introduces a Fetch Cache feature 
> http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of 
> resources specified in command URIs.
> This patch:
> * Updates the Mesos shaded protobuf dependency to 0.23.0
> * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache 
> for all specified URIs. (URIs must be specified for the setting to have any 
> affect)
> * Updates documentation for Mesos configuration with the new setting.
> This patch does NOT:
> * Allow for per-URI caching configuration. The cache setting is global to ALL 
> URIs for the command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-15994:
--
Attachment: 0001-Add-ability-to-enable-mesos-fetch-cache.patch

> Allow enabling Mesos fetch cache in coarse executor backend 
> 
>
> Key: SPARK-15994
> URL: https://issues.apache.org/jira/browse/SPARK-15994
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
> Attachments: 0001-Add-ability-to-enable-mesos-fetch-cache.patch
>
>
> Mesos 0.23.0 introduces a Fetch Cache feature 
> http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of 
> resources specified in command URIs.
> This patch:
> * Updates the Mesos shaded protobuf dependency to 0.23.0
> * Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache 
> for all specified URIs. (URIs must be specified for the setting to have any 
> effect)
> * Updates documentation for Mesos configuration with the new setting.
> This patch does NOT:
> * Allow for per-URI caching configuration. The cache setting is global to ALL 
> URIs for the command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15994) Allow enabling Mesos fetch cache in coarse executor backend

2016-06-16 Thread Charles Allen (JIRA)
Charles Allen created SPARK-15994:
-

 Summary: Allow enabling Mesos fetch cache in coarse executor 
backend 
 Key: SPARK-15994
 URL: https://issues.apache.org/jira/browse/SPARK-15994
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Charles Allen


Mesos 0.23.0 introduces a Fetch Cache feature 
http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of 
resources specified in command URIs.

This patch:

* Updates the Mesos shaded protobuf dependency to 0.23.0
* Allows setting `spark.mesos.fetchCache.enable` to enable the fetch cache for 
all specified URIs. (URIs must be specified for the setting to have any effect)
* Updates documentation for Mesos configuration with the new setting.

This patch does NOT:

* Allow for per-URI caching configuration. The cache setting is global to ALL 
URIs for the command.
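
For illustration only (a sketch, not part of the patch): enabling the setting from application code could look roughly like the following. The property name is the one described above; the `spark.mesos.uris` property and the URI itself are assumptions made just for this example.

{code}
// Hedged sketch: enable the Mesos fetch cache described in this issue via SparkConf.
// URIs still have to be supplied somewhere (spark.mesos.uris here is an assumption)
// for the flag to have any effect.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fetch-cache-example")
  .setMaster("mesos://zk://zk.example.com:2181/mesos")           // placeholder master URL
  .set("spark.mesos.fetchCache.enable", "true")                  // setting from this patch
  .set("spark.mesos.uris", "hdfs://namenode/jars/extra-dep.jar") // assumed URI property

val sc = new SparkContext(conf)
try {
  // ... job body ...
} finally {
  sc.stop()
}
{code}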



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15992) Code cleanup mesos offer evaluation workflow

2016-06-16 Thread Charles Allen (JIRA)
Charles Allen created SPARK-15992:
-

 Summary: Code cleanup mesos offer evaluation workflow
 Key: SPARK-15992
 URL: https://issues.apache.org/jira/browse/SPARK-15992
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.0.0
Reporter: Charles Allen
 Attachments: 
0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch

The offer acceptance workflow is a little hard to follow and not very 
extensible for future considerations for offers. This is a patch that makes the 
workflow a little more explicit in its handling of offer resources.

Patch incoming
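
As a rough sketch of the direction (not the actual patch; all names here are illustrative), making the offer handling explicit amounts to modelling the outcome as data rather than nested conditionals:

{code}
// Hypothetical sketch: represent the outcome of evaluating a Mesos offer explicitly.
sealed trait OfferEvaluation
case class AcceptOffer(cpus: Double, memMb: Double) extends OfferEvaluation
case class DeclineOffer(reason: String) extends OfferEvaluation

def evaluateOffer(offerCpus: Double, offerMemMb: Double,
                  neededCpus: Double, neededMemMb: Double): OfferEvaluation = {
  if (offerCpus < neededCpus) DeclineOffer(s"not enough cpus: $offerCpus < $neededCpus")
  else if (offerMemMb < neededMemMb) DeclineOffer(s"not enough memory: $offerMemMb < $neededMemMb")
  else AcceptOffer(neededCpus, neededMemMb)
}
{code}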



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-15992:
--
Summary: Code cleanup mesos coarse backend offer evaluation workflow  (was: 
Code cleanup mesos offer evaluation workflow)

> Code cleanup mesos coarse backend offer evaluation workflow
> ---
>
> Key: SPARK-15992
> URL: https://issues.apache.org/jira/browse/SPARK-15992
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>  Labels: code-cleanup
> Attachments: 
> 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch
>
>
> The offer acceptance workflow is a little hard to follow and not very 
> extensible for future considerations for offers. This is a patch that makes 
> the workflow a little more explicit in its handling of offer resources.
> Patch incoming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow

2016-06-16 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-15992:
--
Attachment: 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch

> Code cleanup mesos coarse backend offer evaluation workflow
> ---
>
> Key: SPARK-15992
> URL: https://issues.apache.org/jira/browse/SPARK-15992
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>  Labels: code-cleanup
> Attachments: 
> 0001-Refactor-MesosCoarseGrainedSchedulerBackend-offer-co.patch
>
>
> The offer acceptance workflow is a little hard to follow and not very 
> extensible for future considerations for offers. This is a patch that makes 
> the workflow a little more explicit in its handling of offer resources.
> Patch incoming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11183) enable support for mesos 0.24+

2016-06-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320914#comment-15320914
 ] 

Charles Allen commented on SPARK-11183:
---

Eventually it could be worth adopting something like 
https://github.com/mesosphere/mesos-rxjava to plug into the mesos cluster

> enable support for mesos 0.24+
> --
>
> Key: SPARK-11183
> URL: https://issues.apache.org/jira/browse/SPARK-11183
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Ioannis Polyzos
>
> In mesos 0.24, the mesos leader info in ZK has changed to JSON; this results in 
> spark failing to run on 0.24+.
> References : 
>   https://issues.apache.org/jira/browse/MESOS-2340 
>   
> http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E
>   https://github.com/mesos/elasticsearch/issues/338
>   https://github.com/spark-jobserver/spark-jobserver/issues/267



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11183) enable support for mesos 0.24+

2016-06-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320840#comment-15320840
 ] 

Charles Allen commented on SPARK-11183:
---

Being able to enable the fetch cache 
http://mesos.apache.org/documentation/latest/fetcher/ would be nice also

> enable support for mesos 0.24+
> --
>
> Key: SPARK-11183
> URL: https://issues.apache.org/jira/browse/SPARK-11183
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Ioannis Polyzos
>
> In mesos 0.24, the mesos leader info in ZK has changed to JSON; this results in 
> spark failing to run on 0.24+.
> References : 
>   https://issues.apache.org/jira/browse/MESOS-2340 
>   
> http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E
>   https://github.com/mesos/elasticsearch/issues/338
>   https://github.com/spark-jobserver/spark-jobserver/issues/267



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-05-18 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289676#comment-15289676
 ] 

Charles Allen commented on SPARK-6305:
--

Shading is often used as an artificial ClassLoader, with the exception that you 
can't replace classes by replacing jars. If it is used to mean "we don't want 
you to replace stock classes", that's fine; but if it is used because 
"ClassLoader isolation and dependency tracking is hard", then ultimately that is 
bad. (Spark has lots of fun classloaders to deal with; I know it is not trivial.)

For a usage example from my side, the spark druid indexer at 
https://github.com/metamx/druid-spark-batch uses good ol' fashioned jars 
(without shading or assembly) with some primitive classloader isolation through 
https://github.com/druid-io/druid/blob/master/indexing-service/src/main/java/io/druid/indexing/common/task/HadoopTask.java#L128

This means the following jars are in a directory which is loaded in a classloader 
for the driver:

activation-1.1.1.jar
akka-actor_2.10-2.3.11.jar
akka-remote_2.10-2.3.11.jar
akka-slf4j_2.10-2.3.11.jar
aopalliance-1.0.jar
asm-3.1.jar
avro-1.7.7.jar
avro-ipc-1.7.7.jar
avro-ipc-1.7.7-tests.jar
avro-mapred-1.7.7-hadoop2.jar
base64-2.3.8.jar
bcprov-jdk15on-1.51.jar
chill_2.10-0.5.0.jar
chill-java-0.5.0.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.8.0.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.1.jar
commons-compress-1.4.1.jar
commons-configuration-1.6.jar
commons-digester-1.8.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-lang3-3.3.2.jar
commons-math3-3.4.1.jar
commons-net-2.2.jar
compress-lzf-1.0.3.jar
config-1.2.1.jar
curator-client-2.4.0.jar
curator-framework-2.4.0.jar
curator-recipes-2.4.0.jar
gmbal-api-only-3.0.0-b023.jar
grizzly-framework-2.1.2.jar
grizzly-http-2.1.2.jar
grizzly-http-server-2.1.2.jar
grizzly-http-servlet-2.1.2.jar
grizzly-rcm-2.1.2.jar
guice-3.0.jar
hadoop-annotations-2.4.0-mmx6.jar
hadoop-auth-2.4.0-mmx6.jar
hadoop-client-2.4.0-mmx6.jar
hadoop-common-2.4.0-mmx6.jar
hadoop-hdfs-2.4.0-mmx6.jar
hadoop-mapreduce-client-app-2.4.0-mmx6.jar
hadoop-mapreduce-client-common-2.4.0-mmx6.jar
hadoop-mapreduce-client-core-2.4.0-mmx6.jar
hadoop-mapreduce-client-jobclient-2.4.0-mmx6.jar
hadoop-mapreduce-client-shuffle-2.4.0-mmx6.jar
hadoop-yarn-api-2.2.0.jar
hadoop-yarn-client-2.2.0.jar
hadoop-yarn-common-2.2.0.jar
hadoop-yarn-server-common-2.4.0-mmx6.jar
httpclient-4.3.6.jar
httpcore-4.3.3.jar
ivy-2.4.0.jar
jackson-annotations-2.4.0.jar
jackson-core-2.4.4.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.4.4.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.10-2.4.4.jar
jackson-xc-1.9.13.jar
javax.inject-1.jar
java-xmlbuilder-1.0.jar
javax.servlet-3.0.0.v201112011016.jar
javax.servlet-3.1.jar
javax.servlet-api-3.0.1.jar
jaxb-api-2.2.2.jar
jaxb-impl-2.2.3-1.jar
jcl-over-slf4j-1.7.10.jar
jersey-client-1.9.jar
jersey-core-1.9.jar
jersey-grizzly2-1.9.jar
jersey-guice-1.9.jar
jersey-json-1.9.jar
jersey-server-1.9.jar
jersey-test-framework-core-1.9.jar
jersey-test-framework-grizzly2-1.9.jar
jets3t-0.9.3.jar
jettison-1.1.jar
jetty-util-6.1.26.jar
jline-0.9.94.jar
json4s-ast_2.10-3.2.10.jar
json4s-core_2.10-3.2.10.jar
json4s-jackson_2.10-3.2.10.jar
jsr305-1.3.9.jar
jul-to-slf4j-1.7.10.jar
kryo-2.21.jar
log4j-1.2.17.jar
lz4-1.3.0.jar
mail-1.4.7.jar
management-api-3.0.0-b012.jar
mesos-0.21.1-shaded-protobuf.jar
metrics-core-3.1.2.jar
metrics-graphite-3.1.2.jar
metrics-json-3.1.2.jar
metrics-jvm-3.1.2.jar
minlog-1.2.jar
mx4j-3.0.2.jar
netty-3.8.0.Final.jar
netty-all-4.0.29.Final.jar
objenesis-1.2.jar
oro-2.0.8.jar
paranamer-2.6.jar
protobuf-java-2.5.0.jar
py4j-0.8.2.1.jar
pyrolite-4.4.jar
reflectasm-1.07-shaded.jar
RoaringBitmap-0.4.5.jar
scala-compiler-2.10.4.jar
scala-library-2.10.4.jar
scalap-2.10.4.jar
scala-reflect-2.10.4.jar
slf4j-api-1.7.10.jar
slf4j-log4j12-1.7.10.jar
snappy-java-1.1.1.7.jar
spark-core_2.10-1.5.2-mmx4.jar
spark-launcher_2.10-1.5.2-mmx4.jar
spark-network-common_2.10-1.5.2-mmx4.jar
spark-network-shuffle_2.10-1.5.2-mmx4.jar
spark-unsafe_2.10-1.5.2-mmx4.jar
stream-2.7.0.jar
tachyon-client-0.7.1.jar
tachyon-underfs-hdfs-0.7.1.jar
tachyon-underfs-local-0.7.1.jar
uncommons-maths-1.2.2a.jar
unused-1.0.0.jar
xmlenc-0.52.jar
xz-1.0.jar
zookeeper-3.4.5.jar

So that's the list of jars spark thinks it needs to get a driver to connect and 
launch a task.

I haven't bothered to go through and clean out the unwanted jars because the 
classloader isolation is smart (lucky?) enough that they don't interfere.
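
For illustration only (this is not the HadoopTask code linked above, just a minimal sketch of the same idea): loading a directory of plain jars in its own URLClassLoader is what lets individual jars be swapped without rebuilding a shaded assembly.

{code}
// Minimal sketch of directory-based classloader isolation; paths and names are illustrative.
import java.io.File
import java.net.{URL, URLClassLoader}

def isolatedLoader(jarDir: File, parent: ClassLoader): URLClassLoader = {
  val urls: Array[URL] = Option(jarDir.listFiles())
    .getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".jar"))
    .map(_.toURI.toURL)
  new URLClassLoader(urls, parent)
}

// e.g. resolve driver-side classes from a directory of plain jars:
// val loader = isolatedLoader(new File("/opt/spark-driver-jars"), getClass.getClassLoader)
// val cls = Class.forName("org.apache.spark.SparkContext", true, loader)
{code}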

The point is, I can go replace specific jars to control what versions of stuff 
are used. For example, I can update mesos versions for the driver independent 
of spark versions, or change the version of hadoop utilized by spark. There is 
an argument to be made that enforcing running the Spark test suite against 
these libs is probably a good 

[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-05-18 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289508#comment-15289508
 ] 

Charles Allen commented on SPARK-6305:
--

For what it's worth, I went through a similar exercise for druid.io recently. 
Here's my resulting list of "Hadoop go away" exclusions: 
https://github.com/druid-io/druid/blob/druid-0.9.0/extensions-core/hdfs-storage/pom.xml#L49

Getting the provided scopes, exclusions, and the like 
sorted out for dependencies is not trivial. And one of the frustrating things 
with spark is its "screw it, make it a shaded assembly" approach to 
dependencies (anyone know how to get the new s3a stuff from the hadoop storage 
extension to work?). Not sure if there is an overall epic of "handle jar 
dependencies better", but I think this ask would fit better under that than 
simply a blanket update of which slf4j impl spark wants to use.
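
(For anyone working in sbt rather than Maven, the same kind of exclusion looks roughly like the snippet below; the coordinates are examples only, and the real list is in the pom linked above.)

{code}
// Sketch of dependency exclusions in a build.sbt; coordinates are illustrative.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.2")
  .exclude("org.slf4j", "slf4j-log4j12")
  .exclude("log4j", "log4j")
{code}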

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14537) [CORE] SparkContext init hangs if master removes application before backend is ready.

2016-04-11 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-14537:
--
Description: 
During the course of the init of the spark context, the following code is 
executed.

{code}

setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()

// Post init
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
_executorAllocationManager.foreach { e =>
  _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}

{code}


If the _taskScheduler.postStartHook() is waiting for a signal from the backend 
that it is ready, AND the driver is disconnected from the master scheduler due 
to a message similar to the one below:

{code}
ERROR [sparkDriver-akka.actor.default-dispatcher-20] 
org.apache.spark.rpc.akka.AkkaRpcEnv - Ignore error: Exiting due to error from 
cluster scheduler: Master removed our application: FAILED
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: 
Master removed our application: FAILED
at 
org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:431) 
~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:122)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:243)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:167)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 [scala-library-2.10.5.jar:?]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 [scala-library-2.10.5.jar:?]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 [scala-library-2.10.5.jar:?]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) 
[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) 
[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) 
[scala-library-2.10.5.jar:?]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) 
[akka-actor_2.10-2.3.11.jar:?]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.actor.ActorCell.invoke(ActorCell.scala:487) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.dispatch.Mailbox.run(Mailbox.scala:220) 
[akka-actor_2.10-2.3.11.jar:?]
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
 [akka-actor_2.10-2.3.11.jar:?]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 [scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [scala-library-2.10.5.jar:?]
{code}

Then the SparkContext will hang on init because the waiting for the backend to 
be ready never checks to make sure the context is still running:

{code:title=TaskSchedulerImpl.scala}

  private def waitBackendReady(): Unit = {
    if (backend.isReady) {
      return
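    }
    // (The archived description is truncated here. For context only, the 1.5.x
    //  implementation continues roughly as sketched below: it keeps waiting on
    //  backend.isReady alone and never re-checks whether the SparkContext has
    //  been stopped, which is why init can hang once the master has removed the
    //  application.)
    // while (!backend.isReady) {
    //   synchronized {
    //     this.wait(100)
    //   }
    // }
  }
{code}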
 

[jira] [Created] (SPARK-14537) [CORE] SparkContext init hangs if master removes application before backend is ready.

2016-04-11 Thread Charles Allen (JIRA)
Charles Allen created SPARK-14537:
-

 Summary: [CORE] SparkContext init hangs if master removes 
application before backend is ready.
 Key: SPARK-14537
 URL: https://issues.apache.org/jira/browse/SPARK-14537
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.5.2
Reporter: Charles Allen


During the course of the init of the spark context, the following code is 
executed.

{code:scala}

setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()

// Post init
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
_executorAllocationManager.foreach { e =>
  _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}

{code}


If the _taskScheduler.postStartHook() is waiting for a signal from the backend 
that it is ready, AND the driver is disconnected from the master scheduler due 
to a message similar to the one below:

{code}
ERROR [sparkDriver-akka.actor.default-dispatcher-20] 
org.apache.spark.rpc.akka.AkkaRpcEnv - Ignore error: Exiting due to error from 
cluster scheduler: Master removed our application: FAILED
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: 
Master removed our application: FAILED
at 
org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:431) 
~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:122)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:243)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:167)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
 ~[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 [scala-library-2.10.5.jar:?]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 [scala-library-2.10.5.jar:?]
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 [scala-library-2.10.5.jar:?]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59) 
[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) 
[spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) 
[scala-library-2.10.5.jar:?]
at 
org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:467) 
[akka-actor_2.10-2.3.11.jar:?]
at 
org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
 [spark-core_2.10-1.5.2-mmx1.jar:1.5.2-mmx1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.actor.ActorCell.invoke(ActorCell.scala:487) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) 
[akka-actor_2.10-2.3.11.jar:?]
at akka.dispatch.Mailbox.run(Mailbox.scala:220) 
[akka-actor_2.10-2.3.11.jar:?]
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
 [akka-actor_2.10-2.3.11.jar:?]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 [scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[scala-library-2.10.5.jar:?]
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [scala-library-2.10.5.jar:?]
{code}

Then the SparkContext will hang on 

[jira] [Created] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)
Charles Allen created SPARK-13085:
-

 Summary: Add scalastyle command used in build testing
 Key: SPARK-13085
 URL: https://issues.apache.org/jira/browse/SPARK-13085
 Project: Spark
  Issue Type: Wish
  Components: Build, Tests
Reporter: Charles Allen


As an occasional or new contributor, it is easy to screw up scala style. But 
looking at the output logs (for example 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
 ) it is not obvious how to fix the scala style tests, even when reading the 
scala guide.

{code}

Running Scala style checks

Scalastyle checks failed at following occurrences:
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
 import.ordering.wrongOrderInGroup.message
[error] (core/compile:scalastyle) errors exist
[error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala 
; received return code 1
{code}

This ask is that the command used to check scalastyle is presented in the log 
so a developer does not have to wait for the build process to check if a pull 
request should pass scala style checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-13085:
--
Description: 
As an occasional or new contributor, it is easy to screw up scala style. But 
looking at the output logs (for example 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
 ) it is not obvious how to fix the scala style tests, even when reading the 
scala style guide.

{code}

Running Scala style checks

Scalastyle checks failed at following occurrences:
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
 import.ordering.wrongOrderInGroup.message
[error] (core/compile:scalastyle) errors exist
[error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala 
; received return code 1
{code}

This ask is that the command used to check scalastyle is presented in the log 
so a developer does not have to wait for the build process to check if a pull 
request should pass scala style checks.

  was:
As an occasional or new contributor, it is easy to screw up scala style. But 
looking at the output logs (for example 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
 ) it is not obvious how to fix the scala style tests, even when reading the 
scala guide.

{code}

Running Scala style checks

Scalastyle checks failed at following occurrences:
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
 import.ordering.wrongOrderInGroup.message
[error] (core/compile:scalastyle) errors exist
[error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala 
; received return code 1
{code}

This ask is that the command used to check scalastyle is presented in the log 
so a developer does not have to wait for the build process to check if a pull 
request should pass scala style checks.


> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> This ask is that the command used to check scalastyle is presented in the log 
> so a developer does not have to wait for the build process to check if a pull 
> request should pass scala style checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124518#comment-15124518
 ] 

Charles Allen commented on SPARK-13085:
---

I wanted to know what command was failing the build and it was not obvious from 
the build logs.

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> This ask is that the command used to check scalastyle is presented in the log 
> so a developer does not have to wait for the build process to check if a pull 
> request should pass scala style checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13085) Add scalastyle command used in build testing

2016-01-29 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123946#comment-15123946
 ] 

Charles Allen commented on SPARK-13085:
---

{code}
mvn scalastyle:check
{code}

was able to produce a similar error, but it is not obvious if that is the same 
command the build uses.

> Add scalastyle command used in build testing
> 
>
> Key: SPARK-13085
> URL: https://issues.apache.org/jira/browse/SPARK-13085
> Project: Spark
>  Issue Type: Wish
>  Components: Build, Tests
>Reporter: Charles Allen
>
> As an occasional or new contributor, it is easy to screw up scala style. But 
> looking at the output logs (for example 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50300/consoleFull
>  ) it is not obvious how to fix the scala style tests, even when reading the 
> scala style guide.
> {code}
> 
> Running Scala style checks
> 
> Scalastyle checks failed at following occurrences:
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:22:0:
>  import.ordering.wrongOrderInGroup.message
> [error] (core/compile:scalastyle) errors exist
> [error] Total time: 9 s, completed Jan 28, 2016 2:11:00 PM
> [error] running 
> /home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
> return code 1
> {code}
> This ask is that the command used to check scalastyle is presented in the log 
> so a developer does not have to wait for the build process to check if a pull 
> request should pass scala style checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1865) Improve behavior of cleanup of disk state

2015-12-22 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068542#comment-15068542
 ] 

Charles Allen commented on SPARK-1865:
--

This is compounded by the fact that some of the shutdown processes will stop() 
during the normal course of the main thread, then will fail to wait for 
completion if stop() is ALSO called via the shutdown hook.
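
As a sketch of the race being described (purely illustrative; this is not Spark code): if the second stop() caller returns without waiting, whichever path loses the race can exit before cleanup has actually finished.

{code}
// Hedged sketch: a stop() that is safe to call from both the main thread and a
// JVM shutdown hook; the losing caller blocks until cleanup has really completed.
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

class CleanupService {
  private val stopped = new AtomicBoolean(false)
  private val done = new CountDownLatch(1)

  def stop(): Unit = {
    if (stopped.compareAndSet(false, true)) {
      try {
        // ... delete shuffle files, block manager directories, etc. ...
      } finally {
        done.countDown()
      }
    } else {
      // Second caller (e.g. the shutdown hook) waits for the cleanup to finish
      // instead of returning immediately.
      done.await()
    }
  }
}
{code}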

> Improve behavior of cleanup of disk state
> -
>
> Key: SPARK-1865
> URL: https://issues.apache.org/jira/browse/SPARK-1865
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Reporter: Aaron Davidson
>
> Right now the behavior of disk cleanup is centered around the exit hook of 
> the executor, which attempts to cleanup shuffle files and disk manager 
> blocks, but may fail. We should make this behavior more predictable, perhaps 
> by letting the Standalone Worker cleanup the disk state, and adding a flag to 
> disable having the executor cleanup its own state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058819#comment-15058819
 ] 

Charles Allen commented on SPARK-12330:
---

Looks like the mesos coarse scheduler underwent a lot of changes in 1.6 vs 1.5. 
In the 1.6 branch I'm getting errors when terminating the coarse tasks, and the 
tasks are reported as failed.

https://github.com/metamx/spark/commit/338f511e3f6ef03f457555d838ed3a694a77dece

That same stuff against 1.5.1 has a nice and clean shutdown of the executors on 
mesos, reporting FINISHED instead of FAILED or KILLED.

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>
> A simple line count example can be launched as similar to 
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR 

[jira] [Commented] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058729#comment-15058729
 ] 

Charles Allen commented on SPARK-12330:
---

This is because the CoarseMesosSchedulerBackend does not wait for the 
environment to report tasks finished before shutting down the mesos driver. I 
have a fix that seems to be working against 1.5.1. Will see if I can 
cherry-pick it to master.
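
A rough sketch of the shape of the fix (hypothetical, not the linked commit): have the coarse-grained backend block in stop() until every Mesos task has reported a terminal status, and only then stop the Mesos driver, so executors get to finish their shutdown hooks and remove the blockmgr directory.

{code}
// Illustrative only; names do not come from the actual patch.
import java.util.concurrent.{CountDownLatch, TimeUnit}

class TaskShutdownTracker(taskCount: Int) {
  private val remaining = new CountDownLatch(taskCount)

  // Call from statusUpdate() when a task reaches FINISHED / FAILED / KILLED / LOST.
  def taskTerminated(): Unit = remaining.countDown()

  // Call from stop() before mesosDriver.stop(): wait (bounded) for all tasks to end.
  def awaitAllTerminated(timeoutSec: Long): Boolean =
    remaining.await(timeoutSec, TimeUnit.SECONDS)
}
{code}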

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>
> A simple line count example can be launched as similar to 
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
> SIGTERM
> 15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
> no more java logs
> {code}
> If the 

[jira] [Created] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-14 Thread Charles Allen (JIRA)
Charles Allen created SPARK-12330:
-

 Summary: Mesos coarse executor does not cleanup blockmgr properly 
on termination if data is stored on disk
 Key: SPARK-12330
 URL: https://issues.apache.org/jira/browse/SPARK-12330
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Mesos
Affects Versions: 1.5.1
Reporter: Charles Allen


A simple line count example can be launched as similar to 

{code}
SPARK_HOME=/mnt/tmp/spark 
MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
--conf spark.mesos.executor.memoryOverhead=2048 --conf 
spark.mesos.executor.home=/mnt/tmp/spark --conf 
spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
-XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
-verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
{code}

In the shell the following lines can be executed:

{code}
val text_file = 
sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
{code}
{code}
text_file.map(l => 1).sum
{code}
which will result in
{code}
res0: Double = 6001215.0
{code}
for the TPCH 1GB dataset

Unfortunately the blockmgr directory remains on the executor node after 
termination of the spark context.

The log on the executor looks like this near the termination:

{code}
I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
from slave(1)@172.19.67.30:5051
I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
__http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
__gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
(2)@172.19.67.30:58604
I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191037952+00:00
I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
(2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
02:12:36.191062016+00:00)
I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191059200+00:00
15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
02:12:31.240091136+00:00
I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
02:12:31.240036096+00:00
I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
02:12:31.340212992+00:00)
I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
(1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
02:12:34.247005952+00:00)
15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
SIGTERM
15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called

no more java logs
{code}

If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in 
memory, then the following log can be seen at termination instead:
{code}
I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 
from slave(1)@172.19.67.24:5051
I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.24:60016
I1215 01:19:16.247747 120034 process.cpp:2392] Resuming 
executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
I1215 01:19:16.247758 120030 process.cpp:2392] Resuming 
__gc__@172.19.67.24:60016 at 2015-12-15 

[jira] [Updated] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-14 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-12330:
--
Description: 
A simple line count example can be launched as similar to 

{code}
SPARK_HOME=/mnt/tmp/spark 
MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
--conf spark.mesos.executor.memoryOverhead=2048 --conf 
spark.mesos.executor.home=/mnt/tmp/spark --conf 
spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
-XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
-verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
{code}

In the shell the following lines can be executed:

{code}
val text_file = 
sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
{code}
{code}
text_file.map(l => 1).sum
{code}
which will result in
{code}
res0: Double = 6001215.0
{code}
for the TPCH 1GB dataset

Unfortunately the blockmgr directory remains on the executor node after 
termination of the spark context.

The log on the executor looks like this near the termination:

{code}
I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
from slave(1)@172.19.67.30:5051
I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
__http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
__gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
(2)@172.19.67.30:58604
I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191037952+00:00
I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
(2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
02:12:36.191062016+00:00)
I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191059200+00:00
15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
02:12:31.240091136+00:00
I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
02:12:31.240036096+00:00
I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
02:12:31.340212992+00:00)
I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
(1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
02:12:34.247005952+00:00)
15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
SIGTERM
15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called

no more java logs
{code}

If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in 
memory, then the following log can be seen at termination instead:
{code}
I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 
from slave(1)@172.19.67.24:5051
I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.24:60016
I1215 01:19:16.247747 120034 process.cpp:2392] Resuming 
executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
I1215 01:19:16.247758 120030 process.cpp:2392] Resuming 
__gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247755008+00:00
I1215 01:19:16.247772 120034 exec.cpp:381] Executor asked to shutdown
I1215 01:19:16.247772 120038 process.cpp:2392] Resuming 
__http__(4)@172.19.67.24:60016 at 2015-12-15 01:19:16.247767808+00:00
I1215 01:19:16.247791 120038 process.cpp:2497] 

[jira] [Resolved] (SPARK-12226) Docs for Mesos don't mention shaded protobuf version

2015-12-09 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen resolved SPARK-12226.
---
Resolution: Won't Fix

Closing as won't fix for now since it's not obvious this is a spark-supported use 
case

> Docs for Mesos don't mention shaded protobuf version
> 
>
> Key: SPARK-12226
> URL: https://issues.apache.org/jira/browse/SPARK-12226
> Project: Spark
>  Issue Type: Documentation
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Charles Allen
>Priority: Minor
>
> http://spark.apache.org/docs/latest/running-on-mesos.html does not mention 
> that org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend is 
> compiled against the shaded version of the mesos java library. As such the 
> need to use mesos--shaded-protobuf.jar instead of mesos-.jar is not 
> apparent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12226) Docs for Mesos don't mention shaded protobuf version

2015-12-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049393#comment-15049393
 ] 

Charles Allen commented on SPARK-12226:
---

[~srowen] Since this is in a project that I'm assembling from independent jars, 
rather than using a spark-assembly directly, I've been considering how best to 
communicate this requirement (for an admittedly corner case).

Putting it on http://spark.apache.org/docs/latest/running-on-mesos.html seems 
like it might be more confusing than helpful, since it really only pertains to 
people building new Spark bundles. Any suggestion on where a good place to 
document it would be?

> Docs for Mesos don't mention shaded protobuf version
> 
>
> Key: SPARK-12226
> URL: https://issues.apache.org/jira/browse/SPARK-12226
> Project: Spark
>  Issue Type: Documentation
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Charles Allen
>Priority: Minor
>
> http://spark.apache.org/docs/latest/running-on-mesos.html does not mention 
> that org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend is 
> compiled against the shaded version of the mesos java library. As such the 
> need to use mesos--shaded-protobuf.jar instead of mesos-.jar is not 
> apparent.






[jira] [Created] (SPARK-12248) Make Spark Coarse Mesos Scheduler obey limits on memory/cpu ratios

2015-12-09 Thread Charles Allen (JIRA)
Charles Allen created SPARK-12248:
-

 Summary: Make Spark Coarse Mesos Scheduler obey limits on 
memory/cpu ratios
 Key: SPARK-12248
 URL: https://issues.apache.org/jira/browse/SPARK-12248
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Charles Allen


Spark apps can work best with either more memory or more CPU.

In a multi-tenant environment (such as Mesos) it can be very beneficial to be 
able to limit the coarse-grained scheduler so that an executor is guaranteed not 
to claim too many CPUs or too much memory.

This ask is to add basic limits on the memory-to-CPU ratio to the coarse-grained 
Mesos scheduler, defaulting to the current behavior of soaking up whatever 
resources it can.
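A minimal sketch of the kind of limit being asked for, assuming a hypothetical per-core memory cap (the names below are illustrative, not existing scheduler code):

{code}
// Illustrative only: cap how much memory the coarse-grained scheduler
// accepts from an offer, relative to the CPUs it takes. The helper and
// parameter names here are hypothetical.
object RatioLimitSketch {
  def resourcesToAccept(offerCpus: Double,
                        offerMemMb: Double,
                        maxMemMbPerCpu: Double): (Double, Double) = {
    // Accept all offered CPUs, but never more memory than the configured
    // ratio allows; with no limit this degenerates to the current behavior
    // of taking everything offered.
    val acceptedMem = math.min(offerMemMb, offerCpus * maxMemMbPerCpu)
    (offerCpus, acceptedMem)
  }
}
{code}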






[jira] [Created] (SPARK-12226) Docs for Mesos don't mention shaded protobuf version

2015-12-08 Thread Charles Allen (JIRA)
Charles Allen created SPARK-12226:
-

 Summary: Docs for Mesos don't mention shaded protobuf version
 Key: SPARK-12226
 URL: https://issues.apache.org/jira/browse/SPARK-12226
 Project: Spark
  Issue Type: Documentation
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Charles Allen


http://spark.apache.org/docs/latest/running-on-mesos.html does not mention that 
org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend is compiled 
against the shaded version of the mesos java library. As such the need to use 
mesos--shaded-protobuf.jar instead of mesos-.jar is not apparent.
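For anyone assembling such a bundle themselves, the dependency in question looks roughly like the following sbt fragment (the Mesos version below is a placeholder; the point is the shaded-protobuf classifier):

{code}
// build.sbt fragment (illustrative): pull in the shaded-protobuf variant
// of the Mesos Java library instead of the plain mesos jar.
val mesosVersion = "0.21.1" // placeholder: use the Mesos release you target

libraryDependencies +=
  "org.apache.mesos" % "mesos" % mesosVersion classifier "shaded-protobuf"
{code}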






[jira] [Updated] (SPARK-12226) Docs for Mesos don't mention shaded protobuf version

2015-12-08 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-12226:
--
Priority: Minor  (was: Major)

> Docs for Mesos don't mention shaded protobuf version
> 
>
> Key: SPARK-12226
> URL: https://issues.apache.org/jira/browse/SPARK-12226
> Project: Spark
>  Issue Type: Documentation
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Charles Allen
>Priority: Minor
>
> http://spark.apache.org/docs/latest/running-on-mesos.html does not mention 
> that org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend is 
> compiled against the shaded version of the mesos java library. As such the 
> need to use mesos--shaded-protobuf.jar instead of mesos-.jar is not 
> apparent.






[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-11-16 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007640#comment-15007640
 ] 

Charles Allen commented on SPARK-11016:
---

[~davies] Was in a meeting, looks like you got it :)

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
> Fix For: 1.6.0
>
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info






[jira] [Created] (SPARK-11714) Make Spark on Mesos honor port restrictions

2015-11-12 Thread Charles Allen (JIRA)
Charles Allen created SPARK-11714:
-

 Summary: Make Spark on Mesos honor port restrictions
 Key: SPARK-11714
 URL: https://issues.apache.org/jira/browse/SPARK-11714
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Charles Allen


Currently the MesosSchedulerBackend does not make any effort to honor the 
"ports" resource in a Mesos offer. This ask is to have the ports the executor 
binds to fall within the limits of the offer's "ports" resource.
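A rough sketch of the check being asked for, assuming the offer's "ports" resource has already been parsed into inclusive ranges (names below are illustrative, not existing scheduler code):

{code}
// Illustrative only: pick executor ports from the ranges granted by the
// Mesos offer's "ports" resource instead of binding to arbitrary ports.
object PortsSketch {
  // Ranges are inclusive (begin, end) pairs as they appear in an offer.
  def choosePort(offeredRanges: Seq[(Long, Long)]): Option[Long] =
    offeredRanges.headOption.map(_._1)

  def withinOffer(port: Long, offeredRanges: Seq[(Long, Long)]): Boolean =
    offeredRanges.exists { case (lo, hi) => port >= lo && port <= hi }
}
{code}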






[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-19 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964043#comment-14964043
 ] 

Charles Allen commented on SPARK-11016:
---

[~srowen] I confirmed locally that https://github.com/metamx/spark/pull/1 
prevents this error, but as per your prior comment a "more correct" 
implementation would probably provide a Kryo Externalizable bridge of some 
kind. 

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info






[jira] [Commented] (SPARK-8142) Spark Job Fails with ResultTask ClassCastException

2015-10-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951238#comment-14951238
 ] 

Charles Allen commented on SPARK-8142:
--

I had a failure similar to the one in this issue and solved it by setting 
"spark.executor.userClassPathFirst" to "false" and 
"spark.driver.userClassPathFirst" to "false".

> Spark Job Fails with ResultTask ClassCastException
> --
>
> Key: SPARK-8142
> URL: https://issues.apache.org/jira/browse/SPARK-8142
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Dev Lakhani
>
> When running a Spark Job, I get no failures in the application code 
> whatsoever but a weird ResultTask Class exception. In my job, I create a RDD 
> from HBase and for each partition do a REST call on an API, using a REST 
> client.  This has worked in IntelliJ but when I deploy to a cluster using 
> spark-submit.sh I get :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, host): java.lang.ClassCastException: 
> org.apache.spark.scheduler.ResultTask cannot be cast to 
> org.apache.spark.scheduler.Task
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> These are the configs I set to override the spark classpath because I want to 
> use my own glassfish jersey version:
>  
> sparkConf.set("spark.driver.userClassPathFirst","true");
> sparkConf.set("spark.executor.userClassPathFirst","true");
> I see no other warnings or errors in any of the logs.
> Unfortunately I cannot post my code, but please ask me questions that will 
> help debug the issue. Using spark 1.3.1 hadoop 2.6.






[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-09 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950651#comment-14950651
 ] 

Charles Allen commented on SPARK-11016:
---

[~srowen] As mentioned in 
https://issues.apache.org/jira/browse/SPARK-5949?focusedCommentId=14949819=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14949819
 Spark relies on native Kryo serde for RoaringBitmap in KryoSerializer: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L368
 including the protected Element class: 
https://github.com/lemire/RoaringBitmap/blob/RoaringBitmap-0.4.5/src/main/java/org/roaringbitmap/RoaringArray.java#L361
 which was removed in 0.5.0 and later (Spark is currently on 0.4.5).

The SerDe method sanctioned by the RoaringBitmap library is to use the 
serialize and deserialize methods provided by RoaringBitmap or RoaringArray. 
Relying on the protected class causes conflicts whenever a 0.5.0 or later 
version of the RoaringBitmap library is on the classpath, because Spark 
unavoidably fails when it tries to register everything in 
org.apache.spark.serializer.KryoSerializer#toRegister, including the 
no-longer-existing protected static inner class.

I took a quick stab at a patch locally by registering RoaringBitmap and 
RoaringArray with a com.esotericsoftware.kryo.Serializer, but it is not clear 
how close Kryo's Input and Output are to DataInput / DataOutput, which means a 
bridging approach might violate the contract of one or the other.
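A minimal sketch of that bridging approach, assuming the Kryo 2.x-style Serializer API (illustrative only, not Spark's actual code): Kryo's Output and Input extend java.io.OutputStream and java.io.InputStream, so the standard Data* wrappers can sit on top of them.

{code}
import java.io.{DataInputStream, DataOutputStream}

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.roaringbitmap.RoaringBitmap

// Serialize RoaringBitmap through its own DataOutput/DataInput API by
// wrapping Kryo's streams; register with
// kryo.register(classOf[RoaringBitmap], new RoaringBitmapSerializer).
class RoaringBitmapSerializer extends Serializer[RoaringBitmap] {
  override def write(kryo: Kryo, output: Output, bitmap: RoaringBitmap): Unit = {
    val out = new DataOutputStream(output)
    bitmap.serialize(out)
    out.flush()
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[RoaringBitmap]): RoaringBitmap = {
    val bitmap = new RoaringBitmap()
    bitmap.deserialize(new DataInputStream(input))
    bitmap
  }
}
{code}

Whether writing through a DataOutputStream wrapper fully preserves the contract Kryo expects of its Output buffer is exactly the open question above.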

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info




[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-10-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949525#comment-14949525
 ] 

Charles Allen commented on SPARK-5949:
--

[~lemire] pinging to see if you have any suggestions on how to handle 
situations like this.

> Driver program has to register roaring bitmap classes used by spark with Kryo 
> when number of partitions is greater than 2000
> 
>
> Key: SPARK-5949
> URL: https://issues.apache.org/jira/browse/SPARK-5949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Peter Torok
>Assignee: Imran Rashid
>  Labels: kryo, partitioning, serialization
> Fix For: 1.4.0
>
>
> When more than 2000 partitions are being used with Kryo, the following 
> classes need to be registered by driver program:
> - org.apache.spark.scheduler.HighlyCompressedMapStatus
> - org.roaringbitmap.RoaringBitmap
> - org.roaringbitmap.RoaringArray
> - org.roaringbitmap.ArrayContainer
> - org.roaringbitmap.RoaringArray$Element
> - org.roaringbitmap.RoaringArray$Element[]
> - short[]
> Our project doesn't have dependency on roaring bitmap and 
> HighlyCompressedMapStatus is intended for internal spark usage. Spark should 
> take care of this registration when Kryo is used.






[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000

2015-10-08 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949518#comment-14949518
 ] 

Charles Allen commented on SPARK-5949:
--

This breaks when using more recent versions of Roaring where 
org.roaringbitmap.RoaringArray$Element is no longer present. The following 
stack trace appears:

{code}
A needed class was not found. This could be due to an error in your runpath. 
Missing class: org/roaringbitmap/RoaringArray$Element
java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
at 
org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
at 
org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
at 
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
at 
io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84)
at 
io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
io.druid.indexer.spark.SparkDruidIndexer$.loadData(SparkDruidIndexer.scala:84)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply$mcV$sp(TestSparkDruidIndexer.scala:131)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40)
at 
io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 
org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714)
at 

[jira] [Created] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-08 Thread Charles Allen (JIRA)
Charles Allen created SPARK-11016:
-

 Summary: Spark fails when running with a task that requires a more 
recent version of RoaringBitmaps
 Key: SPARK-11016
 URL: https://issues.apache.org/jira/browse/SPARK-11016
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Charles Allen


The following error appears during Kryo init whenever a more recent version 
(>0.5.0) of Roaring bitmaps is required by a job. 
org/roaringbitmap/RoaringArray$Element was removed in 0.5.0

{code}
A needed class was not found. This could be due to an error in your runpath. 
Missing class: org/roaringbitmap/RoaringArray$Element
java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
at 
org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
at 
org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
at 
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
at 
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
{code}

See https://issues.apache.org/jira/browse/SPARK-5949 for related info


