[jira] [Updated] (SPARK-13704) TaskSchedulerImpl.createTaskSetManager can be expensive, and result in lost executors due to blocked heartbeats

2020-09-29 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13704:

Description: 
In some cases, TaskSchedulerImpl.createTaskSetManager can be expensive. For 
example, in a YARN cluster it may call the topology script for rack awareness. 
When submitting a very large job to a very large YARN cluster, the topology 
script may take significant time to run, and this blocks receiving executors' 
heartbeats, which may result in lost executors.
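The failure mode is a slow call made while holding the scheduler's lock: submitTasks resolves racks under the TaskSchedulerImpl monitor, and executorHeartbeatReceived needs that same monitor. A minimal, self-contained sketch of the pattern (class and method names are stand-ins, not Spark's actual code):

```java
import java.util.concurrent.CountDownLatch;

public class SchedulerLockDemo {
    static class Scheduler {
        final CountDownLatch lockHeld = new CountDownLatch(1);

        // Stands in for TaskSchedulerImpl.submitTasks: expensive work (the
        // rack-resolution script) runs while the scheduler lock is held.
        synchronized void submitTasks() throws InterruptedException {
            lockHeld.countDown();   // the lock is now held
            Thread.sleep(200);      // simulated slow topology script
        }

        // Stands in for executorHeartbeatReceived: needs the same lock.
        synchronized void heartbeat() { }
    }

    // Measures how long a heartbeat is blocked while submitTasks runs.
    static long measureHeartbeatWaitMs() {
        try {
            Scheduler s = new Scheduler();
            Thread submitter = new Thread(() -> {
                try { s.submitTasks(); } catch (InterruptedException ignored) { }
            });
            submitter.start();
            s.lockHeld.await();     // wait until submitTasks holds the lock
            long start = System.nanoTime();
            s.heartbeat();          // blocks until submitTasks releases the lock
            long waited = (System.nanoTime() - start) / 1_000_000;
            submitter.join();
            return waited;
        } catch (InterruptedException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat blocked ~" + measureHeartbeatWaitMs() + " ms");
    }
}
```

With a real topology script taking minutes instead of 200 ms, the same pattern keeps heartbeats blocked long enough for the driver to mark executors lost.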

Stack traces we observed that are related to this issue:
{code}
"dag-scheduler-event-loop" daemon prio=10 tid=0x7f8392875800 nid=0x26e8 
runnable [0x7f83576f4000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:272)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0xf551f460> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
- locked <0xf5529740> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.read1(BufferedReader.java:205)
at java.io.BufferedReader.read(BufferedReader.java:279)
- locked <0xf5529740> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:728)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:524)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
at 
org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
at 
org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
at 
org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
at 
org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
at 
org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:210)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:189)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:189)
at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:158)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:157)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:187)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:161)
- locked <0xea3b8a88> (a 
org.apache.spark.scheduler.cluster.YarnScheduler)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:872)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1362)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

"sparkDriver-akka.actor.default-dispatcher-15" daemon prio=10 
tid=0x7f829c02 nid=0x2737 waiting for monitor entry [0x7f8355ebd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:362)
- waiting to lock <0xea3b8a88> (a 
org.apache.spark.scheduler.cluster.YarnScheduler)
at 

[jira] [Commented] (SPARK-30586) NPE in LiveRDDDistribution (AppStatusListener)

2020-02-13 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036734#comment-17036734
 ] 

Saisai Shao commented on SPARK-30586:
-

We also hit the same issue. It seems the code doesn't null-check the string 
before calling intern on it, which throws an NPE from Guava. My first thought 
is to add a null check in {{weakIntern}}. Still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.
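A possible shape of the proposed guard, sketched with a plain ConcurrentHashMap standing in for Spark's Guava-backed weak interner (the method name comes from the comment above; the pool type here is an assumption for illustration; a real weak interner would let unreferenced entries be garbage-collected):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class InternDemo {
    // Stand-in for the Guava weak interner backing weakIntern.
    static final ConcurrentMap<String, String> POOL = new ConcurrentHashMap<>();

    // Proposed fix: tolerate null instead of letting intern() throw an NPE
    // from Guava's Preconditions.checkNotNull.
    static String weakIntern(String s) {
        if (s == null) return null;          // the missing null check
        String prev = POOL.putIfAbsent(s, s);
        return prev != null ? prev : s;      // always the canonical instance
    }
}
```

Interning still deduplicates: two equal strings come back as the same canonical instance, while a null input now passes through harmlessly.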

> NPE in LiveRDDDistribution (AppStatusListener)
> --
>
> Key: SPARK-30586
> URL: https://issues.apache.org/jira/browse/SPARK-30586
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: A Hadoop cluster consisting of Centos 7.4 machines.
>Reporter: Jan Van den bosch
>Priority: Major
>
> We've been noticing a great amount of NullPointerExceptions in our 
> long-running Spark job driver logs:
> {noformat}
> 20/01/17 23:40:12 ERROR AsyncEventQueue: Listener AppStatusListener threw an 
> exception
> java.lang.NullPointerException
> at 
> org.spark_project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at 
> org.spark_project.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
> at 
> org.spark_project.guava.collect.Interners$WeakInterner.intern(Interners.java:85)
> at 
> org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:603)
> at 
> org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:486)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:548)
> at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:49)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$update(AppStatusListener.scala:991)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$maybeUpdate(AppStatusListener.scala:997)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$flush(AppStatusListener.scala:788)
> at 
> org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:764)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:59)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
> at 
> 

[jira] [Comment Edited] (SPARK-30586) NPE in LiveRDDDistribution (AppStatusListener)

2020-02-13 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036734#comment-17036734
 ] 

Saisai Shao edited comment on SPARK-30586 at 2/14/20 7:13 AM:
--

We also hit the same issue. It seems the code doesn't null-check the string 
before calling intern on it, which throws an NPE from Guava. My first thought 
is to add a null check in {{weakIntern}}. Still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.

CC [~vanzin]


was (Author: jerryshao):
We also hit the same issue. It seems the code doesn't null-check the string 
before calling intern on it, which throws an NPE from Guava. My first thought 
is to add a null check in {{weakIntern}}. Still investigating how this could 
happen; it might be due to a lost or out-of-order Spark listener event.

> NPE in LiveRDDDistribution (AppStatusListener)
> --
>
> Key: SPARK-30586
> URL: https://issues.apache.org/jira/browse/SPARK-30586
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: A Hadoop cluster consisting of Centos 7.4 machines.
>Reporter: Jan Van den bosch
>Priority: Major
>
> We've been noticing a great amount of NullPointerExceptions in our 
> long-running Spark job driver logs:
> {noformat}
> 20/01/17 23:40:12 ERROR AsyncEventQueue: Listener AppStatusListener threw an 
> exception
> java.lang.NullPointerException
> at 
> org.spark_project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at 
> org.spark_project.guava.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
> at 
> org.spark_project.guava.collect.Interners$WeakInterner.intern(Interners.java:85)
> at 
> org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:603)
> at 
> org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:486)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> org.apache.spark.status.LiveRDD$$anonfun$2.apply(LiveEntity.scala:548)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:548)
> at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:49)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$update(AppStatusListener.scala:991)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$maybeUpdate(AppStatusListener.scala:997)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> org.apache.spark.status.AppStatusListener$$anonfun$onExecutorMetricsUpdate$2.apply(AppStatusListener.scala:764)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:139)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:139)
> at 
> org.apache.spark.status.AppStatusListener.org$apache$spark$status$AppStatusListener$$flush(AppStatusListener.scala:788)
> at 
> org.apache.spark.status.AppStatusListener.onExecutorMetricsUpdate(AppStatusListener.scala:764)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:59)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
> at 
> 

[jira] [Resolved] (SPARK-29112) Expose more details when ApplicationMaster reporter faces a fatal exception

2019-09-18 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-29112.
-
Fix Version/s: 3.0.0
 Assignee: Lantao Jin
   Resolution: Fixed

Issue resolved by pull request 25810
https://github.com/apache/spark/pull/25810

> Expose more details when ApplicationMaster reporter faces a fatal exception
> ---
>
> Key: SPARK-29112
> URL: https://issues.apache.org/jira/browse/SPARK-29112
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the {{ApplicationMaster.Reporter}} thread, fatal exception information is 
> swallowed; it's better to expose it.
> A Thrift server was shut down due to a fatal exception, but the log 
> contained no useful information.
> {code}
> 19/09/16 06:59:54,498 INFO [Reporter] yarn.ApplicationMaster:54 : Final app 
> status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from 
> Reporter thread.)
> 19/09/16 06:59:54,500 ERROR [Driver] thriftserver.HiveThriftServer2:91 : 
> Error starting HiveThriftServer2
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:160)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:708)
> {code}
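A hedged sketch of the kind of change such a fix might make (the names below are hypothetical, not the actual patch in PR 25810): catch the Throwable in the reporter loop and put its details into the log and the final-status reason rather than swallowing them.

```java
public class ReporterDemo {
    // Stand-in for one iteration of the ApplicationMaster reporter loop.
    // Before the fix, a fatal Throwable only produced "Exception was thrown
    // 1 time(s) from Reporter thread." with no detail about the cause.
    static String reportOnce(Runnable allocate) {
        try {
            allocate.run();
            return "OK";
        } catch (Throwable t) {
            // Proposed improvement: surface the actual exception class and
            // message in the log and in the final application status reason.
            String reason = "Exception from Reporter thread: "
                    + t.getClass().getName() + ": " + t.getMessage();
            System.err.println(reason);
            return reason;
        }
    }
}
```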



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset size is big

2019-09-16 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-29114:

Priority: Major  (was: Blocker)

> Dataset.coalesce(10) throw ChunkFetchFailureException when original 
> Dataset size is big
> 
>
> Key: SPARK-29114
> URL: https://issues.apache.org/jira/browse/SPARK-29114
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.0
>Reporter: ZhanxiongWang
>Priority: Major
>
> I create a Dataset df with 200 partitions and request 100 executors for my 
> task, each with 1 core; driver memory is 8 GB and executor memory is 16 GB. 
> I call df.cache() before df.coalesce(10). When the {color:#de350b}Dataset 
> size is small{color}, the program works well, but when I 
> {color:#de350b}increase{color} the size of the Dataset, 
> {color:#de350b}df.coalesce(10){color} throws ChunkFetchFailureException.
> 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210
> 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210)
> 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and 
> clearing cache
> 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 
> 1003
> 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as 
> bytes in memory (estimated size 49.4 KB, free 3.8 GB)
> 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 
> 7 ms
> 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in 
> memory (estimated size 154.5 KB, free 3.8 GB)
> 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally
> 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally
> 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created 
> connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps)
> 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block 
> rdd_1005_18, and will not retry (0 retries)
> org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
> fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, 
> writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-2137154997))
>  at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  at java.lang.Thread.run(Thread.java:745)
> 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch 
> failures. Most recent failure cause:
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at 
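The negative writerIndex in the trace above is consistent with a 32-bit index wrapping once a coalesced, cached block grows past Integer.MAX_VALUE (about 2.1 GB). A minimal illustration of the arithmetic (the block size below is chosen to reproduce the logged value, not taken from the report):

```java
public class OverflowDemo {
    // Netty buffer indices are 32-bit ints. Coalescing many cached partitions
    // into one block can push its size past Integer.MAX_VALUE, at which point
    // the int index silently wraps negative.
    static int writerIndexFor(long blockSizeBytes) {
        return (int) blockSizeBytes;        // silent truncation to 32 bits
    }

    public static void main(String[] args) {
        long bigBlock = 2_157_812_299L;     // just over 2 GB
        System.out.println(writerIndexFor(bigBlock)); // prints -2137154997
    }
}
```

2,157,812,299 minus 2^32 is exactly -2,137,154,997, the writerIndex in the log, which is why keeping individual blocks under 2 GB (e.g. coalescing to more partitions) avoids the failure.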

[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927265#comment-16927265
 ] 

Saisai Shao commented on SPARK-29038:
-

[~cltlfcjin] I think we need a SPIP review and vote on the dev mailing list 
before starting the work.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important DBMS mechanism for caching data to 
> accelerate queries. By creating a materialized view through SQL, the data to 
> be cached can be chosen flexibly and configured for specific usage 
> scenarios. The Materialization Manager automatically updates the cached data 
> as the detail source tables change, simplifying the user's work. When a user 
> submits a query, the Spark optimizer rewrites the execution plan based on 
> the available materialized views to determine the optimal plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]






[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927263#comment-16927263
 ] 

Saisai Shao commented on SPARK-29038:
-

IIUC, the key differences between an MV and Spark's built-in {{CACHE}} support 
are: 1. an MV needs to be updated when its source table is updated, which I 
think Spark's current {{CACHE}} cannot do; 2. a classical MV requires 
rewriting the source query against the existing MV, which I think current 
Spark doesn't have. Please correct me if I'm wrong.
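A toy sketch of point 2, query rewriting over an MV (entirely hypothetical, not a Spark API): if a submitted query matches a registered MV definition, the planner substitutes a scan of the materialized table.

```java
import java.util.HashMap;
import java.util.Map;

public class RewriteDemo {
    // Maps a (normalized) defining query to its materialized table's name.
    // A real rewriter would match subplans structurally, not whole strings.
    static final Map<String, String> VIEWS = new HashMap<>();

    static void register(String definingQuery, String mvTable) {
        VIEWS.put(definingQuery, mvTable);
    }

    // If the whole query matches an MV definition, read the MV instead.
    static String rewrite(String query) {
        String mv = VIEWS.get(query);
        return mv != null ? "SELECT * FROM " + mv : query;
    }
}
```

{{CACHE}} gives neither half of this: the cached plan is not refreshed when the source table changes, and unrelated queries never get rewritten to use it.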

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important DBMS mechanism for caching data to 
> accelerate queries. By creating a materialized view through SQL, the data to 
> be cached can be chosen flexibly and configured for specific usage 
> scenarios. The Materialization Manager automatically updates the cached data 
> as the detail source tables change, simplifying the user's work. When a user 
> submits a query, the Spark optimizer rewrites the execution plan based on 
> the available materialized views to determine the optimal plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]






[jira] [Reopened] (SPARK-19147) netty throw NPE

2019-09-04 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reopened SPARK-19147:
-

> netty throw NPE
> ---
>
> Key: SPARK-19147
> URL: https://issues.apache.org/jira/browse/SPARK-19147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> 17/01/10 19:17:20 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
> from bigdata-hdp-apache1828.xg01.diditaxi.com:7337
> java.lang.NullPointerException: group
>   at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:203)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:354)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> 

[jira] [Commented] (SPARK-19147) netty throw NPE

2019-09-04 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923014#comment-16923014
 ] 

Saisai Shao commented on SPARK-19147:
-

Yes, we also hit a similar issue: when an executor is stopping, floods of 
Netty NPEs appear. I'm going to reopen this issue; at the very least we should 
improve the exception message.
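The NPE message "group" comes from Netty's AbstractBootstrap.group(...) rejecting a null EventLoopGroup after the client factory has been torn down during shutdown. A sketch of one way to improve the message (class and field names are stand-ins, not the actual Spark code): fail fast with an explicit state error instead of the opaque NPE.

```java
public class ClientFactoryDemo {
    // Stand-in for TransportClientFactory's Netty EventLoopGroup.
    private Object workerGroup = new Object();

    void close() {
        workerGroup = null;   // shutdown nulls out the group
    }

    // Before: the null group reaches Bootstrap.group(...), which throws
    // NullPointerException("group"). After: a descriptive failure that makes
    // the shutdown race obvious in the logs.
    String createClient() {
        if (workerGroup == null) {
            throw new IllegalStateException(
                "TransportClientFactory is stopped; cannot create new clients");
        }
        return "client";
    }
}
```

Block fetches racing with executor shutdown would then log one clear state error per attempt instead of a flood of identical NPE stack traces.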

> netty throw NPE
> ---
>
> Key: SPARK-19147
> URL: https://issues.apache.org/jira/browse/SPARK-19147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> 17/01/10 19:17:20 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
> from bigdata-hdp-apache1828.xg01.diditaxi.com:7337
> java.lang.NullPointerException: group
>   at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:203)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:354)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException

2019-08-29 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918386#comment-16918386
 ] 

Saisai Shao commented on SPARK-28340:
-

My only concern is that there may be other places that can throw this 
"ClosedByInterruptException" during task killing; it seems hard to identify 
all of them.

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.
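One way to cut the log spam described above is to treat {{ClosedByInterruptException}} as expected when the task has already been killed, emitting one quiet line instead of an ERROR stack trace. A minimal sketch of that idea in plain Java — the helper and its kill flag are hypothetical illustrations, not Spark's actual API:

```java
import java.nio.channels.ClosedByInterruptException;

public class QuietRevert {

    /** A cleanup action that may touch interruptible NIO channels. */
    interface RevertAction {
        void run() throws Exception;
    }

    /**
     * Hypothetical helper: run a revert/cleanup action; if the task is known
     * to have been killed, swallow the expected ClosedByInterruptException
     * and return true. Any other case lets the exception propagate.
     */
    static boolean revertQuietly(boolean taskKilled, RevertAction revert) throws Exception {
        try {
            revert.run();
            return false;                       // cleanup completed normally
        } catch (ClosedByInterruptException e) {
            if (taskKilled) {
                // Expected during task kill: one quiet line, no stack trace.
                System.out.println("suppressed expected " + e.getClass().getSimpleName());
                return true;
            }
            throw e;                            // a real, unexpected failure
        }
    }

    public static void main(String[] args) throws Exception {
        // Killed task: the exception is downgraded to a single log line.
        System.out.println(revertQuietly(true, () -> {
            throw new ClosedByInterruptException();
        }));                                    // true
    }
}
```

With this shape, a method like {{DiskBlockObjectWriter.revertPartialWritesAndClose}} could wrap its truncate call and log one DEBUG-level line per killed task instead of hundreds of ERROR stack traces.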



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException

2019-08-29 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918306#comment-16918306
 ] 

Saisai Shao commented on SPARK-28340:
-

We also saw a bunch of these exceptions in our production environment. It 
looks hard to prevent unless we stop using `interrupt`; maybe we can simply 
skip logging such exceptions.

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle data is spilled to 
disk, yet the executor is still alive.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native 
method {{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}, with very high system CPU usage; here is the 
screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spilling the shuffle data to 
disk, but the executor is still alive.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle data is spilled 
> to disk, yet the executor is still alive.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}, with very high system CPU usage; here is the 
> screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.
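Until the root cause is found, one way to fail fast, as the description suggests, is to bound the merge loop itself: a transferTo-style copy loop normally retries until every byte is moved, so a channel that keeps reporting zero progress spins forever. A sketch of such a guard — the method and the retry threshold are illustrative, not Spark's actual merge code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class FailFastTransfer {

    /**
     * Copy the whole FileChannel with transferTo, but abort with a clear
     * error if the channel repeatedly makes no progress, instead of
     * looping (and hanging the task) forever.
     */
    static long transferFully(FileChannel in, WritableByteChannel out,
                              int maxZeroTransfers) throws IOException {
        final long size = in.size();
        long pos = 0L;
        int zeroRuns = 0;                 // consecutive zero-byte transfers
        while (pos < size) {
            long n = in.transferTo(pos, size - pos, out);
            if (n > 0) {
                pos += n;
                zeroRuns = 0;
            } else if (++zeroRuns > maxZeroTransfers) {
                // Fail fast: surface an explicit task failure that can be
                // retried, rather than a silent multi-hour hang.
                throw new IOException("transferTo made no progress at position "
                    + pos + " of " + size);
            }
        }
        return pos;
    }
}
```

If the thread really is wedged in {{size0}}/{{fstat}}, a guard like this turns the hang into an explicit exception that the scheduler can retry.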






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle data is spilled to 
disk, yet the executor is still alive.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native 
method {{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}, with very high system CPU usage; here is the 
screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spill the shuffle data to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle data is spilled 
> to disk, yet the executor is still alive.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}, with very high system CPU usage; here is the 
> screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle data is spilled to 
disk.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native 
method {{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}, with very high system CPU usage; here is the 
screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle data is spilled 
> to disk.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}, with very high system CPU usage; here is the 
> screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle files are spilled 
to disk.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native 
method {{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}; here is the screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle files are spilled 
> to disk.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}; here is the screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle files are spilled 
to disk.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native 
method {{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}, with very high system CPU usage; here is the 
screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle files are spilled 
> to disk.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}, with very high system CPU usage; here is the 
> screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output after the shuffle files are spilled 
to disk.

 !95330.png! 

And here is the thread dump; we can see that it is calling the native method 
{{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}; here is the screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output after the shuffle files are spilled 
> to disk.
>  !95330.png! 
> And here is the thread dump; we can see that it is calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}; here is the screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Attachment: D18F4.png
95330.png
91ADA.png

> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log shows, there is no log output for several hours after the shuffle files 
> are spilled to disk.
> And here is the thread dump; we can see that it is calling the native 
> method {{size0}}.
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}; here is the screenshot.
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.






[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log below shows, there is no log output for several hours after the shuffle 
files are spilled to disk.

 !95330.png! 

And here is the thread dump; we can see that it is calling the native method 
{{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}; here is the screenshot.

 !D18F4.png! 

We haven't found the root cause; I guess it might be related to a filesystem 
or disk issue. In any case, we should figure out a way to fail fast in such a 
scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
> when calling {{transferTo}}. What we saw is that while merging shuffle temp 
> files, the task hung for several hours until it was killed manually. As the 
> log below shows, there is no log output for several hours after the shuffle 
> files are spilled to disk.
>  !95330.png! 
> And here is the thread dump; we can see that it is calling the native 
> method {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread 
> keeps calling {{fstat}}; here is the screenshot.
>  !D18F4.png! 
> We haven't found the root cause; I guess it might be related to a filesystem 
> or disk issue. In any case, we should figure out a way to fail fast in such 
> a scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)
Saisai Shao created SPARK-28849:
---

 Summary: Spark's UnsafeShuffleWriter may run into infinite loop in 
transferTo occasionally
 Key: SPARK-28849
 URL: https://issues.apache.org/jira/browse/SPARK-28849
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Saisai Shao


Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop 
when calling {{transferTo}}. What we saw is that, while merging shuffle temp 
files, the task hung for several hours until it was killed manually. As the 
log shows, there is no output for several hours after the shuffle files are 
spilled to disk.

And here is the thread dump; we can see that it is calling the native method 
{{size0}}.

We also used strace to trace the system calls and found that this thread 
keeps calling {{fstat}}; here is a screenshot.

We didn't find the root cause; it might be related to an FS or disk issue. 
In any case, we should figure out a way to fail fast in such a scenario.






[jira] [Assigned] (SPARK-28475) Add regex MetricFilter to GraphiteSink

2019-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-28475:
---

Assignee: Nick Karpov

> Add regex MetricFilter to GraphiteSink
> --
>
> Key: SPARK-28475
> URL: https://issues.apache.org/jira/browse/SPARK-28475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Nick Karpov
>Assignee: Nick Karpov
>Priority: Major
> Fix For: 3.0.0
>
>
> Today, all registered metric sources are reported to the GraphiteSink with 
> no filtering mechanism, although the codahale project does support one.
> GraphiteReporter (a ScheduledReporter) from the codahale project requires 
> you to implement and supply the MetricFilter interface (by default the 
> codahale project ships only a single implementation, MetricFilter.ALL).
> We propose adding a regex config to match and filter the metrics reported 
> to the GraphiteSink.
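The proposed filter could look roughly like the following. This is a dependency-free sketch: `RegexMetricFilter` and its `matches`/`filter` methods are hypothetical stand-ins that mirror what a codahale `MetricFilter` implementation passed to GraphiteReporter would do.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexMetricFilter {
    // Keep only metrics whose name matches a user-supplied regex; anything
    // else is dropped before reporting. The regex would come from a sink
    // config key in the real proposal.
    private final Pattern pattern;

    public RegexMetricFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Mirrors MetricFilter.matches(name, metric) from codahale, minus the
    // Metric argument, which this sketch does not need.
    public boolean matches(String metricName) {
        return pattern.matcher(metricName).find();
    }

    // Apply the filter to a registry snapshot (name -> metric value).
    public <V> Map<String, V> filter(Map<String, V> metrics) {
        return metrics.entrySet().stream()
                .filter(e -> matches(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, TreeMap::new));
    }
}
```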






[jira] [Updated] (SPARK-28475) Add regex MetricFilter to GraphiteSink

2019-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28475:

Priority: Minor  (was: Major)

> Add regex MetricFilter to GraphiteSink
> --
>
> Key: SPARK-28475
> URL: https://issues.apache.org/jira/browse/SPARK-28475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Nick Karpov
>Assignee: Nick Karpov
>Priority: Minor
> Fix For: 3.0.0
>
>
> Today, all registered metric sources are reported to the GraphiteSink with 
> no filtering mechanism, although the codahale project does support one.
> GraphiteReporter (a ScheduledReporter) from the codahale project requires 
> you to implement and supply the MetricFilter interface (by default the 
> codahale project ships only a single implementation, MetricFilter.ALL).
> We propose adding a regex config to match and filter the metrics reported 
> to the GraphiteSink.






[jira] [Resolved] (SPARK-28475) Add regex MetricFilter to GraphiteSink

2019-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-28475.
-
   Resolution: Resolved
Fix Version/s: 3.0.0

> Add regex MetricFilter to GraphiteSink
> --
>
> Key: SPARK-28475
> URL: https://issues.apache.org/jira/browse/SPARK-28475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Nick Karpov
>Priority: Major
> Fix For: 3.0.0
>
>
> Today, all registered metric sources are reported to the GraphiteSink with 
> no filtering mechanism, although the codahale project does support one.
> GraphiteReporter (a ScheduledReporter) from the codahale project requires 
> you to implement and supply the MetricFilter interface (by default the 
> codahale project ships only a single implementation, MetricFilter.ALL).
> We propose adding a regex config to match and filter the metrics reported 
> to the GraphiteSink.






[jira] [Commented] (SPARK-28475) Add regex MetricFilter to GraphiteSink

2019-08-02 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898749#comment-16898749
 ] 

Saisai Shao commented on SPARK-28475:
-

This is resolved via https://github.com/apache/spark/pull/25232

> Add regex MetricFilter to GraphiteSink
> --
>
> Key: SPARK-28475
> URL: https://issues.apache.org/jira/browse/SPARK-28475
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Nick Karpov
>Priority: Major
>
> Today, all registered metric sources are reported to the GraphiteSink with 
> no filtering mechanism, although the codahale project does support one.
> GraphiteReporter (a ScheduledReporter) from the codahale project requires 
> you to implement and supply the MetricFilter interface (by default the 
> codahale project ships only a single implementation, MetricFilter.ALL).
> We propose adding a regex config to match and filter the metrics reported 
> to the GraphiteSink.






[jira] [Assigned] (SPARK-28106) Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path ,and cause Task Failed

2019-07-16 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-28106:
---

Assignee: angerszhu

> Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path 
> ,and cause Task Failed
> 
>
> Key: SPARK-28106
> URL: https://issues.apache.org/jira/browse/SPARK-28106
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Attachments: image-2019-06-19-21-23-22-061.png, 
> image-2019-06-20-11-49-13-691.png, image-2019-06-20-11-50-36-418.png, 
> image-2019-06-20-11-51-06-889.png
>
>
> When we use the Spark SQL "add jar" command with a wrong HDFS path, such as 
> "add jar hdfs:///home/hadoop/test/test.jar", and execute it:
>  * In the Hive case, HiveClientImpl calls add jar; when runSqlHive() is 
> called it raises an error, but execution continues and calls 
> SparkContext.addJar, which does not validate the path when the scheme is 
> HDFS. When other SQL runs later, the TaskDescription carries the jar paths 
> registered in the SparkContext, including the wrong one, which causes tasks 
> to fail.
>  * In the non-Hive case the same thing happens: only local paths are 
> checked, not HDFS paths.
>  
> {code:java}
> 19/06/19 19:55:12 INFO SessionState: converting to local 
> hdfs://home/hadoop/aaa.jar
> Failed to read external resource hdfs://home/hadoop/aaa.jar
> 19/06/19 19:55:12 ERROR SessionState: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> java.lang.RuntimeException: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:825)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:983)
> at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
> at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:233)
> at 
> 
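A fail-fast fix could validate a remote path before registering it in the SparkContext. The sketch below is hypothetical and not the actual patch; `JarPathValidator` is an assumed name, and `remoteExists` stands in for a Hadoop `FileSystem.exists()` call against HDFS:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class JarPathValidator {
    // Jars registered so far; in Spark these end up in the TaskDescription,
    // so a bogus entry poisons every later task.
    private final List<String> registeredJars = new ArrayList<>();
    // Stand-in for FileSystem.exists(new Path(uri)) on the remote FS.
    private final Predicate<URI> remoteExists;

    public JarPathValidator(Predicate<URI> remoteExists) {
        this.remoteExists = remoteExists;
    }

    // Returns false (fail fast) instead of recording a jar that tasks
    // could never fetch.
    public boolean addJar(String path) {
        URI uri = URI.create(path);
        if ("hdfs".equals(uri.getScheme()) && !remoteExists.test(uri)) {
            return false;
        }
        registeredJars.add(path);
        return true;
    }

    public List<String> jars() {
        return registeredJars;
    }
}
```

The key design point is that the check happens at registration time, where the error is attributable to the "add jar" statement, rather than at task launch time, where it surfaces as an opaque task failure.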

[jira] [Resolved] (SPARK-28106) Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path ,and cause Task Failed

2019-07-16 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-28106.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24909
[https://github.com/apache/spark/pull/24909]

> Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path 
> ,and cause Task Failed
> 
>
> Key: SPARK-28106
> URL: https://issues.apache.org/jira/browse/SPARK-28106
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: image-2019-06-19-21-23-22-061.png, 
> image-2019-06-20-11-49-13-691.png, image-2019-06-20-11-50-36-418.png, 
> image-2019-06-20-11-51-06-889.png
>
>
> When we use the Spark SQL "add jar" command with a wrong HDFS path, such as 
> "add jar hdfs:///home/hadoop/test/test.jar", and execute it:
>  * In the Hive case, HiveClientImpl calls add jar; when runSqlHive() is 
> called it raises an error, but execution continues and calls 
> SparkContext.addJar, which does not validate the path when the scheme is 
> HDFS. When other SQL runs later, the TaskDescription carries the jar paths 
> registered in the SparkContext, including the wrong one, which causes tasks 
> to fail.
>  * In the non-Hive case the same thing happens: only local paths are 
> checked, not HDFS paths.
>  
> {code:java}
> 19/06/19 19:55:12 INFO SessionState: converting to local 
> hdfs://home/hadoop/aaa.jar
> Failed to read external resource hdfs://home/hadoop/aaa.jar
> 19/06/19 19:55:12 ERROR SessionState: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> java.lang.RuntimeException: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:825)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:983)
> at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
> at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at 
> 

[jira] [Updated] (SPARK-28106) Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path ,and cause Task Failed

2019-07-16 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28106:

Component/s: Spark Core

> Spark SQL add jar with wrong hdfs path, SparkContext still add it to jar path 
> ,and cause Task Failed
> 
>
> Key: SPARK-28106
> URL: https://issues.apache.org/jira/browse/SPARK-28106
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: image-2019-06-19-21-23-22-061.png, 
> image-2019-06-20-11-49-13-691.png, image-2019-06-20-11-50-36-418.png, 
> image-2019-06-20-11-51-06-889.png
>
>
> When we use the Spark SQL "add jar" command with a wrong HDFS path, such as 
> "add jar hdfs:///home/hadoop/test/test.jar", and execute it:
>  * In the Hive case, HiveClientImpl calls add jar; when runSqlHive() is 
> called it raises an error, but execution continues and calls 
> SparkContext.addJar, which does not validate the path when the scheme is 
> HDFS. When other SQL runs later, the TaskDescription carries the jar paths 
> registered in the SparkContext, including the wrong one, which causes tasks 
> to fail.
>  * In the non-Hive case the same thing happens: only local paths are 
> checked, not HDFS paths.
>  
> {code:java}
> 19/06/19 19:55:12 INFO SessionState: converting to local 
> hdfs://home/hadoop/aaa.jar
> Failed to read external resource hdfs://home/hadoop/aaa.jar
> 19/06/19 19:55:12 ERROR SessionState: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> java.lang.RuntimeException: Failed to read external resource 
> hdfs://home/hadoop/aaa.jar
> at 
> org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1288)
> at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
> at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:866)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:835)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:825)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:983)
> at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:112)
> at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomJobTag(SQLExecution.scala:119)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:79)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:143)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:195)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:233)
> at 
> 

[jira] [Assigned] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite

2019-07-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-28202:
---

Assignee: ShuMing Li

> [Core] [Test] Avoid noises of system props in SparkConfSuite
> 
>
> Key: SPARK-28202
> URL: https://issues.apache.org/jira/browse/SPARK-28202
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ShuMing Li
>Assignee: ShuMing Li
>Priority: Trivial
> Fix For: 3.0.0
>
>
> When the SPARK_HOME environment variable is set and its directory contains a 
> specific `spark-defaults.conf`, the 
> `org.apache.spark.util.loadDefaultSparkProperties` method may pollute the 
> system properties, so running the `core/test` module can fail in 
> `SparkConfSuite`.
>  
> It is easy to fix by setting `loadDefaults` in `SparkConf` to false.
> ```
> [info] - accumulators (5 seconds, 565 milliseconds)
> [info] - deprecated configs *** FAILED *** (79 milliseconds)
> [info] 7 did not equal 4 (SparkConfSuite.scala:266)
> [info] org.scalatest.exceptions.TestFailedException:
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
> [info] at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info] at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info] at 
> org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
> [info] at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info] at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> ```
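The fix pattern can be illustrated without Spark itself. In the sketch below, `IsolatedConf` is a hypothetical stand-in for `SparkConf`: with `loadDefaults = false` it ignores `spark.*` system properties, which is what makes the test deterministic regardless of the surrounding environment.

```java
import java.util.HashMap;
import java.util.Map;

public class IsolatedConf {
    // Stand-in for the SparkConf behavior at issue: when loadDefaults is
    // true, the configuration absorbs spark.* JVM system properties (and,
    // in Spark, anything loaded from spark-defaults.conf), so tests that
    // assert on exact contents should pass loadDefaults = false.
    private final Map<String, String> settings = new HashMap<>();

    public IsolatedConf(boolean loadDefaults) {
        if (loadDefaults) {
            System.getProperties().stringPropertyNames().stream()
                    .filter(k -> k.startsWith("spark."))
                    .forEach(k -> settings.put(k, System.getProperty(k)));
        }
    }

    public IsolatedConf set(String key, String value) {
        settings.put(key, value);
        return this;
    }

    public int size() {
        return settings.size();
    }
}
```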






[jira] [Resolved] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite

2019-07-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-28202.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24998
[https://github.com/apache/spark/pull/24998]

> [Core] [Test] Avoid noises of system props in SparkConfSuite
> 
>
> Key: SPARK-28202
> URL: https://issues.apache.org/jira/browse/SPARK-28202
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ShuMing Li
>Priority: Trivial
> Fix For: 3.0.0
>
>
> When the SPARK_HOME environment variable is set and its directory contains a 
> specific `spark-defaults.conf`, the 
> `org.apache.spark.util.loadDefaultSparkProperties` method may pollute the 
> system properties, so running the `core/test` module can fail in 
> `SparkConfSuite`.
>  
> It is easy to fix by setting `loadDefaults` in `SparkConf` to false.
> ```
> [info] - accumulators (5 seconds, 565 milliseconds)
> [info] - deprecated configs *** FAILED *** (79 milliseconds)
> [info] 7 did not equal 4 (SparkConfSuite.scala:266)
> [info] org.scalatest.exceptions.TestFailedException:
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
> [info] at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info] at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info] at 
> org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
> [info] at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info] at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> ```






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-07-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876070#comment-16876070
 ] 

Saisai Shao commented on SPARK-25299:
-

It would be better to post a PDF version, [~mcheah] :).

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-28 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874703#comment-16874703
 ] 

Saisai Shao commented on SPARK-25299:
-

The vote has passed, so what is our plan for code submission? [~yifeih] [~mcheah]

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
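The remote-storage direction above could be sketched as a pluggable block store. This is a hypothetical interface for illustration only, not any of the designs in the linked document; all names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: shuffle blocks written to a pluggable remote store instead of
// executor-local disk, so blocks survive executor loss. All names here are
// illustrative; this is not an API from the linked design document.
public class RemoteShuffleSketch {
    interface ShuffleBlockStore {
        void put(String blockId, byte[] data);   // e.g. backed by HDFS or S3
        byte[] get(String blockId);              // served without the writer executor
    }

    /** In-memory stand-in for a remote object store. */
    static class InMemoryStore implements ShuffleBlockStore {
        private final Map<String, byte[]> blocks = new HashMap<>();
        public void put(String blockId, byte[] data) { blocks.put(blockId, data); }
        public byte[] get(String blockId) { return blocks.get(blockId); }
    }

    public static void main(String[] args) {
        ShuffleBlockStore store = new InMemoryStore();
        // A map task on executor A writes a block...
        store.put("shuffle_0_1_2", new byte[]{1, 2, 3});
        // ...executor A dies; a reduce task can still fetch the block.
        System.out.println(store.get("shuffle_0_1_2").length); // prints 3
    }
}
```

The design choice being illustrated: once blocks live outside the executor, neither the external shuffle service nor the executor's lifetime gates shuffle reads.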



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-13 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25299:

Labels: SPIP  (was: )

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Created] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2019-06-10 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-27996:
-------------------------------------

 Summary: Spark UI redirect will be failed behind the https reverse 
proxy
 Key: SPARK-27996
 URL: https://issues.apache.org/jira/browse/SPARK-27996
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.3
Reporter: Saisai Shao


When the Spark live/history UI is proxied behind a reverse proxy, the redirect 
will return the wrong scheme, for example:

If the reverse proxy is SSL-enabled, the request from the client to the reverse 
proxy is an HTTPS request, whereas if Spark's UI is not SSL-enabled, the request 
from the reverse proxy to the Spark UI is an HTTP request. Spark itself treats 
all requests as HTTP requests, so the redirect URL starts with "http", and the 
redirect fails on the client side.

Most reverse proxies add an additional header, "X-Forwarded-Proto", to tell the 
backend server that the client request is an HTTPS request, so Spark should 
leverage this header to return the correct URL.
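A minimal sketch of the header-based fix described above. This is a hypothetical helper, not Spark's actual implementation; the method and class names are invented:

```java
// Sketch: pick the redirect scheme from X-Forwarded-Proto when present.
// ProxyRedirect is a hypothetical helper, not Spark's actual code.
public class ProxyRedirect {
    /**
     * Build a redirect URL, preferring the scheme the client used
     * (as reported by the reverse proxy) over the backend's own scheme.
     */
    public static String redirectUrl(String forwardedProto, String backendScheme,
                                     String host, String path) {
        String scheme = (forwardedProto != null && !forwardedProto.isEmpty())
                ? forwardedProto.toLowerCase()
                : backendScheme;
        return scheme + "://" + host + path;
    }

    public static void main(String[] args) {
        // Client spoke HTTPS to the proxy; the backend saw plain HTTP.
        System.out.println(redirectUrl("https", "http", "spark.example.com", "/jobs/"));
        // No proxy header: fall back to the backend's own scheme.
        System.out.println(redirectUrl(null, "http", "spark.example.com", "/jobs/"));
    }
}
```

In a real deployment the header must only be trusted when the request actually came from the proxy, since clients can set it themselves.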






[jira] [Commented] (SPARK-15348) Hive ACID

2019-05-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844750#comment-16844750
 ] 

Saisai Shao commented on SPARK-15348:
-------------------------------------

No, it doesn't support Hive ACID; it has its own mechanism to support ACID.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'






[jira] [Commented] (SPARK-15348) Hive ACID

2019-05-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844597#comment-16844597
 ] 

Saisai Shao commented on SPARK-15348:
-------------------------------------

I think the Delta Lake project is exactly what you want.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'






[jira] [Comment Edited] (SPARK-15348) Hive ACID

2019-05-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844597#comment-16844597
 ] 

Saisai Shao edited comment on SPARK-15348 at 5/21/19 7:38 AM:
--------------------------------------------------------------

I think the Delta Lake project is exactly what you want. It was recently 
announced at the Spark + AI Summit.


was (Author: jerryshao):
I think delta lake project is exactly what you want.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2019-01-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750872#comment-16750872
 ] 

Saisai Shao commented on SPARK-24615:
-------------------------------------

I'm really sorry about the delay. Due to some changes on my side, I didn't have 
enough time to work on this before. But I talked to Xiangrui offline recently, 
and we will continue to work on this and finalize it in 3.0. 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that a CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]
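The resource-matching idea in the description can be sketched as follows. The names are illustrative only and do not come from the SPIP or Spark's scheduler:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of accelerator-aware executor selection: a task is only offered to
// executors that have enough free GPUs, instead of being matched on CPU
// cores alone. Names are illustrative, not Spark's.
public class GpuAwareScheduling {
    static List<String> eligibleExecutors(Map<String, Integer> freeGpusByExecutor,
                                          int gpusRequired) {
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freeGpusByExecutor.entrySet()) {
            if (e.getValue() >= gpusRequired) {
                eligible.add(e.getKey());
            }
        }
        Collections.sort(eligible); // deterministic order for the example
        return eligible;
    }

    public static void main(String[] args) {
        Map<String, Integer> free = Map.of(
                "exec-1", 0,  // CPU-only node: many cores, no GPUs
                "exec-2", 2,
                "exec-3", 1);
        System.out.println(eligibleExecutors(free, 2)); // prints [exec-2]
    }
}
```

This captures point 1 of the description: matching on CPU-core slots would have offered the task to all three executors, including the GPU-less one.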






[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10

2019-01-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16734072#comment-16734072
 ] 

Saisai Shao commented on SPARK-26512:
-------------------------------------

This seems like a Netty version problem; netty-3.9.9.Final.jar is unrelated. I 
was wondering if we could put the Spark classpath in front of the Hadoop 
classpath; maybe that would work. There is such a configuration for the 
driver/executor, but I'm not sure if there's a similar one for the AM only.

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10
> --
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it's 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN 
> as the cluster manager, it's not working. When I try to submit a Spark job 
> using spark-submit for testing, it shows up under the ACCEPTED tab in the YARN 
> UI, and after that it fails.






[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2019-01-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733740#comment-16733740
 ] 

Saisai Shao commented on SPARK-26512:
-------------------------------------

Please list the problems you saw, along with any logs or exceptions. We can't 
tell anything from the above information.

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Minor
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it's 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with YARN 
> as the cluster manager, it's not working. When I try to submit a Spark job 
> using spark-submit for testing, it shows up under the ACCEPTED tab in the YARN 
> UI, and after that it fails.






[jira] [Updated] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2019-01-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-26457:

Priority: Minor  (was: Major)

> Show hadoop configurations in HistoryServer environment tab
> ---
>
> Key: SPARK-26457
> URL: https://issues.apache.org/jira/browse/SPARK-26457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
> Environment: Maybe it is good to show some configurations in 
> HistoryServer environment tab for debugging some bugs about hadoop
>Reporter: deshanxiao
>Priority: Minor
>







[jira] [Resolved] (SPARK-26516) zeppelin with spark on mesos: environment variable setting

2019-01-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-26516.
-------------------------------------
Resolution: Invalid

> zeppelin with spark on mesos: environment variable setting
> --
>
> Key: SPARK-26516
> URL: https://issues.apache.org/jira/browse/SPARK-26516
> Project: Spark
>  Issue Type: Question
>  Components: Mesos, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yui Hirasawa
>Priority: Major
>
> I am trying to use Zeppelin with Spark on Mesos mode, following [Apache 
> Zeppelin on Spark Cluster 
> Mode|https://zeppelin.apache.org/docs/0.8.0/setup/deployment/spark_cluster_mode.html#4-configure-spark-interpreter-in-zeppelin-1].
> In the instructions, we should set these environment variables:
> {code:java}
> export MASTER=mesos://127.0.1.1:5050
> export MESOS_NATIVE_JAVA_LIBRARY=[PATH OF libmesos.so]
> export SPARK_HOME=[PATH OF SPARK HOME]
> {code}
> As far as I know, these environment variables are used by Zeppelin, so they 
> should be set on the localhost rather than in the Docker container (if I am 
> wrong, please correct me).
> But Mesos and Spark are running inside a Docker container, so do we need to set 
> these environment variables so that they point to the paths inside the 
> Docker container? If so, how should one achieve that?
> Thanks in advance.






[jira] [Updated] (SPARK-26513) Trigger GC on executor node idle

2019-01-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-26513:

Fix Version/s: (was: 3.0.0)

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
>
>  
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage has some executors, where a few 
> executors finish a couple of tasks first and then wait for the stage's 
> remaining tasks, executed by different executor nodes in the cluster, to 
> finish. A stage is only completed when all tasks in the current stage 
> finish execution, and the next stage's execution has to wait until all tasks 
> of the current stage are completed. 
>  
> Why don't we trigger GC when the executor node is waiting for the remaining 
> tasks to finish, i.e. when the executor is idle? The executor has to wait for 
> the remaining tasks to finish anyway, which can take at least a couple of 
> seconds, while a GC takes at most <300ms.
>  
> I have proposed a small code snippet which triggers GC when the set of running 
> tasks is empty and heap usage on the current executor node is more than a given 
> threshold.
> This could improve performance for long-running Spark jobs. 
> We referred to this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] 
> and found performance improvements in our long-running Spark batch jobs.
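A minimal sketch of the idle-time GC trigger described above. This is a hypothetical helper written for illustration; the actual proposed snippet is not shown in this thread:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: request a GC only when no tasks are running and heap usage
// exceeds a threshold. IdleGc is a hypothetical helper, not the PR's code.
public class IdleGc {
    /** Returns true if a GC was requested. */
    static boolean maybeGc(int runningTasks, double usageThreshold) {
        if (runningTasks > 0) {
            return false; // never pause active tasks with a GC
        }
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        double usage = (double) heap.getUsed() / heap.getMax();
        if (usage > usageThreshold) {
            System.gc(); // advisory only; the JVM may ignore it
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // With running tasks, we never trigger GC regardless of heap usage.
        System.out.println(maybeGc(4, 0.0)); // prints false
        // Idle executor: GC fires only if usage exceeds the threshold.
        System.out.println(maybeGc(0, 0.99));
    }
}
```

Note that `System.gc()` is only a hint, and `heap.getMax()` can be -1 when the maximum is undefined, so a production version would need to guard both cases.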






[jira] [Commented] (SPARK-26516) zeppelin with spark on mesos: environment variable setting

2019-01-02 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732587#comment-16732587
 ] 

Saisai Shao commented on SPARK-26516:
-------------------------------------

Questions should go to the user@spark mailing list. Also, if this is a problem 
with Zeppelin, it would be better to ask on the Zeppelin mailing list.

> zeppelin with spark on mesos: environment variable setting
> --
>
> Key: SPARK-26516
> URL: https://issues.apache.org/jira/browse/SPARK-26516
> Project: Spark
>  Issue Type: Question
>  Components: Mesos, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yui Hirasawa
>Priority: Major
>
> I am trying to use Zeppelin with Spark on Mesos mode, following [Apache 
> Zeppelin on Spark Cluster 
> Mode|https://zeppelin.apache.org/docs/0.8.0/setup/deployment/spark_cluster_mode.html#4-configure-spark-interpreter-in-zeppelin-1].
> In the instructions, we should set these environment variables:
> {code:java}
> export MASTER=mesos://127.0.1.1:5050
> export MESOS_NATIVE_JAVA_LIBRARY=[PATH OF libmesos.so]
> export SPARK_HOME=[PATH OF SPARK HOME]
> {code}
> As far as I know, these environment variables are used by Zeppelin, so they 
> should be set on the localhost rather than in the Docker container (if I am 
> wrong, please correct me).
> But Mesos and Spark are running inside a Docker container, so do we need to set 
> these environment variables so that they point to the paths inside the 
> Docker container? If so, how should one achieve that?
> Thanks in advance.






[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2018-12-27 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730084#comment-16730084
 ] 

Saisai Shao commented on SPARK-20415:
-------------------------------------

Have you tried the latest version of Spark? Does this problem still exist 
there? Also, is there a way to reproduce this problem easily?

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>
> We are in the POC phase with Spark. One of the steps is reading compressed JSON 
> files that come from sources, "exploding" them into tabular format, and then 
> writing them to HDFS. This worked for about three weeks until a few days ago, 
> when, for a particular dataset, the writer just started hanging. I logged in to 
> the worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
> NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 
> (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has 
> no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block 

[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-26 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729357#comment-16729357
 ] 

Saisai Shao commented on SPARK-25299:
-------------------------------------

[~jealous] Can we have a doc about this proposed solution for us to review?

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Updated] (SPARK-25501) Kafka delegation token support

2018-09-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25501:

Labels: SPIP  (was: )

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
>
> Delegation token support was released in Kafka version 1.1. As Spark updated 
> its Kafka client to 2.0.0, it's now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka






[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-09-25 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25206:

Labels: Parquet correctness known_issue  (was: Parquet correctness)

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness, known_issue
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which fixes this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that in Spark 2.1, the user will get an exception for 
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user will get the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
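The resolution the root cause calls for can be sketched as matching the metastore column name against the physical Parquet field names case-insensitively, then pushing down the physical name. This is a hypothetical helper for illustration, not Spark's actual implementation:

```java
import java.util.List;
import java.util.Optional;

// Sketch: case-insensitive resolution of a metastore column name against the
// physical Parquet field names, mirroring the fix direction described above.
// Hypothetical helper; not Spark's actual code.
public class CaseInsensitiveResolve {
    static Optional<String> resolve(String metastoreName, List<String> parquetFields,
                                    boolean caseSensitive) {
        for (String field : parquetFields) {
            boolean match = caseSensitive
                    ? field.equals(metastoreName)
                    : field.equalsIgnoreCase(metastoreName);
            if (match) {
                return Optional.of(field); // push down the *physical* name
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Metastore says "ID"; the Parquet file actually contains "id".
        System.out.println(resolve("ID", List.of("id"), false)); // prints Optional[id]
        System.out.println(resolve("ID", List.of("id"), true));  // prints Optional.empty
    }
}
```

Pushing down the resolved physical name ("id") instead of the metastore name ("ID") is what makes the filter match rows in the case-sensitive Parquet reader.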






[jira] [Assigned] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-09-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24615:
-----------------------------------

Assignee: (was: Saisai Shao)

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors equipped 
> with accelerators.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]
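The scheduling constraint described above can be sketched as follows. This is a hypothetical simplification, not Spark's actual scheduler code: the class, method, and field names are illustrative. The point is that an accelerator-required task needs both a free CPU core and a free accelerator slot on the same executor, so CPU availability alone is not a sufficient scheduling signal.

```java
// Hypothetical sketch of the heterogeneous-task constraint described above;
// names and structure are illustrative, not Spark's actual scheduler API.
public class AcceleratorScheduling {
    // An executor offer: how many CPU cores and accelerator slots are free.
    static final class Offer {
        final int freeCores;
        final int freeAccelerators;
        Offer(int freeCores, int freeAccelerators) {
            this.freeCores = freeCores;
            this.freeAccelerators = freeAccelerators;
        }
    }

    // A CPU-only scheduler would accept any offer with a free core; an
    // accelerator-aware one must also check accelerator slots.
    static boolean canSchedule(Offer offer, int reqCores, int reqAccelerators) {
        return offer.freeCores >= reqCores
            && offer.freeAccelerators >= reqAccelerators;
    }

    public static void main(String[] args) {
        Offer cpuOnlyNode = new Offer(32, 0);  // many cores, no accelerators
        Offer gpuNode = new Offer(8, 2);
        // Mismatch #1 from the description: plenty of free CPU cores does
        // not imply an accelerator-required task can run there.
        System.out.println(canSchedule(cpuOnlyNode, 1, 1)); // false
        System.out.println(canSchedule(gpuNode, 1, 1));     // true
    }
}
```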






[jira] [Assigned] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-31 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-25183:
---

Assignee: Steve Loughran

> Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; 
> race conditions can arise
> --
>
> Key: SPARK-25183
> URL: https://issues.apache.org/jira/browse/SPARK-25183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark's HiveServer2 registers a shutdown hook with the JVM via 
> {{Runtime.addShutdownHook()}}, which can run in parallel with the 
> ShutdownHookManager sequence of Spark and Hadoop, which runs the shutdowns 
> in an ordered sequence.
> This has some risks:
> * FS shutdown before the rename of logs completes, SPARK-6933
> * Delays in renames on object stores may block the FS close operation, 
> which, on clusters with shutdown-hook timeouts (HADOOP-12950), can force a 
> kill of the FileSystem.closeAll() shutdown hook, among other problems.
> General outcome: logs aren't present.
> Proposed fix: 
> * register the hook with {{org.apache.spark.util.ShutdownHookManager}}
> * HADOOP-15679 to make the shutdown wait time configurable, so O(data) 
> renames don't trigger timeouts.
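The ordering problem can be illustrated with a minimal sketch of a priority-ordered hook manager, the idea behind {{org.apache.spark.util.ShutdownHookManager}}. The class below and its priority numbers are illustrative, not Spark's code: all hooks run inside a single ordered sequence, highest priority first, whereas a hook registered directly via {{Runtime.addShutdownHook()}} runs concurrently with no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

public class OrderedShutdown {
    // Hooks keyed by priority; iteration order is highest priority first.
    private static final ConcurrentSkipListMap<Integer, Runnable> hooks =
            new ConcurrentSkipListMap<>(Comparator.reverseOrder());

    static void addHook(int priority, Runnable hook) {
        hooks.put(priority, hook);
    }

    // Runs every registered hook in priority order and returns the order,
    // mimicking the single ordered JVM hook a shutdown manager installs.
    static List<Integer> runAll() {
        List<Integer> order = new ArrayList<>();
        hooks.forEach((priority, hook) -> {
            hook.run();
            order.add(priority);
        });
        return order;
    }

    public static void main(String[] args) {
        // Illustrative priorities: the event-log rename must finish before
        // the filesystems are closed, so it gets the higher priority.
        addHook(25, () -> System.out.println("close FileSystems"));
        addHook(50, () -> System.out.println("rename event logs"));
        System.out.println(runAll()); // [50, 25]: rename runs before FS close
    }
}
```

A hook added through `Runtime.addShutdownHook()` directly would bypass this ordering entirely, which is exactly the race the issue describes.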






[jira] [Resolved] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise

2018-08-31 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-25183.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22186
[https://github.com/apache/spark/pull/22186]

> Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; 
> race conditions can arise
> --
>
> Key: SPARK-25183
> URL: https://issues.apache.org/jira/browse/SPARK-25183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.4.0
>
>
> Spark's HiveServer2 registers a shutdown hook with the JVM via 
> {{Runtime.addShutdownHook()}}, which can run in parallel with the 
> ShutdownHookManager sequence of Spark and Hadoop, which runs the shutdowns 
> in an ordered sequence.
> This has some risks:
> * FS shutdown before the rename of logs completes, SPARK-6933
> * Delays in renames on object stores may block the FS close operation, 
> which, on clusters with shutdown-hook timeouts (HADOOP-12950), can force a 
> kill of the FileSystem.closeAll() shutdown hook, among other problems.
> General outcome: logs aren't present.
> Proposed fix: 
> * register the hook with {{org.apache.spark.util.ShutdownHookManager}}
> * HADOOP-15679 to make the shutdown wait time configurable, so O(data) 
> renames don't trigger timeouts.






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598115#comment-16598115
 ] 

Saisai Shao commented on SPARK-25206:
-

I see. If it is not going to be merged, let's close this JIRA and add it to 
the release notes.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has 
> {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive 
> metastore schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when 
> spark.sql.caseSensitive is set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, users get an exception for the 
> same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, users silently get wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
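The essence of the fix can be sketched as follows. This is a hypothetical helper, not Spark's actual code: it resolves the Hive metastore column name against the case-sensitive Parquet schema and pushes down the Parquet-side name, refusing to resolve when the case-insensitive match is ambiguous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class CaseInsensitiveResolver {
    // Resolve a Hive metastore column name (e.g. "ID") against the
    // case-sensitive Parquet field names (e.g. "id"). Returns the
    // Parquet-side name to push down, or empty when zero or multiple
    // fields match ignoring case (ambiguous schemas cannot be resolved).
    static Optional<String> resolve(String hiveName, List<String> parquetFields) {
        List<String> matches = new ArrayList<>();
        for (String field : parquetFields) {
            if (field.equalsIgnoreCase(hiveName)) {
                matches.add(field);
            }
        }
        return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The description's scenario: the metastore says "ID", Parquet has
        // "id"; the pushed-down filter must reference "id", not "ID".
        System.out.println(resolve("ID", List.of("id", "name"))); // Optional[id]
        System.out.println(resolve("ID", List.of("id", "ID")));   // Optional.empty
    }
}
```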






[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598111#comment-16598111
 ] 

Saisai Shao commented on SPARK-25135:
-

What's the ETA on this issue, [~yumwang]?

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: Parquet, correctness
>
> This happens on Parquet.
> How to reproduce in Parquet:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is the behavior on ORC:
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598110#comment-16598110
 ] 

Saisai Shao commented on SPARK-25206:
-

What is the status of this JIRA? Are we going to backport it, or just mark it 
as a known issue?

[~yucai] [~cloud_fan] [~smilegator]

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In Spark 2.3.1, the query below silently returns wrong data.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, we found two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has 
> {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive 
> metastore schema to do the pushdown, which resolves this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even when 
> spark.sql.caseSensitive is set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, users get an exception for the 
> same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, users silently get wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-24785) Making sure REPL prints Spark UI info and then Welcome message

2018-08-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24785:

Fix Version/s: 2.4.0

> Making sure REPL prints Spark UI info and then Welcome message
> --
>
> Key: SPARK-24785
> URL: https://issues.apache.org/jira/browse/SPARK-24785
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> After [SPARK-24418], the welcome message is printed first, and then the 
> Scala prompt is shown before the Spark UI info is printed, as follows.
> {code:java}
>  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>       /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> Spark context Web UI available at http://192.168.1.169:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1528180279528).
> Spark session available as 'spark'.
> scala> 
> {code}
> Although it's a minor issue, visually it doesn't look as nice as the 
> existing behavior. 






[jira] [Updated] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-08-26 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25221:

Target Version/s:   (was: 2.3.2, 2.4.0)

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Priority: Major
>
> Using a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character is only possible when it is specified via 
> {{--conf}}. Our pipeline consists of highly customized, generated jobs. 
> Storing all the config in a properties file is not only better for 
> readability but even necessary to avoid dealing with {{ARGS_MAX}} on 
> different OSes. Spark should uniformly avoid trimming conf values in both 
> cases. 
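The trimming hazard can be demonstrated with {{java.util.Properties}} alone. This is a minimal sketch under assumptions: the key name is taken from the description, and the loader behavior shown is generic Java, not Spark's exact config code path. {{Properties}} preserves trailing whitespace in values, so any loader that additionally calls {{trim()}} silently alters a whitespace-bearing delimiter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ConfTrim {
    // Load one property from a properties-format string, returning the raw value.
    static String load(String content, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // A record delimiter of "|" followed by a tab; "\t" is a
        // java.util.Properties escape and survives loading intact.
        String raw = load(
            "spark.hadoop.textinputformat.record.delimiter=|\\t\n",
            "spark.hadoop.textinputformat.record.delimiter");
        System.out.println(raw.length());        // 2: the trailing tab is preserved
        System.out.println(raw.trim().length()); // 1: trim() destroys the delimiter
    }
}
```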



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-08-26 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593124#comment-16593124
 ] 

Saisai Shao commented on SPARK-25221:
-

I'm going to remove the target version; I don't think it is a critical/blocker 
issue. Committers will set the proper fix version when it is merged.

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Priority: Major
>
> Using a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character is only possible when it is specified via 
> {{--conf}}. Our pipeline consists of highly customized, generated jobs. 
> Storing all the config in a properties file is not only better for 
> readability but even necessary to avoid dealing with {{ARGS_MAX}} on 
> different OSes. Spark should uniformly avoid trimming conf values in both 
> cases. 






[jira] [Commented] (SPARK-25131) Event logs missing applicationAttemptId for SparkListenerApplicationStart

2018-08-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587062#comment-16587062
 ] 

Saisai Shao commented on SPARK-25131:
-

This is by design, not a bug.

> Event logs missing applicationAttemptId for SparkListenerApplicationStart
> -
>
> Key: SPARK-25131
> URL: https://issues.apache.org/jira/browse/SPARK-25131
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.1
>Reporter: Ajith S
>Priority: Minor
>
> When *master=yarn* and *deploy-mode=client*, event logs do not contain the 
> applicationAttemptId for SparkListenerApplicationStart. This is caused at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend#start, where we 
> do bindToYarn(client.submitApplication(), None), which sets appAttemptId to 
> None. We can, however, get the appAttemptId after waitForApplication() and 
> set it.






[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-25144:

Target Version/s: 2.2.3, 2.3.2

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is 
> not called.






[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585482#comment-16585482
 ] 

Saisai Shao commented on SPARK-25144:
-

Does this exist in the master branch?

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is 
> not called.






[jira] [Commented] (SPARK-23679) uiWebUrl show inproper URL when running on YARN

2018-08-20 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585473#comment-16585473
 ] 

Saisai Shao commented on SPARK-23679:
-

Let me fix this issue. This is mainly an issue introduced by RM HA; Spark on 
YARN's current AmIpFilter cannot work well in an RM HA scenario.

> uiWebUrl show inproper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Priority: Major
>
> uiWebUrl returns a local URL.
> Using it causes HTTP ERROR 500:
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the YARN proxy address instead.






[jira] [Updated] (SPARK-23679) uiWebUrl show inproper URL when running on YARN

2018-08-20 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23679:

Component/s: YARN

> uiWebUrl show inproper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Priority: Major
>
> uiWebUrl returns a local URL.
> Using it causes HTTP ERROR 500:
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the YARN proxy address instead.
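A sketch of the suggested direction: instead of returning the driver's local address, build the externally reachable UI URL from the YARN proxy base and the application ID. This is a minimal illustration with hypothetical names, not Spark's actual logic (which sits behind the AmIpFilter shown in the trace):

```java
public class ProxyUiUrl {
    // Hypothetical helper: derive the externally reachable UI URL from the
    // YARN proxy base (e.g. the ResourceManager web address) and the app ID.
    static String proxyUiUrl(String proxyBase, String appId) {
        String base = proxyBase.endsWith("/")
                ? proxyBase.substring(0, proxyBase.length() - 1)
                : proxyBase;
        return base + "/proxy/" + appId;
    }

    public static void main(String[] args) {
        // prints http://rm-host:8088/proxy/application_1234_0001
        System.out.println(proxyUiUrl("http://rm-host:8088/", "application_1234_0001"));
    }
}
```

A URL of this shape is what the YARN web proxy serves for a running application, so clients following it are routed correctly instead of hitting the driver's private host.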






[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585321#comment-16585321
 ] 

Saisai Shao commented on SPARK-25114:
-

What's the ETA for this issue, [~jiangxb1987]?

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the long values are separated by 
> Int.MaxValue.
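The failure mode described above is the classic pitfall of narrowing a long difference to int inside a comparator; a minimal stand-alone illustration (the values are hypothetical, and this is not Spark's actual RecordBinaryComparator code):

```java
public class NarrowingComparatorDemo {
    // Buggy pattern: comparing two longs by truncating their difference to int.
    // When (a - b) is a non-zero multiple of 2^32, the low 32 bits are all zero,
    // so two unequal values compare as equal.
    static int badCompare(long a, long b) {
        return (int) (a - b);
    }

    // Safe pattern: compare without doing arithmetic on the operands.
    static int goodCompare(long a, long b) {
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        long a = (1L << 32) + 5;  // hypothetical record word
        long b = 5;
        System.out.println(badCompare(a, b));   // prints 0: unequal values look equal
        System.out.println(goodCompare(a, b));  // prints 1: correct ordering
    }
}
```

The same narrowing can also flip the sign of the result (e.g. a difference of 2^31 truncates to Integer.MIN_VALUE), which is why `Long.compare` is the idiomatic fix.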






[jira] [Commented] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2018-08-12 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16577846#comment-16577846
 ] 

Saisai Shao commented on SPARK-25020:
-

Yes, we also hit the same issue. This is more of a Hadoop problem than a Spark 
problem. Would you please also report this issue to the Hadoop community?

CC [~ste...@apache.org], I think I discussed this same issue with you before; 
this is something that should be fixed on the Hadoop side.

> Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
> --
>
> Key: SPARK-25020
> URL: https://issues.apache.org/jira/browse/SPARK-25020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: Spark Streaming
> -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful 
> shutdown) 
> Hadoop 2.8
>Reporter: Ricky Saltzer
>Priority: Major
>
> Opening this up to give you some insight into an issue that will occur 
> when using Spark Streaming with Hadoop 2.8. 
> Hadoop 2.8 added HADOOP-12950, which imposes an upper limit of 10 seconds on 
> its shutdown hook. From our tests, if the Spark job takes longer than 10 
> seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
> over the graceful shutdown and throw an exception like
> {code:java}
> 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
> at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
> at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> The reason I hit this issue is that we recently upgraded to EMR 5.15, 
> which has both Spark 2.3 & Hadoop 2.8. The following workaround has proven 
> successful for us (in limited testing).
> Instead of just running
> {code:java}
> ...
> ssc.start()
> ssc.awaitTermination(){code}
> We needed to do the following
> {code:java}
> ...
> ssc.start()
> sys.ShutdownHookThread {
>   ssc.stop(true, true)
> }
> ssc.awaitTermination(){code}
> As far as I can tell, there is no way to override the default {{10 second}} 
> timeout in HADOOP-12950, which is why we had to go with the workaround. 
> Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
> 2.2.x & Hadoop 2.8. 
> Ricky
>  Epic Games
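The workaround above works because an application-level shutdown hook registered via `sys.ShutdownHookThread` drives the graceful stop itself rather than relying on a framework hook with a hard timeout. The same pattern in plain Java, with a hypothetical `stopGracefully()` standing in for `ssc.stop(true, true)`:

```java
public class GracefulStopHook {
    // Hypothetical stand-in for ssc.stop(true, true): stop ingesting new data,
    // then finish processing buffered work before the JVM exits.
    static volatile boolean stopped = false;

    static void stopGracefully() {
        // ... drain receivers, wait for in-flight batches to complete ...
        stopped = true;
    }

    public static void main(String[] args) {
        // Register our own JVM shutdown hook so the graceful stop is owned by
        // the application rather than a time-limited framework hook.
        Runtime.getRuntime().addShutdownHook(new Thread(GracefulStopHook::stopGracefully));
        // ... the equivalent of ssc.start() / ssc.awaitTermination() goes here ...
    }
}
```

Note that JVM shutdown hooks run concurrently and unordered, so this pattern reduces the window for the race rather than strictly guaranteeing ordering against Hadoop's hook.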





[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575778#comment-16575778
 ] 

Saisai Shao commented on SPARK-25084:
-

I see. Unfortunately I've already cut RC4; if this is worth including in 2.3.2, 
I will cut a new RC.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575774#comment-16575774
 ] 

Saisai Shao commented on SPARK-25084:
-

Is this a regression, or just a bug that existed in older versions?

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575772#comment-16575772
 ] 

Saisai Shao commented on SPARK-25084:
-

I'm already preparing the new RC4. If this is not a severe issue, I would 
rather not block the RC4 release.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  






[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Fix Version/s: 2.2.3

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
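A sketch of why a permission-bits-only check is too naive (a hypothetical helper, not the SHS's actual code): it reproduces only the POSIX user/group/other decision, so any access granted via ACLs or filesystem-specific superuser policies is invisible to it. The robust alternative is simply to attempt the read and handle the access error.

```java
import java.util.Set;

public class NaiveReadCheck {
    // Naive check: decides readability from the POSIX mode bits alone.
    // Misses ACL grants and filesystems that treat some users as superusers.
    static boolean naiveCanRead(String owner, String group, int mode,
                                String user, Set<String> userGroups) {
        if (user.equals(owner))         return (mode & 0400) != 0;  // owner read bit
        if (userGroups.contains(group)) return (mode & 0040) != 0;  // group read bit
        return (mode & 0004) != 0;                                  // other read bit
    }

    public static void main(String[] args) {
        // Log owned by "alice", mode 0640; "spark" is neither owner nor in group.
        // An ACL could still grant spark read access, but this check says no.
        System.out.println(naiveCanRead("alice", "hadoop", 0640,
                                        "spark", Set.of("spark")));  // prints false
    }
}
```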






[jira] [Assigned] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24948:
---

Assignee: Marco Gaido

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.






[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Fix Version/s: 2.3.2

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.






[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency

2018-08-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572542#comment-16572542
 ] 

Saisai Shao commented on SPARK-22634:
-

[~srowen] I'm wondering if it is possible to upgrade to version 1.6.0, as this 
version fixed two CVEs (https://www.bouncycastle.org/latest_releases.html).

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version, as the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions ship with 
> 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.






[jira] [Commented] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570995#comment-16570995
 ] 

Saisai Shao commented on SPARK-25030:
-

I would like to know more about this issue.

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable 
> to fetch the result; the issue is very easy to reproduce.






[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Priority: Blocker  (was: Major)

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.






[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Target Version/s: 2.2.3, 2.3.2, 2.4.0

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562
 ] 

Saisai Shao commented on SPARK-24615:
-

Leveraging dynamic allocation to tear down and bring up the desired executors 
is a non-goal here; we will address it in the future. Currently we're still 
focusing on static allocation like spark.executor.gpus.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators:
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564562#comment-16564562
 ] 

Saisai Shao edited comment on SPARK-24615 at 8/1/18 12:35 AM:
--

Leveraging dynamic allocation to tear down and bring up the desired executors 
is a non-goal here; we will address it in the future. Currently we're still 
focusing on static allocation like spark.executor.gpus.


was (Author: jerryshao):
Leveraging dynamic allocation to tear down and bring up desired executor is a 
non-goal here, we will address it in feature, currently we're still focusing on 
using static allocation like spark.executor.gpus.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators:
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730
 ] 

Saisai Shao edited comment on SPARK-24615 at 7/31/18 2:16 PM:
--

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> 
transforming could possibly run in one stage; using eval will break it into 
several stages, and I'm not sure that is a good choice. Also, if we use eval to 
break the lineage, how do we store the intermediate data: like shuffle, in 
memory, or on disk?

Yes, where to break the boundaries is hard for users to know, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve the conflicts, failing might be one choice. In the SQL or DF 
area, I don't think we have to expose such low-level RDD APIs to users; maybe 
some hints should be enough (though I haven't thought it through).

Currently in my design, withResources only applies to the stage in which the 
RDD will be executed; the following stages will still be ordinary stages 
without additional resources.


was (Author: jerryshao):
Hi [~tgraves], I think eval() might unnecessarily break the lineage which could 
execute in one stage, for example: data transforming -> training -> 
transforming, this could possibly run in one stage, using eval will break into 
several stages, I'm not sure if it is the good choice. Also if we use eval to 
break the lineage, how do we store the intermediate data, like shuffle, or in 
memory/ on disk?

Yes, how to break the boundaries is hard for user to know, but currently I 
cannot figure out a good solution, unless we use eval() to explicitly separate 
them. To solve the conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which RDD 
will be executed, the following stages will still be ordinary stages without 
additional resouces.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators:
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that a CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563730#comment-16563730
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> 
transforming could possibly run in one stage; using eval will break it into 
several stages, and I'm not sure that is a good choice. Also, if we use eval to 
break the lineage, how do we store the intermediate data: like shuffle, in 
memory, or on disk?

Yes, where to break the boundaries is hard for users to know, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve the conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which the 
RDD will be executed; the following stages will still be ordinary stages 
without additional resources.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors equipped with 
> accelerators.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on a node, so using CPU 
> cores to schedule accelerator-required tasks will introduce a mismatch.
>  # In a cluster, we always assume that every node has CPUs, but the same is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Resolved] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-31 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24975.
-
Resolution: Duplicate

> Spark history server REST API /api/v1/version returns error 404
> ---
>
> Key: SPARK-24975
> URL: https://issues.apache.org/jira/browse/SPARK-24975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: shanyu zhao
>Priority: Major
>
> The Spark history server REST API provides /api/v1/version, according to the docs:
> [https://spark.apache.org/docs/latest/monitoring.html]
> However, for Spark 2.3, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> Error 404 Not Found
> HTTP ERROR 404
> Problem accessing /api/v1/version. Reason:
>     Not Found
> Powered by Jetty:// 9.3.z-SNAPSHOT
> {code}
> On a Spark 2.2 cluster, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> {
> "spark" : "2.2.0"
> }{code}
>  






[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24957:

Target Version/s: 2.4.0

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> times {{10^14}}).
> When codegen is disabled, all results are correct. 
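A factor of exactly {{10^14}} is what a decimal-scale bookkeeping error would produce: a decimal is stored as an unscaled integer plus a scale, and interpreting the unscaled value with a scale 14 smaller than intended inflates the result by 10^14. A small illustration of the arithmetic using Python's decimal module (the unscaled value is hypothetical; this shows the arithmetic only, not Spark's codegen internals):

```python
from decimal import Decimal

# A decimal is stored as an unscaled integer plus a scale:
# value = unscaled * 10**(-scale).
unscaled = 119485714285714300000000  # hypothetical unscaled representation

correct = Decimal(unscaled).scaleb(-22)  # interpreted with scale 22
wrong = Decimal(unscaled).scaleb(-8)     # interpreted with scale 8

# Dropping 14 digits of scale inflates the value by exactly 10**14.
assert wrong == correct * Decimal(10) ** 14
```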






[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561334#comment-16561334
 ] 

Saisai Shao commented on SPARK-24957:
-

It's not necessary to mark this as a blocker; we still have plenty of time 
before the 2.4 release. I will set the target version to 2.4.0.

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct result 
> times {{10^14}}).
> When codegen is disabled, all results are correct. 






[jira] [Updated] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24932:

Target Version/s:   (was: 2.3.2)

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation 
> streaming queries. This should also be applied to streaming joins to keep 
> the semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using the update _output_ mode, the join will work exactly as in 
> _append_ mode. However, this will, for example, allow a user to run an 
> aggregation-after-join query in update mode in order to get a more real-time 
> result output.






[jira] [Commented] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561331#comment-16561331
 ] 

Saisai Shao commented on SPARK-24932:
-

I'm going to remove the target version for this JIRA. Committers will set the 
fix version properly when merged.

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we supported update output mode for non-aggregation 
> streaming queries. This should also be applied to streaming joins to keep 
> the semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using the update _output_ mode, the join will work exactly as in 
> _append_ mode. However, this will, for example, allow a user to run an 
> aggregation-after-join query in update mode in order to get a more real-time 
> result output.






[jira] [Updated] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24964:

Target Version/s:   (was: 2.3.2, 2.4.0, 3.0.0, 2.3.3)

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to the MITRE Common Vulnerabilities & 
> Exposures (CVE) database to perform a lookup for each dependent .jar and list 
> any/all known vulnerabilities for each jar. This step is needed because a 
> manual MITRE CVE lookup/check on the main component does not include checking 
> for vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for 
> most Java build/make types (ant, maven, ivy, gradle). Also, add the 
> appropriate command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third-party libraries/dependencies that get pulled 
> in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561327#comment-16561327
 ] 

Saisai Shao commented on SPARK-24964:
-

I'm going to remove the set target version; usually we don't set this for a 
feature. Committers will set a fix version when the change is merged.

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to the MITRE Common Vulnerabilities & 
> Exposures (CVE) database to perform a lookup for each dependent .jar and list 
> any/all known vulnerabilities for each jar. This step is needed because a 
> manual MITRE CVE lookup/check on the main component does not include checking 
> for vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for 
> most Java build/make types (ant, maven, ivy, gradle). Also, add the 
> appropriate command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third-party libraries/dependencies that get pulled 
> in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-25 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556760#comment-16556760
 ] 

Saisai Shao commented on SPARK-24867:
-

I see, thanks! Please let me know when the JIRA is opened.

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.2
>
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> The cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555080#comment-16555080
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves] [~irashid] thanks a lot for your comments.

Currently in my design I don't insert a specific stage boundary for different 
resources; the stage boundary is still the same (by shuffle or by result). So 
{{withResources}} is not an eval()-like action which triggers a stage. Instead, 
it just adds a resource hint to the RDD.

This means RDDs with different resource requirements in one stage may 
conflict. For example, in {{rdd1.withResources.mapPartitions \{ xxx 
\}.withResources.mapPartitions \{ xxx \}.collect}}, the resources of rdd1 may 
differ from those of the mapped RDD, so the options I can currently think of 
are:

1. Always pick the latter, with a warning log saying that multiple different 
resource requirements in one stage are illegal.
2. Fail the stage, with a warning log saying that multiple different resource 
requirements in one stage are illegal.
3. Merge conflicts with the maximum resource needs. For example, if rdd1 
requires 3 GPUs per task and rdd2 requires 4 GPUs per task, the merged 
requirement would be 4 GPUs per task. (This is the high-level description; the 
details will be per-partition merging.) [chosen]
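Option 3 above (merge by taking the per-resource maximum) can be sketched as follows; this is a hypothetical illustration of the merging rule, not Spark code:

```python
def merge_resources(*requirements):
    """Merge per-task resource requirements by taking the maximum of
    each resource across all RDDs in the stage (option 3 above)."""
    merged = {}
    for req in requirements:
        for resource, amount in req.items():
            merged[resource] = max(merged.get(resource, 0), amount)
    return merged

# rdd1 needs 3 GPUs per task, rdd2 needs 4: the merged stage asks for 4.
assert merge_resources({"gpu": 3}, {"gpu": 4}) == {"gpu": 4}
# A resource only one RDD mentions is kept as-is.
assert merge_resources({"gpu": 3, "fpga": 1}, {"gpu": 4}) == {"gpu": 4, "fpga": 1}
```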

Take join for example, where rdd1 and rdd2 may have different resource 
requirements, and the joined RDD will potentially have other resource 
requirements.

For example:

{code}
val rddA = rdd.mapPartitions().withResources
val rddB = rdd.mapPartitions().withResources
val rddC = rddA.join(rddB).withResources
rddC.collect()
{code}

Here these 3 {{withResources}} may have different requirements. Since {{rddC}} 
runs in a different stage, there's no need to merge its resource conflicts. But 
{{rddA}} and {{rddB}} run in the same stage with different tasks (partitions). 
So the merging strategy I'm thinking of is partition-based: tasks running 
partitions from {{rddA}} will use the resources specified by {{rddA}}, and 
likewise for {{rddB}}.





> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors equipped with 
> accelerators.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on a node, so using CPU 
> cores to schedule accelerator-required tasks will introduce a mismatch.
>  # In a cluster, we always assume that every node has CPUs, but the same is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554986#comment-16554986
 ] 

Saisai Shao commented on SPARK-24867:
-

[~smilegator] what's the ETA of this issue?

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> The cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2.






[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24297:
---

Assignee: Imran Rashid

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request which does not use stream-to-disk that is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.
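The 2 GB ceiling comes from the JVM: byte arrays and ByteBuffers are indexed by a signed 32-bit int, so a block fetched entirely into memory cannot exceed 2^31 - 1 bytes. A sketch of the resulting decision (the threshold value below is a hypothetical example, not Spark's actual default):

```python
INT_MAX = 2**31 - 1  # largest size of a JVM byte[] / ByteBuffer

def fetch_strategy(block_size, max_fetch_to_mem):
    """Mirror of the decision the config controls: blocks at or above the
    threshold are streamed to disk instead of buffered in memory."""
    return "memory" if block_size < max_fetch_to_mem else "disk"

# With a threshold below 2 GB, an oversized block is safely streamed to disk.
threshold = INT_MAX - 512  # hypothetical threshold just under 2 GB

assert fetch_strategy(3 * 1024**3, threshold) == "disk"    # 3 GiB block
assert fetch_strategy(64 * 1024**2, threshold) == "memory" # 64 MiB block
```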






[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24297.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21474
[https://github.com/apache/spark/pull/21474]

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request which does not use stream-to-disk that is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-23 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553679#comment-16553679
 ] 

Saisai Shao commented on SPARK-24615:
-

[~Tagar] yes!

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors equipped with 
> accelerators.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on a node, so using CPU 
> cores to schedule accelerator-required tasks will introduce a mismatch.
>  # In a cluster, we always assume that every node has CPUs, but the same is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Resolved] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24594.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21635
[https://github.com/apache/spark/pull/21635]

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 it came up that we should introduce metrics for YARN 
> executor allocation problems.






[jira] [Assigned] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24594:
---

Assignee: Attila Zsolt Piros

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 the idea came up to introduce metrics for YARN executor 
> allocation problems.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-22 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552233#comment-16552233
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], what you mentioned above is also something we have been 
thinking about and trying to solve (this problem also exists in barrier 
execution).

From the user's point of view, specifying resources through the RDD is the 
only feasible way I can currently think of, though resources are bound to a 
stage/task, not a particular RDD. This means a user could specify resources 
for different RDDs in a single stage, while Spark can only apply one resource 
setting within that stage. This brings out several problems, as you mentioned:

*Which RDD the resources are specified on*

For example, {{rddA.withResource.mapPartition \{ xxx \}.collect()}} is no 
different from {{rddA.mapPartition \{ xxx \}.withResource.collect()}}, since 
all the RDDs are executed in the same stage. So in the current design, no 
matter whether the resource is specified on {{rddA}} or the mapped RDD, the 
result is the same.

*One-to-one dependent RDDs with different resources*

For example {{rddA.withResource.mapPartition \{ xxx \}.withResource.collect()}}: 
assuming the resource requests for {{rddA}} and the mapped RDD differ, since 
they run in a single stage we need to resolve the conflict.

*RDDs with multiple dependencies and different resources*

For example:

{code}
val rddA = rdd.withResources.mapPartitions()
val rddB = rdd.withResources.mapPartitions()
val rddC = rddA.join(rddB)
{code}

If the resources for {{rddA}} differ from those for {{rddB}}, we also need to 
resolve the conflict.

Previously I proposed using the largest resource requirement to satisfy all 
the needs, but that can waste resources; [~mengxr] suggested setting/merging 
resources per partition to avoid the waste. Meanwhile, if there were an API 
to set resources at the stage level, this problem would not exist, but Spark 
only exposes the RDD level to users. I'm still thinking of a good way to fix 
it.
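The "largest resource requirement" proposal above can be sketched as a per-resource maximum across all RDD requests in one stage. This is a hedged illustration in plain Python; `merge_stage_resources` is a hypothetical helper invented here, not a Spark API.

```python
# Sketch of "largest requirement wins" conflict resolution: when several RDDs
# in one stage declare resource requests, merge them by taking the
# per-resource maximum. Illustrative only, not a Spark API.

def merge_stage_resources(requests):
    """Merge per-RDD resource requests (dicts like {"gpu": 1}) for one stage."""
    merged = {}
    for req in requests:
        for resource, amount in req.items():
            merged[resource] = max(merged.get(resource, 0), amount)
    return merged

# rddA asks for 2 GPUs; rddB asks for 1 GPU and 1 FPGA.
# The stage request is the max of each: {'gpu': 2, 'fpga': 1}.
print(merge_stage_resources([{"gpu": 2}, {"gpu": 1, "fpga": 1}]))
```

The waste [~mengxr] pointed out is visible here: if only one RDD's tasks actually use the FPGA, every task in the merged stage still reserves one.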
 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors where 
> accelerators are installed.
> Currently Spark's scheduler places tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # A node usually has more CPU cores than accelerators, so scheduling 
> accelerator-required tasks by CPU cores introduces a mismatch.
>  # Within a cluster we can always assume each node has CPUs, but the same 
> is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to place tasks in a smarter way.
> So this proposes improving the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the 
> implementation details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550266#comment-16550266
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm still not sure how to handle memory per stage. Unlike MR, 
Spark shares the task runtime in a single JVM, and I'm not sure how to control 
memory usage within the JVM. Are you suggesting an approach similar to the GPU 
one: when the memory requirement cannot be satisfied, release the current 
executors and request new executors via dynamic resource allocation?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors where 
> accelerators are installed.
> Currently Spark's scheduler places tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # A node usually has more CPU cores than accelerators, so scheduling 
> accelerator-required tasks by CPU cores introduces a mismatch.
>  # Within a cluster we can always assume each node has CPUs, but the same 
> is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to place tasks in a smarter way.
> So this proposes improving the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the 
> implementation details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I don't think YARN has a feature to configure password-less SSH 
on all containers. YARN itself doesn't rely on SSH, and in our deployment 
(Ambari) we don't use password-less SSH.
{quote}And does container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?
{quote}
If the container is not dockerized, it shares the system's sshd, and it is the 
system's responsibility to start/terminate that daemon.

If the container is dockerized, I think the Docker container should be 
responsible for starting sshd (IIUC).

Maybe we should check whether sshd is running before starting an MPI job; if 
it isn't, we simply cannot run the MPI job, no matter who is responsible for 
the sshd daemon.
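The pre-flight check suggested above could be as simple as probing the SSH port on each host before launching the MPI job. This is a hedged sketch in plain Python; `sshd_reachable` and the host list are hypothetical, and a TCP probe only shows that something is listening on port 22, not that key-based login will succeed.

```python
# Hedged sketch: before launching an MPI job, verify that each host is
# accepting connections on the SSH port. Illustrative only.
import socket

def sshd_reachable(host, port=22, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

hosts = ["localhost"]  # hypothetical container host list
if all(sshd_reachable(h) for h in hosts):
    print("all hosts ready, MPI launch can proceed")
else:
    print("sshd unreachable on some host; MPI job cannot run")
```

This keeps the failure mode explicit regardless of who owns the sshd daemon: the job fails fast with a clear message instead of hanging inside the MPI launcher.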

[~leftnoteasy] might have some thoughts, since he is the originator of 
mpich2-yarn.

 

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier






[jira] [Assigned] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24195:
---

Assignee: Li Yuanjian

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Li Yuanjian
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config
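The mangling above can be reproduced in any path-based API: treating the whole `local:` URI as a filesystem path folds the scheme into the path, while a URI parse keeps them separate. This is a hedged sketch in Python standard library terms, mirroring (not reproducing) what `java.io.File.getCanonicalFile` does; the exact prefix of the mangled result depends on the working directory.

```python
# Sketch of why naive filesystem canonicalization breaks "local:" URIs:
# a path API treats the whole string as a relative path, while a URI parse
# preserves the scheme. Illustrative only.
import os
from urllib.parse import urlparse

raw = "local:///home/user/demo/logger.config"

# Path-style handling folds the scheme into the path, as java.io.File does:
# the result is <cwd>/local:/home/user/demo/logger.config.
mangled = os.path.abspath(raw)
print(mangled)

# Scheme-aware handling keeps "local" and the real path separate.
uri = urlparse(raw)
print(uri.scheme, uri.path)  # local /home/user/demo/logger.config
```

The fix direction is the same in Spark: inspect the URI scheme first and only fall back to `File`-based canonicalization for scheme-less paths.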






[jira] [Resolved] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24195.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21533
[https://github.com/apache/spark/pull/21533]

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config






[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550167#comment-16550167
 ] 

Saisai Shao commented on SPARK-24307:
-

Issue resolved by pull request 21440
[https://github.com/apache/spark/pull/21440]

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB 
> in 
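The idea behind the proposed fix, exposing one logical buffer as a sequence of chunks that each stay under the encoder's limit, can be sketched in a few lines. This is a hedged illustration in plain Python with a tiny limit standing in for netty's 2GB `ByteBuf` ceiling; `chunk` is a hypothetical helper, not Spark's `ChunkedByteBuffer` API.

```python
# Sketch of the chunking idea: split one logical buffer into pieces no larger
# than a fixed limit, so no single piece exceeds what the transport encoder
# can address. A tiny limit is used for demonstration. Illustrative only.

def chunk(buf, limit):
    """Split a bytes-like object into pieces of at most `limit` bytes."""
    return [buf[i:i + limit] for i in range(0, len(buf), limit)]

data = bytes(10)  # stand-in for a >2GB in-memory block
pieces = chunk(data, 4)
print([len(p) for p in pieces])  # [4, 4, 2]
assert b"".join(pieces) == data  # chunking preserves the content
```

A FileRegion-style wrapper would then stream these pieces in order, so the writer index never overflows a signed 32-bit range as in the stack trace above.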

[jira] [Resolved] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24307.
-
Resolution: Fixed
  Assignee: Imran Rashid

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached 

[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24307:

Fix Version/s: 2.4.0

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached in memory as 
> deserialized values would need to have the 
