[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550266#comment-16550266
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm still not sure how to handle memory per stage. Unlike MR, 
Spark runs its tasks inside a shared JVM, and I'm not sure how to control 
per-stage memory usage within that JVM. Are you suggesting an approach similar 
to the GPU one: when the memory requirement cannot be satisfied, release the 
current executors and request new executors via dynamic resource allocation?
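
For reference, a minimal sketch of the dynamic-allocation settings this refers 
to (these are standard, pre-existing Spark confs, also visible in SPARK-24523 
below; the per-stage resource check itself is hypothetical and not shown):
{code:java}
// Minimal sketch, assuming standard dynamic-allocation confs; releasing and
// re-requesting executors when a stage's resource requirement cannot be met
// would build on this mechanism rather than on per-stage memory control in the JVM.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "20")
{code}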

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that every node has CPUs, but this is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smarter way.
> So this proposes to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]
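
For illustration only, a hedged sketch of what task-level accelerator requests 
could look like; the conf keys and the {{TaskContext.resources()}} accessor 
below are assumptions (they mirror the shape of later Spark releases) and are 
not necessarily the design in the attached doc:
{code:java}
// Hypothetical sketch -- conf names and the resources() accessor are illustrative,
// not the SPIP's committed API.
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("accelerator-aware-sketch")
  .config("spark.executor.resource.gpu.amount", "1")   // executors advertise accelerators
  .config("spark.task.resource.gpu.amount", "1")       // tasks declare what they need
  .getOrCreate()

spark.sparkContext.parallelize(1 to 100, 4).mapPartitions { iter =>
  // an accelerator-required task would look up the addresses the scheduler assigned to it
  val gpuAddresses = TaskContext.get().resources()("gpu").addresses
  iter.map(i => (gpuAddresses.head, i))
}.collect()
{code}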






[jira] [Resolved] (SPARK-24861) create corrected temp directories in RateSourceSuite

2018-07-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24861.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21817
[https://github.com/apache/spark/pull/21817]

> create corrected temp directories in RateSourceSuite
> 
>
> Key: SPARK-24861
> URL: https://issues.apache.org/jira/browse/SPARK-24861
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Commented] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550218#comment-16550218
 ] 

Apache Spark commented on SPARK-24871:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21824

> Refactor Concat and MapConcat to avoid creating concatenator object for each 
> row.
> -
>
> Key: SPARK-24871
> URL: https://issues.apache.org/jira/browse/SPARK-24871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Refactor {{Concat}} and {{MapConcat}} to:
>  - avoid creating concatenator object for each row.
>  - make {{Concat}} handle {{containsNull}} properly.
>  - make {{Concat}} shortcut if {{null}} child is found.
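
An illustrative sketch of the pattern described above (hoisting the 
concatenator out of the per-row path); the names here are made up and this is 
not Spark's actual {{Concat}} code:
{code:java}
// Illustrative sketch only -- the general pattern of creating a helper object once
// and reusing it across rows, instead of allocating one per row.
class ArrayConcatenator(initialCapacity: Int) {
  private val buffer = new scala.collection.mutable.ArrayBuffer[Int](initialCapacity)
  def concat(arrays: Seq[Array[Int]]): Array[Int] = {
    buffer.clear()                       // reuse the same buffer for every row
    arrays.foreach(buffer ++= _)
    buffer.toArray
  }
}

// Before: a new concatenator is allocated for every row.
def evalPerRow(rows: Seq[Seq[Array[Int]]]): Seq[Array[Int]] =
  rows.map(row => new ArrayConcatenator(16).concat(row))

// After: one concatenator is created up front and reused across rows.
def evalReused(rows: Seq[Seq[Array[Int]]]): Seq[Array[Int]] = {
  val concatenator = new ArrayConcatenator(16)
  rows.map(concatenator.concat)
}
{code}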






[jira] [Assigned] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24871:


Assignee: Apache Spark

> Refactor Concat and MapConcat to avoid creating concatenator object for each 
> row.
> -
>
> Key: SPARK-24871
> URL: https://issues.apache.org/jira/browse/SPARK-24871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Refactor {{Concat}} and {{MapConcat}} to:
>  - avoid creating concatenator object for each row.
>  - make {{Concat}} handle {{containsNull}} properly.
>  - make {{Concat}} shortcut if {{null}} child is found.






[jira] [Assigned] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24871:


Assignee: (was: Apache Spark)

> Refactor Concat and MapConcat to avoid creating concatenator object for each 
> row.
> -
>
> Key: SPARK-24871
> URL: https://issues.apache.org/jira/browse/SPARK-24871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Refactor {{Concat}} and {{MapConcat}} to:
>  - avoid creating concatenator object for each row.
>  - make {{Concat}} handle {{containsNull}} properly.
>  - make {{Concat}} shortcut if {{null}} child is found.






[jira] [Created] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-19 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-24871:
-

 Summary: Refactor Concat and MapConcat to avoid creating 
concatenator object for each row.
 Key: SPARK-24871
 URL: https://issues.apache.org/jira/browse/SPARK-24871
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin


Refactor {{Concat}} and {{MapConcat}} to:
 - avoid creating concatenator object for each row.
 - make {{Concat}} handle {{containsNull}} properly.
 - make {{Concat}} shortcut if {{null}} child is found.






[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I don't think YARN has such a feature to configure password-less 
SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment 
(Ambari) we don't use password-less SSH.
{quote}And does container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?
{quote}
If the container is not dockerized, it shares the system's sshd, and it is the 
system's responsibility to start/terminate that daemon.

If the container is dockerized, I think the Docker container should be 
responsible for starting sshd (IIUC).

Maybe we should check whether sshd is running before starting the MPI job; if 
sshd is not running, we simply cannot run the MPI job, no matter who is 
responsible for the sshd daemon.

[~leftnoteasy] might have some thoughts, since he is the originator of 
mpich2-yarn.
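
A minimal sketch of the kind of pre-flight check suggested above; the 
{{pgrep}} command and the failure handling are assumptions about the 
environment, not an agreed design:
{code:java}
// Hedged sketch: fail fast if no sshd process is found on this node before
// launching the MPI job.
import scala.sys.process._

def sshdRunning(): Boolean = Process(Seq("pgrep", "-x", "sshd")).! == 0

if (!sshdRunning()) {
  sys.error("sshd is not running on this node; cannot launch the MPI job")
}
{code}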

 

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier






[jira] [Assigned] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24195:
---

Assignee: Li Yuanjian

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Li Yuanjian
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In the change for SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> the switch to
> new File(path).getCanonicalFile.toURI.toString
> breaks when the path uses the local: scheme, as java.io.File doesn't handle it.
> E.g.:
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config
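
A hedged sketch of scheme-aware handling, for illustration only (this is not 
necessarily what PR 21533 actually does; the helper name is made up):
{code:java}
// Sketch: only run the File-based canonicalization for scheme-less or file: paths,
// and leave other schemes such as local: untouched.
import java.io.File
import java.net.URI

def resolvePath(path: String): String = {
  val uri = new URI(path)
  Option(uri.getScheme) match {
    case None | Some("file") =>
      new File(Option(uri.getPath).getOrElse(path)).getCanonicalFile.toURI.toString
    case Some(_) =>
      path   // e.g. "local:///home/user/demo/logger.config" is left as-is
  }
}
{code}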






[jira] [Resolved] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24195.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21533
[https://github.com/apache/spark/pull/21533]

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config






[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550167#comment-16550167
 ] 

Saisai Shao commented on SPARK-24307:
-

Issue resolved by pull request 21440
[https://github.com/apache/spark/pull/21440]
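
For context on the exception quoted in the description below: a single 
{{ByteBuf}} indexes with a signed 32-bit Int, so a block over 2 GB wraps to a 
negative index. A small sketch of the arithmetic; the block size is 
back-computed from this trace, as an assumption:
{code:java}
// ~2.8 GiB block; truncating the Long size to an Int wraps around, producing
// exactly the negative writerIndex Netty rejects in the stack trace below.
val blockSizeBytes = 3000350005L
println(blockSizeBytes.toInt)   // -1294617291
{code}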

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as Netty supports large 
> FileRegions. However, a {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in 

[jira] [Resolved] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24307.
-
Resolution: Fixed
  Assignee: Imran Rashid

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as Netty supports large 
> FileRegions. However, a {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached 

[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24307:

Fix Version/s: 2.4.0

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as Netty supports large 
> FileRegions. However, a {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached in memory as 
> deserialized values would need to have the 

[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: (was: 2.4.0)

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036






[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: 2.4.0

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036






[jira] [Comment Edited] (SPARK-24859) Predicates pushdown on outer joins

2018-07-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550159#comment-16550159
 ] 

Hyukjin Kwon edited comment on SPARK-24859 at 7/20/18 3:04 AM:
---

Does the same thing happen in Apache Spark's master branch? And if you are not 
working on this, would you mind providing a self-contained reproducer and 
explaining the results?


was (Author: hyukjin.kwon):
Does the same thing happen in Apache Spark's master branch? And if you are not 
working on this, would you mind providing a reproducer?

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id, but 
> only for some of the partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the SQL explain plan I can see that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join, the full DIm table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works with an inner join and I think the behaviour 
> should be the same for an outer join.
>  
>  
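
A hedged illustration of one of the workarounds the report alludes to (a manual 
rewrite, not something proposed in this thread; {{spark}} is assumed to be an 
existing SparkSession):
{code:java}
// Push the partition predicate into the dimension side by hand so the DIm scan
// is pruned even with a left join.
spark.sql("""
  select *
  from FA f
  left join (select * from DI where part_col = 'xyz') d
    on f.id = d.id and f.part_col = d.part_col
  where f.part_col = 'xyz'
""").explain()
{code}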






[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins

2018-07-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550159#comment-16550159
 ] 

Hyukjin Kwon commented on SPARK-24859:
--

Does the same thing happen in Apache Spark's master branch? And if you are not 
working on this, would you mind providing a reproducer?

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id, but 
> only for some of the partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the SQL explain plan I can see that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join, the full DIm table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works with an inner join and I think the behaviour 
> should be the same for an outer join.
>  
>  






[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins

2018-07-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550158#comment-16550158
 ] 

Hyukjin Kwon commented on SPARK-24859:
--

Please avoid setting {{Critical}}+, which is usually reserved for committers.

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id, but 
> only for some of the partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the SQL explain plan I can see that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join, the full DIm table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works with an inner join and I think the behaviour 
> should be the same for an outer join.
>  
>  






[jira] [Updated] (SPARK-24859) Predicates pushdown on outer joins

2018-07-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24859:
-
Priority: Major  (was: Critical)

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id, but 
> only for some of the partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the SQL explain plan I can see that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join, the full DIm table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works with an inner join and I think the behaviour 
> should be the same for an outer join.
>  
>  






[jira] [Resolved] (SPARK-24856) spark need upgrade Guava for use gRPC

2018-07-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24856.
--
Resolution: Duplicate

Please search existing JIRAs before filing a new one.

> spark need upgrade Guava for use gRPC
> -
>
> Key: SPARK-24856
> URL: https://issues.apache.org/jira/browse/SPARK-24856
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Input/Output, Spark Core
>Affects Versions: 2.3.1
>Reporter: alibaltschun
>Priority: Major
>
> Hello, I have a problem loading a Spark model while using gRPC dependencies.
>  
> I posted on StackOverflow, and someone said that Spark uses an old version of 
> Guava while gRPC needs Guava v20+, so Spark needs to upgrade its Guava version 
> to fix this issue.
>  
> thanks
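
For what it's worth, a hedged sketch of the usual workaround for this kind of 
conflict: shading Guava inside the application jar with sbt-assembly. This is a 
general technique, not something suggested in this thread, and the relocated 
package name is arbitrary:
{code:java}
// build.sbt fragment: relocate the application's Guava classes so gRPC can use a
// newer Guava while Spark keeps its own older version on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
{code}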






[jira] [Commented] (SPARK-24857) required the sample code test the spark steaming job in kubernates and write the data in remote hdfs file system

2018-07-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550153#comment-16550153
 ] 

Hyukjin Kwon commented on SPARK-24857:
--

[~kmkrishna1...@gmail.com], would you mind clarifying the input (self-contained 
and reproducible), the expected output, and the current output, along with an 
explanation of the reasoning?

> required the sample code test the spark steaming job in kubernates and write 
> the data in remote hdfs file system
> 
>
> Key: SPARK-24857
> URL: https://issues.apache.org/jira/browse/SPARK-24857
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.1
>Reporter: kumpatla murali krishna
>Priority: Major
>
> ./bin/spark-submit --master k8s://https://api.kubernates.aws.phenom.local 
> --deploy-mode cluster --name spark-pi --class  
> com.phenom.analytics.executor.SummarizationJobExecutor --conf 
> spark.executor.instances=5 --conf 
> spark.kubernetes.container.image=phenommurali/spark_new  --jars  
> hdfs://test-dev.com:8020/user/spark/jobs/Test_jar_without_jars.jar
> error 
> Normal SuccessfulMountVolume 2m kubelet, ip-x.ec2.internal 
> MountVolume.SetUp succeeded for volume "download-files-volume" Warning 
> FailedMount 2m kubelet, ip-.ec2.internal MountVolume.SetUp failed for 
> volume "spark-init-properties" : configmaps 
> "spark-pi-b5be4308783c3c479c6bf2f9da9b49dc-init-config" not found






[jira] [Assigned] (SPARK-24870) Cache can't work normally if there are case letters in SQL

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24870:


Assignee: (was: Apache Spark)

> Cache can't work normally if there are case letters in SQL
> --
>
> Key: SPARK-24870
> URL: https://issues.apache.org/jira/browse/SPARK-24870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: eaton
>Priority: Major
>
> Cache can't work normally if there are case letters in SQL, 
> for example:
>  sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
> sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
>  "from src group by key").cache().createOrReplaceTempView("src_cache")
>  sql(
>  s"""select a.key
>  from
>  (select key from src_cache where positiveNum = 1)a
>  left join
>  (select key from src_cache )b
>  on a.key=b.key
>  """).explain
>  
> The subquery "select key from src_cache where positiveNum = 1" on the left of 
> join can use the cache data, but the subquery "select key from src_cache" on 
> the right of join cannot use the cache data.






[jira] [Commented] (SPARK-24870) Cache can't work normally if there are case letters in SQL

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550131#comment-16550131
 ] 

Apache Spark commented on SPARK-24870:
--

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/21823

> Cache can't work normally if there are case letters in SQL
> --
>
> Key: SPARK-24870
> URL: https://issues.apache.org/jira/browse/SPARK-24870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: eaton
>Priority: Major
>
> Cache can't work normally if there are case letters in SQL, 
> for example:
>  sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
> sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
>  "from src group by key").cache().createOrReplaceTempView("src_cache")
>  sql(
>  s"""select a.key
>  from
>  (select key from src_cache where positiveNum = 1)a
>  left join
>  (select key from src_cache )b
>  on a.key=b.key
>  """).explain
>  
> The subquery "select key from src_cache where positiveNum = 1" on the left of 
> join can use the cache data, but the subquery "select key from src_cache" on 
> the right of join cannot use the cache data.






[jira] [Assigned] (SPARK-24870) Cache can't work normally if there are case letters in SQL

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24870:


Assignee: Apache Spark

> Cache can't work normally if there are case letters in SQL
> --
>
> Key: SPARK-24870
> URL: https://issues.apache.org/jira/browse/SPARK-24870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Major
>
> Cache can't work normally if there are case letters in SQL, 
> for example:
>  sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
> sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
>  "from src group by key").cache().createOrReplaceTempView("src_cache")
>  sql(
>  s"""select a.key
>  from
>  (select key from src_cache where positiveNum = 1)a
>  left join
>  (select key from src_cache )b
>  on a.key=b.key
>  """).explain
>  
> The subquery "select key from src_cache where positiveNum = 1" on the left of 
> join can use the cache data, but the subquery "select key from src_cache" on 
> the right of join cannot use the cache data.






[jira] [Created] (SPARK-24870) Cache can't work normally if there are case letters in SQL

2018-07-19 Thread eaton (JIRA)
eaton created SPARK-24870:
-

 Summary: Cache can't work normally if there are case letters in SQL
 Key: SPARK-24870
 URL: https://issues.apache.org/jira/browse/SPARK-24870
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: eaton


Cache can't work normally if there are case letters in SQL, 
for example:
 sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")

sql("select key, sum(case when Key > 0 then 1 else 0 end) as positiveNum " +
 "from src group by key").cache().createOrReplaceTempView("src_cache")
 sql(
 s"""select a.key
 from
 (select key from src_cache where positiveNum = 1)a
 left join
 (select key from src_cache )b
 on a.key=b.key
 """).explain

 

The subquery "select key from src_cache where positiveNum = 1" on the left of 
join can use the cache data, but the subquery "select key from src_cache" on 
the right of join cannot use the cache data.
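
A hedged note: if the diagnosis above is right (the mismatch is triggered by 
the mixed-case {{Key}}), keeping identifier case consistent in the cached query 
should let both join branches reuse the cache. A sketch of that workaround 
(an assumption, not a fix for the underlying issue):
{code:java}
// Workaround sketch only -- use "key" consistently instead of mixing "Key" and "key".
sql("select key, sum(case when key > 0 then 1 else 0 end) as positiveNum " +
  "from src group by key").cache().createOrReplaceTempView("src_cache")
{code}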






[jira] [Commented] (SPARK-24847) ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias

2018-07-19 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550091#comment-16550091
 ] 

Liang-Chi Hsieh commented on SPARK-24847:
-

I can't reproduce this currently.

> ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type 
> alias
> ---
>
> Key: SPARK-24847
> URL: https://issues.apache.org/jira/browse/SPARK-24847
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ahmed Mahran
>Priority: Major
>
> org.apache.spark.sql.catalyst.ScalaReflection#schemaFor occasionally fails to 
> detect schema for Seq of type alias (and it occasionally succeeds).
>  
> {code:java}
> object Types {
>   type Alias1 = Long
>   type Alias2 = Int
>   type Alias3 = Int
> }
> case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3])
> case class A(a1: B, a2: Int)
> {code}
>  
> {code}
> import sparkSession.implicits._
> val seq = Seq(
>   A(B(2L, Seq(3), Some(1)), 1),
>   A(B(3L, Seq(2), Some(2)), 2)
> )
> val ds = sparkSession.createDataset(seq)
> {code}
>  
> {code:java}
> java.lang.UnsupportedOperationException: Schema for type Seq[Types.Alias2] is 
> not supported at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:381)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:391)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:138)
>  at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72)
>  at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) at 
> org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248)
>  at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34)
>  {code}






[jira] [Commented] (SPARK-24523) InterruptedException when closing SparkContext

2018-07-19 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550060#comment-16550060
 ] 

Dongjoon Hyun commented on SPARK-24523:
---

Hi, [~umayr_nuna].
Do you still see the same situation with Apache Spark 2.3.1?

> InterruptedException when closing SparkContext
> --
>
> Key: SPARK-24523
> URL: https://issues.apache.org/jira/browse/SPARK-24523
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
> Environment: EMR 5.14.0, S3/HDFS inputs and outputs
>  
>  
>  
>Reporter: Umayr Hassan
>Priority: Major
>
> I'm running a Scala application in EMR with the following properties:
> {{--master yarn --deploy-mode cluster --driver-memory 13g --executor-memory 
> 30g --executor-cores 5 --conf spark.default.parallelism=400 --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.maxExecutors=20 --conf 
> spark.eventLog.dir=hdfs:///var/log/spark/apps --conf 
> spark.eventLog.enabled=true --conf 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
> spark.shuffle.service.enabled=true --conf spark.sql.shuffle.partitions=400 
> --conf spark.yarn.maxAppAttempts=1}}
> The application runs fine till SparkContext is (automatically) closed, at 
> which point the SparkContext object throws. 
> {{18/06/10 10:44:43 ERROR Utils: Uncaught exception in thread pool-4-thread-1 
> java.lang.InterruptedException at java.lang.Object.wait(Native Method) at 
> java.lang.Thread.join(Thread.java:1252) at 
> java.lang.Thread.join(Thread.java:1326) at 
> org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:133) at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at 
> org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
> org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219) at 
> org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1915)
>  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at 
> org.apache.spark.SparkContext.stop(SparkContext.scala:1914) at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) 
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) 
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>  at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)}}
>  
> I've not seen this behavior in Spark 2.0.2 and Spark 2.2.0 (for the same 
> application), so I'm not sure which change is causing Spark 2.3 to throw. Any 
> ideas?
> best,
> Umayr






[jira] [Resolved] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24726.
---
  Resolution: Resolved
Target Version/s: 2.4.0  (was: 3.0.0)

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed tensorflow, H2O, etc.






[jira] [Commented] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550050#comment-16550050
 ] 

Xiangrui Meng commented on SPARK-24726:
---

I'm closing this ticket as resolved since with passwordless SSH on a standalone 
cluster, users should be able to do other things via SSH.

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed tensorflow, H2O, etc.






[jira] [Assigned] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24726:
-

Assignee: Xiangrui Meng

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed tensorflow, H2O, etc.






[jira] [Updated] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24723:
--
Description: 
In barrier mode, to run hybrid distributed DL training jobs, we need to provide 
users sufficient info and access so they can set up a hybrid distributed 
training job, e.g., using MPI.

This ticket limits the scope of discussion to Spark + YARN. There were some 
past attempts from the Hadoop community. So we should find someone with good 
knowledge to lead the discussion here.

 

Requirements:
 * understand how to set up YARN to run MPI job as a YARN application
 * figure out how to do it with Spark w/ Barrier

  was:
In barrier mode, to run hybrid distributed DL training jobs, we need to provide 
users sufficient info and access so they can set up a hybrid distributed 
training job, e.g., using MPI.

This ticket limits the scope of discussion to Spark + YARN. There were some 
past attempts from the Hadoop community. So we should find someone with good 
knowledge to lead the discussion here.


> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550049#comment-16550049
 ] 

Xiangrui Meng commented on SPARK-24724:
---

[~liyinan926] Any updates?

> Discuss necessary info and access in barrier mode + Kubernetes
> --
>
> Key: SPARK-24724
> URL: https://issues.apache.org/jira/browse/SPARK-24724
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes, ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Yinan Li
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Kubernetes. There were 
> some past and on-going attempts from the Kubenetes community. So we should 
> find someone with good knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24723:
-

Assignee: Saisai Shao

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550048#comment-16550048
 ] 

Xiangrui Meng commented on SPARK-24723:
---

[~jerryshao] Does YARN have a feature that configures passwordless SSH across all 
containers by default (or per application)? If Spark generates the key files 
itself in barrier mode on YARN, it might break that YARN-provided feature. Also, 
do containers run sshd by default? If not, which process would be responsible 
for starting and terminating the daemon?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-19 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550046#comment-16550046
 ] 

Xiao Li commented on SPARK-24869:
-

Thanks! [~maropu]

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
> var numTotalCachedHit = 0
> val listener = new QueryExecutionListener {
>   override def onFailure(f: String, qe: QueryExecution, e: 
> Exception):Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, 
> duration: Long): Unit = {
> qe.withCachedData match {
>   case c: SaveIntoDataSourceCommand
>   if c.query.isInstanceOf[InMemoryRelation] =>
> numTotalCachedHit += 1
>   case _ =>
> println(qe.withCachedData)
> }
>   }
> }
> spark.listenerManager.register(listener)
> val udf1 = udf({ (x: Int, y: Int) => x + y })
> val df = spark.range(0, 3).toDF("a")
>   .withColumn("b", udf1(col("a"), lit(10)))
> df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
> properties)
> assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24865:


Assignee: (was: Apache Spark)

> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was not top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce 
> resolveOperatorsDown, and change all places that call transformDown in the 
> analyzer to use that. We can also ban accidental uses of the various 
> transform* methods by using a linter (which can only lint specific packages), 
> or in test mode inspect the stack trace and fail explicitly if transform* are 
> called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24865:


Assignee: Apache Spark

> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was not top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce 
> resolveOperatorsDown, and change all places that call transformDown in the 
> analyzer to use that. We can also ban accidental uses of the various 
> transform* methods by using a linter (which can only lint specific packages), 
> or in test mode inspect the stack trace and fail explicitly if transform* are 
> called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550047#comment-16550047
 ] 

Apache Spark commented on SPARK-24865:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21822

> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was not top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce 
> resolveOperatorsDown, and change all places that call transformDown in the 
> analyzer to use that. We can also ban accidental uses of the various 
> transform* methods by using a linter (which can only lint specific packages), 
> or in test mode inspect the stack trace and fail explicitly if transform* are 
> called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-19 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24867:

Priority: Blocker  (was: Major)

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix this for Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24867:


Assignee: Xiao Li  (was: Apache Spark)

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix this for Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550034#comment-16550034
 ] 

Apache Spark commented on SPARK-24867:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21821

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix this for Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24867:


Assignee: Apache Spark  (was: Xiao Li)

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix this for Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-19 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550029#comment-16550029
 ] 

Takeshi Yamamuro commented on SPARK-24869:
--

ok

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
> var numTotalCachedHit = 0
> val listener = new QueryExecutionListener {
>   override def onFailure(f: String, qe: QueryExecution, e: 
> Exception):Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, 
> duration: Long): Unit = {
> qe.withCachedData match {
>   case c: SaveIntoDataSourceCommand
>   if c.query.isInstanceOf[InMemoryRelation] =>
> numTotalCachedHit += 1
>   case _ =>
> println(qe.withCachedData)
> }
>   }
> }
> spark.listenerManager.register(listener)
> val udf1 = udf({ (x: Int, y: Int) => x + y })
> val df = spark.range(0, 3).toDF("a")
>   .withColumn("b", udf1(col("a"), lit(10)))
> df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
> properties)
> assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6459) Warn when Column API is constructing trivially true equality

2018-07-19 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550024#comment-16550024
 ] 

nirav patel commented on SPARK-6459:


[~zero323] Why should the example you gave generate a Cartesian product? I don't 
see why it would if it were run against any ANSI SQL engine (MySQL, Oracle). 
Why is this an issue in Spark? Is it because of the lazy evaluation and the 
query-planner optimization done at the end?

[~marmbrus] I think this "WARNING" is masking a bigger issue here. Couldn't the 
Spark SQL engine create the aliases itself, so that it doesn't get confused, 
instead of burdening the user with it? As far as the user is concerned, they are 
writing individual SQL statements that are syntactically and semantically 
correct; it is the Spark query planner that misinterprets the semantics.
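
To make the self-join case concrete, a minimal, hypothetical illustration (not 
taken from the original report) of the trivially-true predicate this warning 
targets, and the usual aliasing workaround:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join").getOrCreate()
import spark.implicits._

val df = spark.range(5).toDF("id")

// Both sides of the predicate resolve to the *same* attribute, so the condition
// is trivially true and the join degenerates into a cross product.
val ambiguous = df.join(df, df("id") === df("id"))

// Aliasing first gives the analyzer two distinct attribute sets to resolve,
// which is the workaround the warning points users toward.
val joined = df.as("l").join(df.as("r"), $"l.id" === $"r.id")
{code}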

> Warn when Column API is constructing trivially true equality
> 
>
> Key: SPARK-6459
> URL: https://issues.apache.org/jira/browse/SPARK-6459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.1, 1.4.0
>
>
> Right now its pretty confusing when a user constructs and equality predicate 
> that is going to be use in a self join, where the optimizer cannot 
> distinguish between the attributes in question (e.g.,  [SPARK-6231]).  Since 
> there is really no good reason to do this, lets print a warning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550023#comment-16550023
 ] 

Xiangrui Meng commented on SPARK-24615:
---

[~tgraves] Could you help link some of the past requests for configurable 
CPU/memory per stage? And you are suggesting that the API be made general enough 
to cover those scenarios, without including that feature in the scope of this 
proposal, correct?

Btw, how do you like the following API?
{code:java}
rdd.withResources
  .prefer("/gpu/k80", 2) // prefix of resource logical name, amount
  .require("/cpu", 1)
  .require("/memory", 819200)
  .require("/disk", 1){code}
 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc. It doesn't cover all the implementation 
> details, just highlight the parts should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-19 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550020#comment-16550020
 ] 

Xiao Li commented on SPARK-24869:
-

cc [~maropu] Do you want to give it a try?

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
> var numTotalCachedHit = 0
> val listener = new QueryExecutionListener {
>   override def onFailure(f: String, qe: QueryExecution, e: 
> Exception):Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, 
> duration: Long): Unit = {
> qe.withCachedData match {
>   case c: SaveIntoDataSourceCommand
>   if c.query.isInstanceOf[InMemoryRelation] =>
> numTotalCachedHit += 1
>   case _ =>
> println(qe.withCachedData)
> }
>   }
> }
> spark.listenerManager.register(listener)
> val udf1 = udf({ (x: Int, y: Int) => x + y })
> val df = spark.range(0, 3).toDF("a")
>   .withColumn("b", udf1(col("a"), lit(10)))
> df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
> properties)
> assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24869:
---

 Summary: SaveIntoDataSourceCommand's input Dataset does not use 
Cached Data
 Key: SPARK-24869
 URL: https://issues.apache.org/jira/browse/SPARK-24869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Xiao Li


{code}
withTable("t") {
  withTempPath { path =>

var numTotalCachedHit = 0
val listener = new QueryExecutionListener {
  override def onFailure(f: String, qe: QueryExecution, e: 
Exception):Unit = {}

  override def onSuccess(funcName: String, qe: QueryExecution, 
duration: Long): Unit = {
qe.withCachedData match {
  case c: SaveIntoDataSourceCommand
  if c.query.isInstanceOf[InMemoryRelation] =>
numTotalCachedHit += 1
  case _ =>
println(qe.withCachedData)
}
  }
}
spark.listenerManager.register(listener)

val udf1 = udf({ (x: Int, y: Int) => x + y })
val df = spark.range(0, 3).toDF("a")
  .withColumn("b", udf1(col("a"), lit(10)))
df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
properties)
assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24868) add sequence function in Python

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24868:


Assignee: (was: Apache Spark)

> add sequence function in Python
> ---
>
> Key: SPARK-24868
> URL: https://issues.apache.org/jira/browse/SPARK-24868
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Seems the sequence function is only in functions.scala, not in functions.py. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24868) add sequence function in Python

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550014#comment-16550014
 ] 

Apache Spark commented on SPARK-24868:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21820

> add sequence function in Python
> ---
>
> Key: SPARK-24868
> URL: https://issues.apache.org/jira/browse/SPARK-24868
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Seems the sequence function is only in functions.scala, not in functions.py. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24868) add sequence function in Python

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24868:


Assignee: Apache Spark

> add sequence function in Python
> ---
>
> Key: SPARK-24868
> URL: https://issues.apache.org/jira/browse/SPARK-24868
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Seems the sequence function is only in functions.scala, not in functions.py. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24868) add sequence function in Python

2018-07-19 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24868:
--

 Summary: add sequence function in Python
 Key: SPARK-24868
 URL: https://issues.apache.org/jira/browse/SPARK-24868
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Huaxin Gao


Seems the sequence function is only in functions.scala, not in functions.py. 
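
For reference, the Scala side already exposes it; a quick usage sketch (assumes a 
running SparkSession named spark):
{code:java}
import org.apache.spark.sql.functions.{lit, sequence}

// sequence(start, stop[, step]) generates an inclusive array of values.
val df = spark.range(1).select(sequence(lit(1), lit(5)).as("seq"))
df.show()  // seq = [1, 2, 3, 4, 5]
{code}
This issue is only about adding the corresponding wrapper to pyspark's functions.py.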



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-24867:
---

 Summary: Add AnalysisBarrier to DataFrameWriter 
 Key: SPARK-24867
 URL: https://issues.apache.org/jira/browse/SPARK-24867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li



{code}
  val udf1 = udf({(x: Int, y: Int) => x + y})
  val df = spark.range(0, 3).toDF("a")
.withColumn("b", udf1($"a", udf1($"a", lit(10
  df.cache()
  df.write.saveAsTable("t")
  df.write.saveAsTable("t1")
{code}

Cache is not being used because the plans do not match the cached plan. 
This is a regression caused by the changes we made in AnalysisBarrier, since 
not all the Analyzer rules are idempotent. We need to fix this for Spark 2.3.2.
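
One rough way to observe whether a given Dataset's plan picks up the cache (a 
sketch only; it leans on the internal QueryExecution API, and usesCache is just a 
hypothetical helper name):
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// True when the cache manager substituted an InMemoryRelation into the
// analyzed plan of this Dataset.
def usesCache(df: DataFrame): Boolean =
  df.queryExecution.withCachedData.collect { case r: InMemoryRelation => r }.nonEmpty
{code}
The regression above is that the plan re-analyzed on the write path no longer 
matches the cached one, so the cache is missed even though usesCache(df) returns 
true for the original df.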




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier

2018-07-19 Thread Evan Zamir (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Zamir updated SPARK-24866:
---
Description: I'm encountering a very strange behavior that I can't explain 
away other than a bug somewhere. I'm creating RF models on Amazon EMR, normally 
using 1 Core instance. On these models, I have been consistently getting ROCs 
(during CV) ~0.55-0.60 (not good models obviously, but that's not the point 
here). After learning that Spark 2.3 introduced a parallelism parameter for the 
CV class, I decided to implement that and see if increasing the number of Core 
instances could also help speed up the models (which take several hours, 
sometimes up to a full day). To make a long story short, I have seen that on 
some of my datasets simply increasing the number of Core instances (i.e. 2), 
the ROC scores (*bestValidationMetric*) increase tremendously to the range of 
0.85-0.95. For the life of me I can't figure out why simply increasing the 
number of instances (with absolutely no changes to code), would have this 
effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
I'd post here and see if anyone has an idea for me.   (was: I'm encountering a 
very strange behavior that I can't explain away other than a bug somewhere. I'm 
creating RF models on Amazon EMR, normally using 1 Core instance. On these 
models, I have been consistently getting ROCs (during CV) ~0.55-0.60 (not good 
models obviously, but that's not the point here). After learning that Spark 2.3 
introduced a parallelism parameter for the CV class, I decided to implement 
that and see if increasing the number of Core instances could also help speed 
up the models (which take several hours, sometimes up to a full day). To make a 
long story short, I have seen that on some of my datasets simply increasing the 
number of Core instances (i.e. 2), the ROC scores increase tremendously to the 
range of 0.85-0.95. For the life of me I can't figure out why simply increasing 
the number of instances (with absolutely no changes to code), would have this 
effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
I'd post here and see if anyone has an idea for me. )

> Artifactual ROC scores when scaling up Random Forest classifier
> ---
>
> Key: SPARK-24866
> URL: https://issues.apache.org/jira/browse/SPARK-24866
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Evan Zamir
>Priority: Minor
>
> I'm encountering a very strange behavior that I can't explain away other than 
> a bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core 
> instance. On these models, I have been consistently getting ROCs (during CV) 
> ~0.55-0.60 (not good models obviously, but that's not the point here). After 
> learning that Spark 2.3 introduced a parallelism parameter for the CV class, 
> I decided to implement that and see if increasing the number of Core 
> instances could also help speed up the models (which take several hours, 
> sometimes up to a full day). To make a long story short, I have seen that on 
> some of my datasets simply increasing the number of Core instances (i.e. 2), 
> the ROC scores (*bestValidationMetric*) increase tremendously to the range of 
> 0.85-0.95. For the life of me I can't figure out why simply increasing the 
> number of instances (with absolutely no changes to code), would have this 
> effect. I don't know if this is a Spark problem or somehow EMR, but I figured 
> I'd post here and see if anyone has an idea for me. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24866) Artifactual ROC scores when scaling up Random Forest classifier

2018-07-19 Thread Evan Zamir (JIRA)
Evan Zamir created SPARK-24866:
--

 Summary: Artifactual ROC scores when scaling up Random Forest 
classifier
 Key: SPARK-24866
 URL: https://issues.apache.org/jira/browse/SPARK-24866
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Evan Zamir


I'm encountering a very strange behavior that I can't explain away other than a 
bug somewhere. I'm creating RF models on Amazon EMR, normally using 1 Core 
instance. On these models, I have been consistently getting ROCs (during CV) 
~0.55-0.60 (not good models obviously, but that's not the point here). After 
learning that Spark 2.3 introduced a parallelism parameter for the CV class, I 
decided to implement that and see if increasing the number of Core instances 
could also help speed up the models (which take several hours, sometimes up to 
a full day). To make a long story short, I have seen that on some of my 
datasets simply increasing the number of Core instances (i.e. 2), the ROC 
scores increase tremendously to the range of 0.85-0.95. For the life of me I 
can't figure out why simply increasing the number of instances (with absolutely 
no changes to code), would have this effect. I don't know if this is a Spark 
problem or somehow EMR, but I figured I'd post here and see if anyone has an 
idea for me. 
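
For context, the Spark 2.3 knob being referred to is CrossValidator.setParallelism. 
Below is a minimal sketch of such a setup (the estimator, grid, metric, and 
trainingDF here are assumptions, not the reporter's actual pipeline); the parameter 
only controls how many candidate models are fit concurrently, so by itself it 
should not move the ROC metric:
{code:java}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
val grid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(50, 100))
  .build()
val cv = new CrossValidator()
  .setEstimator(rf)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(2)  // added in Spark 2.3; evaluates candidate models concurrently

// val model = cv.fit(trainingDF)  // trainingDF is assumed to exist
{code}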



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-24865:

Affects Version/s: 2.3.0

> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Reynold Xin
>Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was not top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce 
> resolveOperatorsDown, and change all places that call transformDown in the 
> analyzer to use that. We can also ban accidental uses of the various 
> transform* methods by using a linter (which can only lint specific packages), 
> or in test mode inspect the stack trace and fail explicitly if transform* are 
> called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-24865:

Description: 
AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't 
re-analyze nodes that have already been analyzed).

Before AnalysisBarrier, we already had some infrastructure in place, with 
analysis specific functions (resolveOperators and resolveExpressions). These 
functions do not recursively traverse down subplans that are already analyzed 
(with a mutable boolean flag _analyzed). The issue with the old system was that 
developers started using transformDown, which does a top-down traversal of the 
plan tree, because there was not top-down resolution function, and as a result 
analyzer performance became pretty bad.

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
special node and for this special node, transform/transformUp/transformDown 
don't traverse down. However, the introduction of this special node caused a 
lot more troubles than it solves. This implicit node breaks assumptions and 
code in a few places, and it's hard to know when analysis barrier would exist, 
and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions 
demonstrates it is a source of bugs and additional complexity.

Instead, I think a much simpler fix to the original issue is to introduce 
resolveOperatorsDown, and change all places that call transformDown in the 
analyzer to use that. We can also ban accidental uses of the various transform* 
methods by using a linter (which can only lint specific packages), or in test 
mode inspect the stack trace and fail explicitly if transform* are called in 
the analyzer. 

  was:
AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't 
re-analyze nodes that have already been analyzed).

 

Before AnalysisBarrier, we already had some infrastructure in place, with 
analysis specific functions (resolveOperators and resolveExpressions). These 
functions do not recursively traverse down subplans that are already analyzed 
(with a mutable boolean flag _analyzed). The issue with the old system was that 
developers started using transformDown, which does a top-down traversal of the 
plan tree, because there was not top-down resolution function, and as a result 
analyzer performance became pretty bad.

 

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
special node and for this special node, transform/transformUp/transformDown 
don't traverse down. However, the introduction of this special node caused a 
lot more troubles than it solves. This implicit node breaks assumptions and 
code in a few places, and it's hard to know when analysis barrier would exist, 
and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions 
demonstrates it is a source of bugs and additional complexity.

 

Instead, I think a much simpler fix to the original issue is to introduce 
resolveOperatorsDown, and change all places that call transformDown in the 
analyzer to use that. We can also ban accidental uses of the various transform* 
methods by using a linter (which can only lint specific packages), or in test 
mode inspect the stack trace and fail explicitly if transform* are called in 
the analyzer.

 

 

 

 


> Remove AnalysisBarrier
> --
>
> Key: SPARK-24865
> URL: https://issues.apache.org/jira/browse/SPARK-24865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Reynold Xin
>Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed 
> (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with 
> analysis specific functions (resolveOperators and resolveExpressions). These 
> functions do not recursively traverse down subplans that are already analyzed 
> (with a mutable boolean flag _analyzed). The issue with the old system was 
> that developers started using transformDown, which does a top-down traversal 
> of the plan tree, because there was not top-down resolution function, and as 
> a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
> special node and for this special node, transform/transformUp/transformDown 
> don't traverse down. However, the introduction of this special node caused a 
> lot more troubles than it solves. This implicit node breaks assumptions and 
> code in a few places, and it's hard to know when analysis barrier would 
> exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR 
> discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much 

[jira] [Created] (SPARK-24865) Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-24865:
---

 Summary: Remove AnalysisBarrier
 Key: SPARK-24865
 URL: https://issues.apache.org/jira/browse/SPARK-24865
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Reynold Xin


AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't 
re-analyze nodes that have already been analyzed).

 

Before AnalysisBarrier, we already had some infrastructure in place, with 
analysis specific functions (resolveOperators and resolveExpressions). These 
functions do not recursively traverse down subplans that are already analyzed 
(with a mutable boolean flag _analyzed). The issue with the old system was that 
developers started using transformDown, which does a top-down traversal of the 
plan tree, because there was not top-down resolution function, and as a result 
analyzer performance became pretty bad.

 

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a 
special node and for this special node, transform/transformUp/transformDown 
don't traverse down. However, the introduction of this special node caused a 
lot more troubles than it solves. This implicit node breaks assumptions and 
code in a few places, and it's hard to know when analysis barrier would exist, 
and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions 
demonstrates it is a source of bugs and additional complexity.

 

Instead, I think a much simpler fix to the original issue is to introduce 
resolveOperatorsDown, and change all places that call transformDown in the 
analyzer to use that. We can also ban accidental uses of the various transform* 
methods by using a linter (which can only lint specific packages), or in test 
mode inspect the stack trace and fail explicitly if transform* are called in 
the analyzer.
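
To illustrate the idea with a toy sketch (this is not Catalyst code; Node and 
resolveDown are made-up names), the proposed resolveOperatorsDown is simply a 
top-down transform that refuses to descend into subtrees already flagged as analyzed:
{code:java}
// Toy stand-in for a plan tree; Catalyst's LogicalPlan keeps a similar
// _analyzed flag that resolveOperators consults today.
case class Node(name: String, children: Seq[Node], analyzed: Boolean = false) {
  // Top-down: apply the rule to this node first, then recurse into children,
  // but skip any subtree that has already been analyzed.
  def resolveDown(rule: PartialFunction[Node, Node]): Node =
    if (analyzed) this
    else {
      val applied = rule.applyOrElse(this, identity[Node])
      applied.copy(children = applied.children.map(_.resolveDown(rule)))
    }
}

object ResolveDownDemo extends App {
  val plan = Node("Project", Seq(Node("Filter", Seq(Node("Relation", Nil, analyzed = true)))))
  // The rule fires on Filter but never touches the already-analyzed Relation.
  println(plan.resolveDown { case n if n.name == "Filter" => n.copy(name = "ResolvedFilter") })
}
{code}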

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-19 Thread Abhishek Madav (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-24864:
---
Description: 
Spark job reading from a hive-view fails with analysis exception when resolving 
column ordinals which are autogenerated.

*Exception*:
{code:java}
scala> spark.sql("Select * from vsrc1new").show
org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
input columns: [id, upper(name)]; line 1 pos 24;
'Project [*]
+- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
   +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
  +- SubqueryAlias vsrc1new
 +- Project [id#634, upper(name#635) AS upper(name)#636]
    +- MetastoreRelation default, src1

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
{code}
*Steps to reproduce:*

1: Create a simple table, say src
{code:java}
CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ','
{code}
2: Create a view, say with name vsrc1new
{code:java}
CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) 
FROM src1) vsrc1new;
{code}
3. Selecting data from this view in hive-cli/beeline doesn't cause any error.

4. Creating a dataframe using:
{code:java}
spark.sql("Select * from vsrc1new").show //throws error
{code}
The auto-generated column names for the view are not resolved. Am I possibly 
missing some spark-sql configuration here? I tried the repro-case against spark 
1.6 and that worked fine. Any inputs are appreciated.

  was:
Spark job reading from a hive-view fails with analysis exception when resolving 
column ordinals which are autogenerated.

*Exception*:
{code:java}
scala> spark.sql("Select * from vsrc1new").show
org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
input columns: [id, upper(name)]; line 1 pos 24;
'Project [*]
+- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
   +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
  +- SubqueryAlias vsrc1new
 +- Project [id#634, upper(name#635) AS upper(name)#636]
    +- MetastoreRelation default, src1

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
{code}
Steps to reproduce:

1: Create a simple table, say src
{code:java}
CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ','
{code}
2: Create a view, say with name vsrc1new
{code:java}
CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) 
FROM src1) vsrc1new;
{code}
3. Selecting data from this view in hive-cli/beeline doesn't cause any error.

4. Creating a dataframe using:
{code:java}
spark.sql("Select * from vsrc1new").show //throws error
{code}
The auto-generated column names for the view are not resolved. Am I possibly 
missing some spark-sql configuration here? I tried the repro-case against spark 
1.6 and that worked fine. Any inputs are appreciated.


> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark job reading from a hive-view fails with analysis exception when 
> resolving column ordinals which are autogenerated.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from 

[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-19 Thread Abhishek Madav (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-24864:
---
Summary: Cannot resolve auto-generated column ordinals in a hive view  
(was: Cannot reference auto-generated column ordinals in a hive view)

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark job reading from a hive-view fails with analysis exception when 
> resolving column ordinals which are autogenerated.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> Steps to reproduce:
> 1: Create a simple table, say src
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive view

2018-07-19 Thread Abhishek Madav (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-24864:
---
Summary: Cannot reference auto-generated column ordinals in a hive view  
(was: Cannot reference auto-generated column ordinals in a hive-view. )

> Cannot reference auto-generated column ordinals in a hive view
> --
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark job reading from a hive-view fails with analysis exception when 
> resolving column ordinals which are autogenerated.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> Steps to reproduce:
> 1: Create a simple table, say src
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive-view.

2018-07-19 Thread Abhishek Madav (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Madav updated SPARK-24864:
---
Description: 
Spark job reading from a hive-view fails with analysis exception when resolving 
column ordinals which are autogenerated.

*Exception*:
{code:java}
scala> spark.sql("Select * from vsrc1new").show
org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
input columns: [id, upper(name)]; line 1 pos 24;
'Project [*]
+- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
   +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
  +- SubqueryAlias vsrc1new
 +- Project [id#634, upper(name#635) AS upper(name)#636]
    +- MetastoreRelation default, src1

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
{code}
Steps to reproduce:

1: Create a simple table, say src
{code:java}
CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY ','
{code}
2: Create a view, say with name vsrc1new
{code:java}
CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) 
FROM src1) vsrc1new;
{code}
3. Selecting data from this view in hive-cli/beeline doesn't cause any error.

4. Creating a dataframe using:
{code:java}
spark.sql("Select * from vsrc1new").show //throws error
{code}
The auto-generated column names for the view are not resolved. Am I possibly 
missing some spark-sql configuration here? I tried the repro-case against spark 
1.6 and that worked fine. Any inputs are appreciated.
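
As a possible workaround sketch (assuming the view can be recreated), aliasing the inner 
projection explicitly avoids relying on the auto-generated {{_c1}} ordinal altogether:
{code:java}
// Hypothetical rewrite of the view from the repro above: give upper(name) an
// explicit alias so neither Hive nor Spark has to resolve the `_c1` column.
spark.sql("""
  CREATE VIEW vsrc1new AS
  SELECT id, uname
  FROM (SELECT id, upper(name) AS uname FROM src1) t
""")
spark.sql("SELECT * FROM vsrc1new").show()
{code}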

> Cannot reference auto-generated column ordinals in a hive-view. 
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark job reading from a hive-view fails with analysis exception when 
> resolving column ordinals which are autogenerated.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> Steps to reproduce:
> 1: Create a simple table, say src
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SPARK-24846) Stabilize expression canonicalization

2018-07-19 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-24846.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Fixed by gvr's PR. I could not find this user in JIRA.

> Stabilize expression canonicalization
> --
>
> Key: SPARK-24846
> URL: https://issues.apache.org/jira/browse/SPARK-24846
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: spree
> Fix For: 2.4.0
>
>
> Spark plan canonicalization can be non-deterministic between different 
> versions of Spark because {{ExprId}} uses a UUID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24864) Cannot reference auto-generated column ordinals in a hive-view.

2018-07-19 Thread Abhishek Madav (JIRA)
Abhishek Madav created SPARK-24864:
--

 Summary: Cannot reference auto-generated column ordinals in a 
hive-view. 
 Key: SPARK-24864
 URL: https://issues.apache.org/jira/browse/SPARK-24864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.1
Reporter: Abhishek Madav
 Fix For: 2.4.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24863:


Assignee: Apache Spark

> Report offset lag as a custom metrics for Kafka structured streaming source
> ---
>
> Key: SPARK-24863
> URL: https://issues.apache.org/jira/browse/SPARK-24863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Major
>
> We can build on top of SPARK-24748 to report offset lag as a custom metric 
> for the Kafka structured streaming source.
> This is the difference between the latest offsets in Kafka at the time the 
> metric is reported (just after a micro-batch completes) and the latest 
> offset Spark has processed. It can be 0 (or close to 0) if Spark keeps up 
> with the rate at which messages are ingested into Kafka topics in steady 
> state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24863:


Assignee: (was: Apache Spark)

> Report offset lag as a custom metrics for Kafka structured streaming source
> ---
>
> Key: SPARK-24863
> URL: https://issues.apache.org/jira/browse/SPARK-24863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> We can build on top of SPARK-24748 to report offset lag as a custom metric 
> for the Kafka structured streaming source.
> This is the difference between the latest offsets in Kafka at the time the 
> metric is reported (just after a micro-batch completes) and the latest 
> offset Spark has processed. It can be 0 (or close to 0) if Spark keeps up 
> with the rate at which messages are ingested into Kafka topics in steady 
> state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549814#comment-16549814
 ] 

Apache Spark commented on SPARK-24863:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21819

> Report offset lag as a custom metrics for Kafka structured streaming source
> ---
>
> Key: SPARK-24863
> URL: https://issues.apache.org/jira/browse/SPARK-24863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> We can build on top of SPARK-24748 to report offset lag as a custom metric 
> for the Kafka structured streaming source.
> This is the difference between the latest offsets in Kafka at the time the 
> metric is reported (just after a micro-batch completes) and the latest 
> offset Spark has processed. It can be 0 (or close to 0) if Spark keeps up 
> with the rate at which messages are ingested into Kafka topics in steady 
> state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-07-19 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-24863:
--

 Summary: Report offset lag as a custom metrics for Kafka 
structured streaming source
 Key: SPARK-24863
 URL: https://issues.apache.org/jira/browse/SPARK-24863
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Arun Mahadevan


We can build on top of SPARK-24748 to report offset lag as a custom metric for 
the Kafka structured streaming source.

This is the difference between the latest offsets in Kafka at the time the metric 
is reported (just after a micro-batch completes) and the latest offset Spark 
has processed. It can be 0 (or close to 0) if Spark keeps up with the rate at 
which messages are ingested into Kafka topics in steady state.
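
For illustration, a rough sketch of how such a lag metric could be computed (the names 
and types below are placeholders, not the actual Kafka source internals):
{code:java}
// Hypothetical sketch: total offset lag across topic partitions.
// latestOffsets: latest available offset per (topic, partition) as reported by Kafka
// processedOffsets: last offset Spark has processed per (topic, partition)
def totalOffsetLag(
    latestOffsets: Map[(String, Int), Long],
    processedOffsets: Map[(String, Int), Long]): Long = {
  latestOffsets.map { case (tp, latest) =>
    val processed = processedOffsets.getOrElse(tp, 0L)
    math.max(0L, latest - processed) // clamp against transient races
  }.sum
}
{code}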



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22187) Update unsaferow format for saved state such that we can set timeouts when state is null

2018-07-19 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22187.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21739
[https://github.com/apache/spark/pull/21739]

> Update unsaferow format for saved state such that we can set timeouts when 
> state is null
> 
>
> Key: SPARK-22187
> URL: https://issues.apache.org/jira/browse/SPARK-22187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>  Labels: release-notes, releasenotes
> Fix For: 3.0.0
>
>
> Currently the group state of the user-defined type is encoded as top-level 
> columns in the unsafe rows stored in the state store. The timeout timestamp is 
> also saved (when needed) as the last top-level column. Since the groupState is 
> serialized to top-level columns, you cannot save "null" as a value of the state 
> (setting null in all the top-level columns is not equivalent). So we don't let 
> the user set the timeout without initializing the state for a key. Based on 
> user experience, this leads to confusion. 
> This JIRA is to change the row format such that the state is saved as nested 
> columns. This would allow the state to be set to null, and avoid these 
> confusing corner cases.
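
To visualize the change described above, here is a rough schema-level sketch (the field 
names are invented for illustration; this is not the actual saved-state layout):
{code:java}
import org.apache.spark.sql.types._

// Flat layout (sketch): state fields are top-level columns, so a "null state"
// cannot be told apart from a state whose fields all happen to be null.
val flatLayout = new StructType()
  .add("count", LongType)
  .add("sum", DoubleType)
  .add("timeoutTimestamp", LongType)

// Nested layout (sketch): the whole state is a single nullable struct column,
// so the state itself can be null while a timeout is still set.
val nestedLayout = new StructType()
  .add("groupState", new StructType()
    .add("count", LongType)
    .add("sum", DoubleType), nullable = true)
  .add("timeoutTimestamp", LongType)
{code}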



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-19 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-24862:
--

 Summary: Spark Encoder is not consistent to scala case class 
semantic for multiple argument lists
 Key: SPARK-24862
 URL: https://issues.apache.org/jira/browse/SPARK-24862
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Antonio Murgia


The Spark Encoder is not consistent with Scala case class semantics for multiple 
argument lists.

For example, if I create a case class with multiple constructor argument lists:
{code:java}
case class Multi(x: String)(y: Int){code}
Scala creates a product with arity 1, while if I apply 
{code:java}
Encoders.product[Multi].schema.printTreeString{code}
I get
{code:java}
root
|-- x: string (nullable = true)
|-- y: integer (nullable = false){code}
That is not consistent and leads to:
{code:java}
Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
it.enel.next.platform.service.events.common.massive.immutable.Multi
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).x, 
true) AS x#0
assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).y 
AS y#1
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
Couldn't find y on class 
it.enel.next.platform.service.events.common.massive.immutable.Multi
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).x, 
true) AS x#0
assertnotnull(assertnotnull(input[0, 
it.enel.next.platform.service.events.common.massive.immutable.Multi, true])).y 
AS y#1
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
at 
it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
at org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:373)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:410)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FlatSpecLike$class.runTests(FlatSpecLike.scala:1750)
at org.scalatest.FlatSpec.runTests(FlatSpec.scala:1685)
at 

[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory

2018-07-19 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549745#comment-16549745
 ] 

Imran Rashid commented on SPARK-24801:
--

[~felixcheung] [~mshen]

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can 
> waste a lot of memory
> ---
>
> Key: SPARK-24801
> URL: https://issues.apache.org/jira/browse/SPARK-24801
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed another YARN NM heap dump with jxray 
> ([www.jxray.com|http://www.jxray.com]) and found that 81% of memory is 
> wasted by empty (all zeroes) byte[] arrays. Most of these arrays are 
> referenced by 
> {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in 
> turn come from 
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is 
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>  
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that 
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory, 
> we should initialize {{byteChannel}} lazily, upon the first use. As far as I 
> can see, it's used only in one method, {{private void nextChunk()}}.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2018-07-19 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549724#comment-16549724
 ] 

Reynold Xin commented on SPARK-15689:
-

Is there an umbrella ticket on improving the dsv2 api? If yes, please link to 
it from this Jira ticket. If not, we should create one.

 

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making data source API compatibility dependent on 
> the upper-level API. The current data source API is also only row-oriented 
> and has to go through an expensive external-to-internal data type conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24192) Invalid Spark URL in local spark session since upgrading from org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0

2018-07-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549707#comment-16549707
 ] 

Prabhu Joseph commented on SPARK-24192:
---

We have faced this issue; it happens when the hostname has an underscore in 
it. 
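
A possible local-mode workaround (just a sketch; not verified for every environment) is to 
pin the driver host explicitly so the RPC endpoint URL does not pick up the underscored 
hostname:
{code:java}
import org.apache.spark.sql.SparkSession

// Assumes a local test session; "localhost" keeps the machine hostname (which
// may contain an underscore) out of the spark://HeartbeatReceiver@... URL.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ci-tests")
  .config("spark.driver.host", "localhost")
  .getOrCreate()
{code}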

> Invalid Spark URL in local spark session since upgrading from 
> org.apache.spark:spark-sql_2.11:2.2.1 to org.apache.spark:spark-sql_2.11:2.3.0
> 
>
> Key: SPARK-24192
> URL: https://issues.apache.org/jira/browse/SPARK-24192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Tal Barda
>Priority: Major
>  Labels: HeartbeatReceiver, config, session, spark, spark-conf, 
> spark-session
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Since updating to Spark 2.3.0, tests which are run in my CI (Codeship) fail 
> due to an allegedly invalid Spark URL when creating the (local) Spark context. 
> Here's a log from my _*mvn clean install*_ command execution:
> {quote}{{2018-05-03 13:18:47.668 ERROR 5533 --- [ main] 
> org.apache.spark.SparkContext : Error initializing SparkContext. 
> org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@railsonfire_61eb1c99-232b-49d0-abb5-a1eb9693516b_52bcc09bb48b:44284
>  at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.executor.Executor.(Executor.scala:155) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
>  ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.SparkContext.(SparkContext.scala:500) 
> ~[spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) 
> [spark-core_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
>  [spark-sql_2.11-2.3.0.jar:2.3.0] at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
>  [spark-sql_2.11-2.3.0.jar:2.3.0] at scala.Option.getOrElse(Option.scala:121) 
> [scala-library-2.11.8.jar:na] at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) 
> [spark-sql_2.11-2.3.0.jar:2.3.0] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig.sparkSession(SparkConfig.java:43)
>  [classes/:na] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.CGLIB$sparkSession$0()
>  [classes/:na] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72$$FastClassBySpringCGLIB$$a213b647.invoke()
>  [classes/:na] at 
> org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228) 
> [spring-core-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:361)
>  [spring-context-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> com.planck.spark_data_features_extractors.utils.SparkConfig$$EnhancerBySpringCGLIB$$66dd1f72.sparkSession()
>  [classes/:na] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[na:1.8.0_171] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[na:1.8.0_171] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[na:1.8.0_171] at java.lang.reflect.Method.invoke(Method.java:498) 
> ~[na:1.8.0_171] at 
> org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
>  [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:579)
>  [spring-beans-5.0.2.RELEASE.jar:5.0.2.RELEASE] at 
> 

[jira] [Commented] (SPARK-23908) High-order function: transform(array, function) → array

2018-07-19 Thread Frederick Reiss (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549628#comment-16549628
 ] 

Frederick Reiss commented on SPARK-23908:
-

Thanks Herman, looking forward to seeing this feature!

> High-order function: transform(array, function) → array
> ---
>
> Key: SPARK-23908
> URL: https://issues.apache.org/jira/browse/SPARK-23908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Herman van Hovell
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array that is the result of applying function to each element of 
> array:
> {noformat}
> SELECT transform(ARRAY [], x -> x + 1); -- []
> SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7]
> SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7]
> SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', 
> 'z0']
> SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x 
> -> x IS NOT NULL)); -- [[1, 2], [3]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Summary: SPIP: Support Barrier Execution Mode in Apache Spark  (was: SPIP: 
Support Barrier Scheduling in Apache Spark)

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24860) Expose dynamic partition overwrite per write operation

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24860:


Assignee: Apache Spark

> Expose dynamic partition overwrite per write operation
> --
>
> Key: SPARK-24860
> URL: https://issues.apache.org/jira/browse/SPARK-24860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> This is a follow up to issue SPARK-20236
> Also see the discussion in pullreq https://github.com/apache/spark/pull/18714
> SPARK-20236 added a global setting spark.sql.sources.partitionOverwriteMode 
> to switch between static and dynamic overwrite of partitioned tables. It 
> would be nice if we could choose, per partitioned overwrite operation, whether 
> its behavior is static or dynamic. The suggested syntax is:
> {noformat}
> df.write.option("partitionOverwriteMode", "dynamic").parquet...{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24860) Expose dynamic partition overwrite per write operation

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24860:


Assignee: (was: Apache Spark)

> Expose dynamic partition overwrite per write operation
> --
>
> Key: SPARK-24860
> URL: https://issues.apache.org/jira/browse/SPARK-24860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: koert kuipers
>Priority: Minor
>
> This is a follow up to issue SPARK-20236
> Also see the discussion in pullreq https://github.com/apache/spark/pull/18714
> SPARK-20236 added a global setting spark.sql.sources.partitionOverwriteMode 
> to switch between static and dynamic overwrite of partitioned tables. It 
> would be nice if we could choose, per partitioned overwrite operation, whether 
> its behavior is static or dynamic. The suggested syntax is:
> {noformat}
> df.write.option("partitionOverwriteMode", "dynamic").parquet...{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24860) Expose dynamic partition overwrite per write operation

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549620#comment-16549620
 ] 

Apache Spark commented on SPARK-24860:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/21818

> Expose dynamic partition overwrite per write operation
> --
>
> Key: SPARK-24860
> URL: https://issues.apache.org/jira/browse/SPARK-24860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: koert kuipers
>Priority: Minor
>
> This is a follow up to issue SPARK-20236
> Also see the discussion in pullreq https://github.com/apache/spark/pull/18714
> SPARK-20236 added a global setting spark.sql.sources.partitionOverwriteMode 
> to switch between static and dynamic overwrite of partitioned tables. It 
> would be nice if we could choose, per partitioned overwrite operation, whether 
> its behavior is static or dynamic. The suggested syntax is:
> {noformat}
> df.write.option("partitionOverwriteMode", "dynamic").parquet...{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24861) create corrected temp directories in RateSourceSuite

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549540#comment-16549540
 ] 

Apache Spark commented on SPARK-24861:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21817

> create corrected temp directories in RateSourceSuite
> 
>
> Key: SPARK-24861
> URL: https://issues.apache.org/jira/browse/SPARK-24861
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24861) create corrected temp directories in RateSourceSuite

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24861:


Assignee: Apache Spark  (was: Wenchen Fan)

> create corrected temp directories in RateSourceSuite
> 
>
> Key: SPARK-24861
> URL: https://issues.apache.org/jira/browse/SPARK-24861
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24861) create corrected temp directories in RateSourceSuite

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24861:


Assignee: Wenchen Fan  (was: Apache Spark)

> create corrected temp directories in RateSourceSuite
> 
>
> Key: SPARK-24861
> URL: https://issues.apache.org/jira/browse/SPARK-24861
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-19 Thread Qifan Pu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549504#comment-16549504
 ] 

Qifan Pu commented on SPARK-24838:
--

Thanks for the PR [~maurits]! Should we also fix it for Aggregate altogether?

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>  Labels: spree
>
> Currently, CheckAnalysis allows IN/EXISTS subquery only for filter operators. 
> Running a query:
> {{select name in (select * from valid_names)}}
> {{from all_names}}
> returns error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24861) create corrected temp directories in RateSourceSuite

2018-07-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-24861:
---

 Summary: create corrected temp directories in RateSourceSuite
 Key: SPARK-24861
 URL: https://issues.apache.org/jira/browse/SPARK-24861
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24424:


Assignee: Apache Spark

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-07-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24424:


Assignee: (was: Apache Spark)

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24424) Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET

2018-07-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549478#comment-16549478
 ] 

Apache Spark commented on SPARK-24424:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21813

> Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24755) Executor loss can cause task to not be resubmitted

2018-07-19 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-24755:
-

Assignee: Hieu Tri Huynh

> Executor loss can cause task to not be resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mridul Muralidharan
>Assignee: Hieu Tri Huynh
>Priority: Major
> Fix For: 2.4.0, 2.3.3
>
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, if/when exec-1 goes MIA,
> executorLost will no longer schedule a task for P1 - since 
> killedByOtherAttempt(P1) == true - even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343.
> Essentially, SPARK-22074 causes a regression (which I don't usually observe 
> due to the shuffle service, sigh) - and as such the fix is broken IMO.
> I don't have a PR handy for this, so if anyone wants to pick it up, please 
> feel free!
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24755) Executor loss can cause task to not be resubmitted

2018-07-19 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-24755.
---
   Resolution: Fixed
Fix Version/s: 2.3.3
   2.4.0

> Executor loss can cause task to not be resubmitted
> --
>
> Key: SPARK-24755
> URL: https://issues.apache.org/jira/browse/SPARK-24755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Mridul Muralidharan
>Priority: Major
> Fix For: 2.4.0, 2.3.3
>
>
> As part of SPARK-22074, when an executor is lost, TSM.executorLost currently 
> checks for "if (successful(index) && !killedByOtherAttempt(index))" to decide 
> if task needs to be resubmitted for partition.
> Consider following:
> For partition P1, tasks T1 and T2 are running on exec-1 and exec-2 
> respectively (one of them being speculative task)
> T1 finishes successfully first.
> This results in setting "killedByOtherAttempt(P1) = true" due to running T2.
> We also end up killing task T2.
> Now, if/when exec-1 goes MIA,
> executorLost will no longer schedule a task for P1 - since 
> killedByOtherAttempt(P1) == true - even though P1 was hosted on T1 and there 
> is no other copy of P1 around (T2 was killed when T1 succeeded).
> I noticed this bug as part of reviewing PR# 21653 for SPARK-13343.
> Essentially, SPARK-22074 causes a regression (which I don't usually observe 
> due to the shuffle service, sigh) - and as such the fix is broken IMO.
> I don't have a PR handy for this, so if anyone wants to pick it up, please 
> feel free!
> +CC [~XuanYuan] who fixed SPARK-22074 initially.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-07-19 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549356#comment-16549356
 ] 

Thomas Graves commented on SPARK-22151:
---

ok thanks, must have missed that.

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts it into the local env. 
> So the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around this if you are running in cluster mode by setting it on the 
> client, like:
> PYTHONPATH=./addon/python/ spark-submit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24860) Expose dynamic partition overwrite per write operation

2018-07-19 Thread koert kuipers (JIRA)
koert kuipers created SPARK-24860:
-

 Summary: Expose dynamic partition overwrite per write operation
 Key: SPARK-24860
 URL: https://issues.apache.org/jira/browse/SPARK-24860
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: koert kuipers


This is a follow up to issue SPARK-20236

Also see the discussion in pullreq https://github.com/apache/spark/pull/18714

SPARK-20236 added a global setting spark.sql.sources.partitionOverwriteMode to 
switch between static and dynamic overwrite of partitioned tables. It would be 
nice if we could choose, per partitioned overwrite operation, whether its 
behavior is static or dynamic. The suggested syntax is:
{noformat}
df.write.option("partitionOverwriteMode", "dynamic").parquet...{noformat}
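
For comparison, a rough sketch of the proposed per-write option next to today's global 
setting (the column name and path below are placeholders; the per-write option is the 
proposal, not existing behavior):
{code:java}
// Today (SPARK-20236): the overwrite mode is a session-wide setting.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").partitionBy("date").parquet("/data/table")

// Proposed here: choose the mode for a single write, overriding the session default.
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("date")
  .parquet("/data/table")
{code}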
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-07-19 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549351#comment-16549351
 ] 

Sean Owen commented on SPARK-22151:
---

It wasn't marked as Resolved, but had a Fix version. I think you just Resolved 
it. My JIRA client's reports just flag stuff like this. Usually it's because a 
reporter added the Fix version improperly, but here it looks like the PR had been 
resolved.

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts it into the local env. 
> So the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around this if you are running in cluster mode by setting it on the 
> client, like:
> PYTHONPATH=./addon/python/ spark-submit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-07-19 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549348#comment-16549348
 ] 

Thomas Graves commented on SPARK-22151:
---

[~srowen] why did you blank out the Fix Version and transition this back to In 
Progress?

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set PYTHONPATH via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The yarn Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts the value into the local env, 
> so the python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client, like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Resolved] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2018-07-19 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-22151.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set PYTHONPATH via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The yarn Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv it puts the value into the local env, 
> so the python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client, like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549345#comment-16549345
 ] 

Thomas Graves edited comment on SPARK-24615 at 7/19/18 2:28 PM:


But my point is exactly that: it shouldn't be yet another mechanism for this; 
why not make a generic one that can handle all resources? The dynamic 
allocation mechanism should ideally also handle GPUs, as I've stated above.


was (Author: tgraves):
But my point is exactly that: it shouldn't be yet another mechanism for this; 
why not make a generic one that can handle all resources? 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]
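
To make the first numbered point above concrete with a hedged, made-up example: on 
a node with 48 CPU cores but only 4 GPUs, a scheduler that counts only cores (with 
spark.task.cpus=1) would happily place 48 concurrent tasks on that node, while at 
most 4 of them could actually obtain a GPU; the remaining accelerator-required 
tasks would fail or contend for devices.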






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549345#comment-16549345
 ] 

Thomas Graves commented on SPARK-24615:
---

But my point is exactly that: it shouldn't be yet another mechanism for this; 
why not make a generic one that can handle all resources? 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Assigned] (SPARK-24858) Avoid unnecessary parquet footer reads

2018-07-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24858:


Assignee: Gengliang Wang

> Avoid unnecessary parquet footer reads
> --
>
> Key: SPARK-24858
> URL: https://issues.apache.org/jira/browse/SPARK-24858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the same Parquet footer is read twice in the function 
> `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is 
> enabled.
> Fix it with simple changes.






[jira] [Resolved] (SPARK-24858) Avoid unnecessary parquet footer reads

2018-07-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24858.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21814
[https://github.com/apache/spark/pull/21814]

> Avoid unnecessary parquet footer reads
> --
>
> Key: SPARK-24858
> URL: https://issues.apache.org/jira/browse/SPARK-24858
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the same Parquet footer is read twice in the function 
> `buildReaderWithPartitionValues` of ParquetFileFormat if filter push down is 
> enabled.
> Fix it with simple changes.






[jira] [Commented] (SPARK-24838) Support uncorrelated IN/EXISTS subqueries for more operators

2018-07-19 Thread Maurits van der Goes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549293#comment-16549293
 ] 

Maurits van der Goes commented on SPARK-24838:
--

Worked with [~hvanhovell] on a fix. This is what I currently have: 
[https://github.com/apache/spark/compare/master...vanderGoes:SPARK-24838]

Will polish it a bit and add comments before opening a PR.

> Support uncorrelated IN/EXISTS subqueries for more operators 
> -
>
> Key: SPARK-24838
> URL: https://issues.apache.org/jira/browse/SPARK-24838
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Qifan Pu
>Priority: Major
>  Labels: spree
>
> Currently, CheckAnalysis allows IN/EXISTS subqueries only in Filter operators. 
> Running a query such as:
> {{select name in (select * from valid_names) from all_names}}
> returns the error:
> {code:java}
> Error in SQL statement: AnalysisException: IN/EXISTS predicate sub-queries 
> can only be used in a Filter
> {code}
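
As a hedged illustration (assuming {{valid_names}} has a single column {{name}} and 
an active SparkSession named {{spark}}), one possible workaround until this is 
supported is to express the membership test as a left join instead of an IN 
predicate in the SELECT list. Note it is not a drop-in replacement: it yields false 
where SQL's IN would yield NULL.
{code:scala}
// Fails today with "IN/EXISTS predicate sub-queries can only be used in a Filter":
// spark.sql("SELECT name IN (SELECT name FROM valid_names) AS name_is_valid FROM all_names")

// Workaround sketch: rewrite the membership test as a left join.
val result = spark.sql("""
  SELECT a.name, (b.name IS NOT NULL) AS name_is_valid
  FROM all_names a
  LEFT JOIN (SELECT DISTINCT name FROM valid_names) b
    ON a.name = b.name
""")
result.show()
{code}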






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549276#comment-16549276
 ] 

Saisai Shao commented on SPARK-24615:
-

Thanks [~tgraves] for the suggestion. 
{quote}Once I get to the point where I want to do the ML, I want to ask for the 
GPUs as well as ask for more memory during that stage, because I didn't need more 
before this stage for all the ETL work. I realize you already have executors, 
but ideally Spark with the cluster manager could potentially release the 
existing ones and ask for new ones with those requirements.
{quote}
Yes, I already discussed this with my colleague offline; it is a valid scenario, 
but I think achieving it would require changing the current dynamic resource 
allocation mechanism. Currently I have marked this as a Non-Goal in the proposal 
and only focus on static resource requests (--executor-cores, 
--executor-gpus). I think we should support it later.
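
For illustration only, the static model described here might look like this at 
submission time ({{--executor-cores}} exists today; {{--executor-gpus}} is just the 
syntax sketched in this discussion, not an existing spark-submit flag, and the 
application file name is a placeholder):
{noformat}
spark-submit \
  --master yarn --deploy-mode cluster \
  --executor-cores 4 \
  --executor-gpus 1 \
  my_ml_job.py
{noformat}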

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549266#comment-16549266
 ] 

Thomas Graves commented on SPARK-24615:
---

I think the usage for CPU/memory is the same. You know one job or stage has x 
number of tasks, and perhaps they are caching at that time and need more memory 
or CPU. For instance, look at your accelerator example. Let's say I'm doing some 
ETL before my ML. Once I get to the point where I want to do the ML, I want to ask 
for the GPUs as well as ask for more memory during that stage, because I didn't 
need more before this stage for all the ETL work. I realize you already have 
executors, but ideally Spark with the cluster manager could potentially release 
the existing ones and ask for new ones with those requirements. Doing the latter 
part is obviously a separate task, but I think if we are creating an interface we 
should keep those cases in mind.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.

2018-07-19 Thread Kyle Prifogle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549220#comment-16549220
 ] 

Kyle Prifogle commented on SPARK-12126:
---

Thanks [~hyukjin.kwon], I'll attach the relevant story here for anyone else 
who happens upon this:

https://issues.apache.org/jira/browse/SPARK-22386

> JDBC datasource processes filters only commonly pushed down.
> 
>
> Key: SPARK-12126
> URL: https://issues.apache.org/jira/browse/SPARK-12126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Major
>
> As suggested 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646],
>  currently the JDBC datasource only processes the filters pushed down from 
> {{DataSourceStrategy}}.
> Unlike ORC or Parquet, it can process quite a lot of filters (for example, 
> a + b > 3) since it is just about string parsing.
> As suggested 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526],
>  using the {{CatalystScan}} trait might be one of the solutions.
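
The point that JDBC filter pushdown "is just about string parsing" can be 
illustrated with a hedged, non-authoritative sketch (this is not Spark's actual 
JDBCRDD code; it only uses the public {{org.apache.spark.sql.sources}} filter 
classes, and real code must also escape values and quote identifiers):
{code:scala}
import org.apache.spark.sql.sources._

// Compile a pushed-down filter into a WHERE-clause fragment, or None if unsupported.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)     => Some(s"$attr = '$value'")
  case GreaterThan(attr, value) => Some(s"$attr > '$value'")
  case IsNotNull(attr)          => Some(s"$attr IS NOT NULL")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case _ => None // anything unsupported is evaluated by Spark after the scan
}

// compileFilter(And(IsNotNull("a"), GreaterThan("b", 3)))
//   => Some("(a IS NOT NULL) AND (b > '3')")
{code}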






[jira] [Commented] (SPARK-16854) mapWithState Support for Python

2018-07-19 Thread Joost Verdoorn (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549209#comment-16549209
 ] 

Joost Verdoorn commented on SPARK-16854:


mapWithState would be extremely helpful in Python. Are there any plans to support 
this soon?

> mapWithState Support for Python
> ---
>
> Key: SPARK-16854
> URL: https://issues.apache.org/jira/browse/SPARK-16854
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Boaz
>Priority: Major
>






