[jira] [Commented] (SPARK-38536) Spark 3 can not read mixed format partitions

2022-03-15 Thread Deegue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506728#comment-17506728
 ] 

Deegue commented on SPARK-38536:


I wonder if it's caused by the different Hadoop versions your ORC and Parquet are 
based on. Can you check? [~songhuicheng]

> Spark 3 can not read mixed format partitions
> 
>
> Key: SPARK-38536
> URL: https://issues.apache.org/jira/browse/SPARK-38536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.2.1
>Reporter: Huicheng Song
>Priority: Major
>
> Spark 3.x reads partitions with the table's input format, which fails when a 
> partition has a different input format than the table.
> This is a regression introduced by SPARK-26630. Before that fix, Spark would 
> use the partition's InputFormat when creating the HadoopRDD. With that fix, 
> Spark uses only the table's InputFormat when creating the HadoopRDD, causing failures.
> Reading mixed-format partitions is an important scenario, especially for format 
> migration. It is also well supported in query engines like Hive and Presto.
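
As a side note for anyone inspecting such a table: below is a minimal sketch in the Spark shell, assuming a hypothetical Hive table `mixed_tbl` whose table-level format is Parquet while an older partition `dt=2022-03-01` still holds ORC data (the typical mid-migration state, usually produced from the Hive side). Names and the partition column are assumptions, not taken from the report.

{code:scala}
// Compare the table-level and partition-level InputFormat recorded in the metastore.
spark.sql("DESCRIBE FORMATTED mixed_tbl")
  .where("col_name like 'InputFormat%'").show(false)
spark.sql("DESCRIBE FORMATTED mixed_tbl PARTITION (dt = '2022-03-01')")
  .where("col_name like 'InputFormat%'").show(false)

// Per the report, Spark 3.x builds the HadoopRDD from the table-level InputFormat
// only, so scanning the ORC partition fails, while Hive and Presto read each
// partition with its own format.
spark.sql("SELECT * FROM mixed_tbl").show()
{code}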






[jira] [Commented] (SPARK-38536) Spark 3 can not read mixed format partitions

2022-03-14 Thread Deegue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506678#comment-17506678
 ] 

Deegue commented on SPARK-38536:


Thanks [~hyukjin.kwon]. [~songhuicheng], I don't think this PR would change how 
we read a table through its InputFormat class. Could you describe the issue in more 
detail, e.g. the exception or the related code?

> Spark 3 can not read mixed format partitions
> 
>
> Key: SPARK-38536
> URL: https://issues.apache.org/jira/browse/SPARK-38536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.2.1
>Reporter: Huicheng Song
>Priority: Major
>
> Spark 3.x reads partitions with the table's input format, which fails when a 
> partition has a different input format than the table.
> This is a regression introduced by SPARK-26630. Before that fix, Spark would 
> use the partition's InputFormat when creating the HadoopRDD. With that fix, 
> Spark uses only the table's InputFormat when creating the HadoopRDD, causing failures.
> Reading mixed-format partitions is an important scenario, especially for format 
> migration. It is also well supported in query engines like Hive and Presto.






[jira] [Created] (SPARK-29910) Add minimum runtime limit to speculation

2019-11-14 Thread Deegue (Jira)
Deegue created SPARK-29910:
--

 Summary: Add minimum runtime limit to speculation
 Key: SPARK-29910
 URL: https://issues.apache.org/jira/browse/SPARK-29910
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Deegue


The minimum runtime for speculation used to be a fixed value of 100ms. This means 
tasks that finish within seconds will also be speculated, so more executors will be 
required.
To resolve this, we add `spark.speculation.minRuntime` to control the minimum 
runtime threshold for speculation.
By adjusting `spark.speculation.minRuntime` we can reduce the number of normal tasks 
that get speculated.
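
For illustration, a sketch of how the proposed setting would sit next to the existing speculation configs. Note that `spark.speculation.minRuntime` is the key proposed by this ticket, not an existing Spark configuration, and the values below are only examples.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values only; `spark.speculation.minRuntime` is the key proposed by
// this ticket and does not exist in released Spark versions.
val spark = SparkSession.builder()
  .appName("speculation-min-runtime-sketch")
  .config("spark.speculation", "true")
  .config("spark.speculation.interval", "100ms")   // existing: how often to check
  .config("spark.speculation.multiplier", "1.5")   // existing: slowness multiplier
  .config("spark.speculation.quantile", "0.75")    // existing: fraction that must finish first
  .config("spark.speculation.minRuntime", "300s")  // proposed: never speculate tasks shorter than this
  .getOrCreate()
{code}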






[jira] [Created] (SPARK-29786) Fix MetaException when dropping a partition not exists on HDFS.

2019-11-06 Thread Deegue (Jira)
Deegue created SPARK-29786:
--

 Summary: Fix MetaException when dropping a partition not exists on 
HDFS.
 Key: SPARK-29786
 URL: https://issues.apache.org/jira/browse/SPARK-29786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Deegue


When we drop a partition whose directory doesn't exist on HDFS, we receive a 
`MetaException`.
But actually the partition has been dropped.

In Hive, no exception is thrown in this case.

For example:
If we execute `alter table test.tmp drop partition(stat_day=20190516);`
(the partition stat_day=20190516 exists in the Hive metastore, but its directory 
doesn't exist on HDFS)

We will get:

{code:java}
Error: Error running query: MetaException(message:File does not exist: 
/user/hive/warehouse/test.db/tmp/stat_day=20190516
   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2414)
   at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4719)
   at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1237)
   at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:568)
   at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:896)
   at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:422)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274)
) (state=,code=0)
{code}
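
A hedged reproduction sketch of the state described above (database, table and warehouse path are hypothetical, and whether the exception surfaces depends on the Hive/HDFS versions in use):

{code:scala}
import org.apache.hadoop.fs.Path

// Hypothetical database, table and warehouse path.
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("CREATE TABLE test.tmp (id INT) PARTITIONED BY (stat_day STRING) STORED AS TEXTFILE")
spark.sql("ALTER TABLE test.tmp ADD PARTITION (stat_day = '20190516')")

// Remove the partition directory behind the metastore's back.
val dir = new Path("/user/hive/warehouse/test.db/tmp/stat_day=20190516")
dir.getFileSystem(spark.sparkContext.hadoopConfiguration).delete(dir, true)

// Raises the MetaException above even though the partition is actually removed
// from the metastore; Hive performs the same drop without raising an error.
spark.sql("ALTER TABLE test.tmp DROP PARTITION (stat_day = '20190516')")
{code}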








[jira] [Created] (SPARK-29785) Optimize opening a new session of Spark Thrift Server

2019-11-06 Thread Deegue (Jira)
Deegue created SPARK-29785:
--

 Summary: Optimize opening a new session of Spark Thrift Server
 Key: SPARK-29785
 URL: https://issues.apache.org/jira/browse/SPARK-29785
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Deegue


When we open a new session on the Spark Thrift Server, `use default` is executed and 
a free executor is needed to run that SQL. This behavior adds ~5 seconds to opening a 
new session, which should only cost ~100ms.
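
A small sketch that makes the session-open cost visible by timing the JDBC handshake, during which `use default` is executed. The URL and credentials are hypothetical, and the Hive JDBC driver is assumed to be on the classpath.

{code:scala}
import java.sql.DriverManager

// Hypothetical Thrift Server endpoint and credentials.
val url = "jdbc:hive2://thriftserver-host:10000/default"

val start = System.nanoTime()
val conn = DriverManager.getConnection(url, "user", "")  // session open runs `use default`
println(s"opening the session took ${(System.nanoTime() - start) / 1e6} ms")
conn.close()
{code}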






[jira] [Updated] (SPARK-28239) Allow TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Summary: Allow TCP connections created by shuffle service auto close on 
YARN NodeManagers  (was: Make TCP connections created by shuffle service auto 
close on YARN NodeManagers)

> Allow TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> 
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections (on port 7337 by default) will 
> be established by the shuffle service.
> It looks like this:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and the NodeManagers get slower and slower.
>  !screenshot-2.png! 
> These unclosed TCP connections stay busy, and setting ChannelOption.SO_KEEPALIVE 
> to true according to [SPARK-23182|https://github.com/apache/spark/pull/20512] 
> doesn't seem to take effect.
> So the solution is setting ChannelOption.AUTO_CLOSE to true, after which 
> our cluster (running 1+ jobs / day) is processing normally.
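
For context, a generic Netty 4 sketch (not Spark's actual shuffle-service bootstrap code) showing the two channel options discussed above: SO_KEEPALIVE as tried per SPARK-23182, and AUTO_CLOSE as proposed here.

{code:scala}
import io.netty.bootstrap.ServerBootstrap
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioServerSocketChannel
import io.netty.channel.{ChannelInitializer, ChannelOption}

// Generic Netty 4 sketch, not the shuffle service's real TransportServer setup.
val bossGroup = new NioEventLoopGroup(1)
val workerGroup = new NioEventLoopGroup()

val bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(classOf[NioServerSocketChannel])
  // What SPARK-23182 turns on; reported above as not helping here.
  .childOption(ChannelOption.SO_KEEPALIVE, java.lang.Boolean.TRUE)
  // What this ticket proposes: tear the child channel down automatically once a
  // write fails instead of leaving it lingering on the NodeManager.
  .childOption(ChannelOption.AUTO_CLOSE, java.lang.Boolean.TRUE)
  .childHandler(new ChannelInitializer[SocketChannel] {
    override def initChannel(ch: SocketChannel): Unit = ()  // real handlers go here
  })

val channel = bootstrap.bind(7337).sync().channel()  // 7337: default shuffle service port
{code}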






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections (on port 7337 by default) will be 
established by the shuffle service.
It looks like this:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and the NodeManagers get slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy, and setting ChannelOption.SO_KEEPALIVE 
to true according to 
[SPARK-23182|https://github.com/apache/spark/pull/20512] doesn't seem to take effect.

So the solution is setting ChannelOption.AUTO_CLOSE to true, after which 
our cluster (running 1+ jobs / day) is processing normally.

  was:
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 
[SPARK-23182|https://github.com/apache/spark/pull/20512].

So the solution is setting ChannelOption.AUTO_CLOSE to true, and after which 
our cluster(running 1+ jobs / day) is processing 


> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 
> These unclosed TCP connections stay busy and it seem doesn't take effect when 
> I set ChannelOption.SO_KEEPALIVE to true according to 
> [SPARK-23182|https://github.com/apache/spark/pull/20512].
> So the solution is setting ChannelOption.AUTO_CLOSE to true, and after which 
> our cluster(running 1+ jobs / day) is processing normally.






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 
[SPARK-23182|https://github.com/apache/spark/pull/20512].

So the solution is setting ChannelOption.AUTO_CLOSE to true, and after which 
our cluster(running 1+ jobs / day) is processing 

  was:
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 
[SPARK-23182|https://github.com/apache/spark/pull/20512].




> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 
> These unclosed TCP connections stay busy and it seem doesn't take effect when 
> I set ChannelOption.SO_KEEPALIVE to true according to 
> [SPARK-23182|https://github.com/apache/spark/pull/20512].
> So the solution is setting ChannelOption.AUTO_CLOSE to true, and after which 
> our cluster(running 1+ jobs / day) is processing 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 
[SPARK-23182|https://github.com/apache/spark/pull/20512].



  was:
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 


> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 
> These unclosed TCP connections stay busy and it seem doesn't take effect when 
> I set ChannelOption.SO_KEEPALIVE to true according to 
> [SPARK-23182|https://github.com/apache/spark/pull/20512].






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 

These unclosed TCP connections stay busy and it seem doesn't take effect when I 
set ChannelOption.SO_KEEPALIVE to true according to 

  was:
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 




> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 
> These unclosed TCP connections stay busy and it seem doesn't take effect when 
> I set ChannelOption.SO_KEEPALIVE to true according to 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 



  was:
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 



> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Environment: 
Hadoop2.6.0-CDH5.8.3(netty3)
Spark2.4.0(netty4)

Configs:
spark.shuffle.service.enabled=true

  was:
Hadoop2.6.0-CDH5.8.3(netty3)
Spark2.4.0(netty4)

set spark.shuffle.service.enabled=true


> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> Configs:
> spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: 
When executing shuffle tasks, TCP connections(on port 7337 by default) will be 
established by shuffle service.
It will like:

 !screenshot-1.png! 

However, some of the TCP connections are still busy when the task is actually 
finished. These connections won't close automatically until we restart the 
NodeManager process.

Connections pile up and NodeManagers are getting slower and slower.

 !screenshot-2.png! 


  was:When executing shuffle tasks, 


> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, TCP connections(on port 7337 by default) will 
> be established by shuffle service.
> It will like:
>  !screenshot-1.png! 
> However, some of the TCP connections are still busy when the task is actually 
> finished. These connections won't close automatically until we restart the 
> NodeManager process.
> Connections pile up and NodeManagers are getting slower and slower.
>  !screenshot-2.png! 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Attachment: screenshot-2.png

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> When executing shuffle tasks, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Attachment: (was: screenshot-1.png)

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> When executing shuffle tasks, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Attachment: screenshot-1.png

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> When executing shuffle tasks, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Attachment: screenshot-1.png

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> When executing shuffle tasks, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: When executing shuffle tasks,   (was: When we set 
spark.shuffle.service.enabled=true, )

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
>
> When executing shuffle tasks, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Description: When we set spark.shuffle.service.enabled=true, 

> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
>Reporter: Deegue
>Priority: Minor
>
> When we set spark.shuffle.service.enabled=true, 






[jira] [Updated] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-28239:
---
Environment: 
Hadoop2.6.0-CDH5.8.3(netty3)
Spark2.4.0(netty4)

set spark.shuffle.service.enabled=true

  was:
Hadoop2.6.0-CDH5.8.3(netty3)
Spark2.4.0(netty4)


> Make TCP connections created by shuffle service auto close on YARN 
> NodeManagers
> ---
>
> Key: SPARK-28239
> URL: https://issues.apache.org/jira/browse/SPARK-28239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, YARN
>Affects Versions: 2.4.0
> Environment: Hadoop2.6.0-CDH5.8.3(netty3)
> Spark2.4.0(netty4)
> set spark.shuffle.service.enabled=true
>Reporter: Deegue
>Priority: Minor
>
> When we set spark.shuffle.service.enabled=true, 






[jira] [Created] (SPARK-28239) Make TCP connections created by shuffle service auto close on YARN NodeManagers

2019-07-03 Thread Deegue (JIRA)
Deegue created SPARK-28239:
--

 Summary: Make TCP connections created by shuffle service auto 
close on YARN NodeManagers
 Key: SPARK-28239
 URL: https://issues.apache.org/jira/browse/SPARK-28239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, YARN
Affects Versions: 2.4.0
 Environment: Hadoop2.6.0-CDH5.8.3(netty3)
Spark2.4.0(netty4)
Reporter: Deegue









[jira] [Updated] (SPARK-26667) Add `Scanning Input Table` to Performance Tuning Guide

2019-01-19 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-26667:
---
Description: 
We can use `CombineTextInputFormat` instead of `TextInputFormat` and set 
configurations to increase the speed while reading a table. 

There's no need to add spark configurations,  
[PR#23506|https://github.com/apache/spark/pull/23506], so add it to the 
Performance Tuning.

  was:
We can use `CombineTextInputFormat` instead of `TextInputFormat` and set 
configurations to increase the speed while reading a table. 

There's no need to add spark configurations,  [link 
title|[https://github.com/apache/spark/pull/23506],] so add it to the 
Performance Tuning.


> Add `Scanning Input Table` to Performance Tuning Guide
> --
>
> Key: SPARK-26667
> URL: https://issues.apache.org/jira/browse/SPARK-26667
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Minor
>
> We can use `CombineTextInputFormat` instead of `TextInputFormat` and set 
> configurations to increase the speed while reading a table. 
> There's no need to add spark configurations,  
> [PR#23506|https://github.com/apache/spark/pull/23506], so add it to the 
> Performance Tuning.
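
For illustration, a sketch of the kind of snippet such a tuning-guide section could show (the input path is hypothetical): reading text through CombineTextInputFormat and bounding split sizes with the standard Hadoop property, with no new Spark configuration involved.

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Standard Hadoop split-size knob; no new Spark configuration is needed.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "268435456")  // 256 MB

// Combine many small text files into fewer, larger splits (path is hypothetical).
val rdd = sc.newAPIHadoopFile[LongWritable, Text, CombineTextInputFormat](
  "/user/hive/warehouse/some_db.db/some_text_table")

println(rdd.getNumPartitions)  // far fewer partitions than one per small file
rdd.map(_._2.toString).take(5).foreach(println)
{code}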






[jira] [Created] (SPARK-26667) Add `Scanning Input Table` to Performance Tuning Guide

2019-01-19 Thread Deegue (JIRA)
Deegue created SPARK-26667:
--

 Summary: Add `Scanning Input Table` to Performance Tuning Guide
 Key: SPARK-26667
 URL: https://issues.apache.org/jira/browse/SPARK-26667
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.1, 3.0.0
Reporter: Deegue


We can use `CombineTextInputFormat` instead of `TextInputFormat` and set a few 
configurations to increase the speed of reading a table.

Since there's no need to add new Spark configurations (see 
[PR#23506|https://github.com/apache/spark/pull/23506]), we add this to the 
Performance Tuning guide instead.






[jira] [Commented] (SPARK-26630) ClassCastException in TableReader while creating HadoopRDD

2019-01-19 Thread Deegue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746985#comment-16746985
 ] 

Deegue commented on SPARK-26630:


Hi [~dongjoon],

It seems all of the versions are affected, because we only used HadoopRDD 
before. And I don't know which `Affects Version/s:` to choose...

> ClassCastException in TableReader while creating HadoopRDD
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Major
>
> This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506].
> It throws a ClassCastException when we use a new-API input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD. So we need to 
> use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception :
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Updated] (SPARK-26630) ClassCastException in TableReader while creating HadoopRDD

2019-01-19 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-26630:
---
Affects Version/s: 2.4.1
   2.4.0

> ClassCastException in TableReader while creating HadoopRDD
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Major
>
> This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR 
> #23506).
> It will throw ClassCastException when we use new input format (eg. 
> `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to 
> use NewHadoopRDD to deal with this input format in TableReader.scala.
> Exception :
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Created] (SPARK-26630) ClassCastException in TableReader while creating HadoopRDD

2019-01-15 Thread Deegue (JIRA)
Deegue created SPARK-26630:
--

 Summary: ClassCastException in TableReader while creating HadoopRDD
 Key: SPARK-26630
 URL: https://issues.apache.org/jira/browse/SPARK-26630
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Deegue


This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506].

It throws a ClassCastException when we use a new-API input format (e.g. 
`org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD. So we need to 
use NewHadoopRDD to handle this input format in TableReader.scala.

Exception:
{noformat}
Caused by: java.lang.ClassCastException: 
org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
org.apache.hadoop.mapred.InputFormat
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 87 more
{noformat}
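
For context, a sketch of the old-API vs new-API split behind this exception (the input path is hypothetical): `sc.hadoopFile`/HadoopRDD works with `org.apache.hadoop.mapred.InputFormat`, while new-API classes must go through `sc.newAPIHadoopFile`/NewHadoopRDD, which is the path the fix adds in TableReader.

{code:scala}
import org.apache.hadoop.io.{LongWritable, Text}

// Old (mapred) API: the only thing HadoopRDD can cast to and instantiate.
val oldApi = sc.hadoopFile[LongWritable, Text, org.apache.hadoop.mapred.TextInputFormat](
  "/tmp/some_input")

// New (mapreduce) API: pushing this class through HadoopRDD is what produces the
// ClassCastException above; NewHadoopRDD / newAPIHadoopFile is the matching path.
val newApi = sc.newAPIHadoopFile[LongWritable, Text,
  org.apache.hadoop.mapreduce.lib.input.TextInputFormat]("/tmp/some_input")

println(oldApi.map(_._2.toString).count())
println(newApi.map(_._2.toString).count())
{code}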






[jira] [Updated] (SPARK-26577) Add input optimizer when reading Hive table by SparkSQL

2019-01-14 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-26577:
---
Description: 
When using Spark SQL, for example through the Thrift Server, if we set

`spark.sql.hive.inputFormat.optimizer.enabled=true`

the InputFormat is optimized to CombineTextInputFormat automatically when it was 
TextInputFormat before. We can also change the max/min size of input splits by 
setting, for example,

`mapreduce.input.fileinputformat.split.maxsize=268435456`

`mapreduce.input.fileinputformat.split.minsize=134217728`

Otherwise, we would have to modify Hive configs and the structure of the tables.

  was:
When using SparkSQL, for example the ThriftServer, if we set

`spark.sql.hive.fileInputFormat.enabled=true`

we can optimize the InputFormat to CombineTextInputFormat automatically if it's 
TextInputFormat before. And we can also change the max/min size of input splits 
by setting, for example

`spark.sql.hive.fileInputFormat.split.maxsize=268435456`

`spark.sql.hive.fileInputFormat.split.minsize=134217728`

 

Otherwise, we have to modify Hive Configs and structure of tables.


> Add input optimizer when reading Hive table by SparkSQL
> ---
>
> Key: SPARK-26577
> URL: https://issues.apache.org/jira/browse/SPARK-26577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Deegue
>Priority: Minor
>
> When using SparkSQL, for example the ThriftServer, if we set
> `spark.sql.hive.inputFormat.optimizer.enabled=true`
> we can optimize the InputFormat to CombineTextInputFormat automatically if 
> it's TextInputFormat before. And we can also change the max/min size of input 
> splits by setting, for example
> `mapreduce.input.fileinputformat.split.maxsize=268435456`
> `mapreduce.input.fileinputformat.split.minsize=134217728`
>  
> Otherwise, we have to modify Hive Configs and structure of tables.
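
A hedged usage sketch of the proposal: `spark.sql.hive.inputFormat.optimizer.enabled` is the switch proposed by this ticket, not an existing Spark configuration; the split sizes repeat the example values above, and the table name is hypothetical.

{code:scala}
import org.apache.spark.sql.SparkSession

// The optimizer flag below is the configuration proposed in this ticket and is
// not available in released Spark; the split-size keys are the standard Hadoop
// properties mentioned above.
val spark = SparkSession.builder()
  .appName("hive-input-optimizer-sketch")
  .enableHiveSupport()
  .config("spark.sql.hive.inputFormat.optimizer.enabled", "true")
  .config("mapreduce.input.fileinputformat.split.maxsize", "268435456")  // 256 MB
  .config("mapreduce.input.fileinputformat.split.minsize", "134217728")  // 128 MB
  .getOrCreate()

// A TextInputFormat-based Hive table would then be scanned via CombineTextInputFormat.
spark.sql("SELECT count(*) FROM some_db.some_text_table").show()
{code}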






[jira] [Updated] (SPARK-26577) Add input optimizer when reading Hive table by SparkSQL

2019-01-08 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-26577:
---
External issue URL: https://github.com/apache/spark/pull/23496

> Add input optimizer when reading Hive table by SparkSQL
> ---
>
> Key: SPARK-26577
> URL: https://issues.apache.org/jira/browse/SPARK-26577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Deegue
>Priority: Minor
>
> When using SparkSQL, for example the ThriftServer, if we set
> `spark.sql.hive.fileInputFormat.enabled=true`
> we can optimize the InputFormat to CombineTextInputFormat automatically if 
> it's TextInputFormat before. And we can also change the max/min size of input 
> splits by setting, for example
> `spark.sql.hive.fileInputFormat.split.maxsize=268435456`
> `spark.sql.hive.fileInputFormat.split.minsize=134217728`
>  
> Otherwise, we have to modify Hive Configs and structure of tables.






[jira] [Created] (SPARK-26577) Add input optimizer when reading Hive table by SparkSQL

2019-01-08 Thread Deegue (JIRA)
Deegue created SPARK-26577:
--

 Summary: Add input optimizer when reading Hive table by SparkSQL
 Key: SPARK-26577
 URL: https://issues.apache.org/jira/browse/SPARK-26577
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Deegue


When using SparkSQL, for example the ThriftServer, if we set

`spark.sql.hive.fileInputFormat.enabled=true`

we can optimize the InputFormat to CombineTextInputFormat automatically if it's 
TextInputFormat before. And we can also change the max/min size of input splits 
by setting, for example

`spark.sql.hive.fileInputFormat.split.maxsize=268435456`

`spark.sql.hive.fileInputFormat.split.minsize=134217728`

 

Otherwise, we have to modify Hive Configs and structure of tables.






[jira] [Reopened] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue reopened SPARK-24672:


Conditions for this issue:

When the amount of data I selected is larger than spark.driver.maxResultSize, it 
returns the info below and the job fails automatically.

!image4.png!

After that, several active tasks remain and occupy executors.

!image2.png!

Thanks a lot for your comment.
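
For reference, a minimal sketch of the trigger condition (the limit and dataset size are only illustrative): collecting a result larger than spark.driver.maxResultSize aborts the job, which is the failure after which the reporter observes the lingering active tasks.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values: a deliberately low limit so the failure is easy to trigger.
val spark = SparkSession.builder()
  .appName("max-result-size-sketch")
  .config("spark.driver.maxResultSize", "1m")
  .getOrCreate()

// Collecting serialized task results larger than the limit aborts the job with a
// SparkException that mentions spark.driver.maxResultSize.
val rows = spark.range(0, 10000000L).collect()
{code}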

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png, image4.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info:
> image2.png & image3.png in Attachments
>  
> I'd really appreciate it if anyone can help me...






[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Attachment: image4.png

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png, image4.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors(resources) , and I don't know why 
> they haven't been killed or stopped after its jobs failed.
>  
> More info :
> image2.png & image3.png in Attachments
>  
> I'd be very appreciated it if anyone can help me...






[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Description: 
Issues:

 

There are active tasks while no job is running.

These active tasks occupy the executors(resources) , and I don't know why they 
haven't been killed or stopped after its jobs failed.

 

More info :

image2.png & image3.png in Attachments

 

I'd be very appreciated it if anyone can help me...

  was:
Issues:

 

There are active tasks while no job is running.

These active tasks occupy the executors(resources) , and I don't know why they 
haven't been killed or stopped after its jobs failed.

 

More info :

image2.png & image3.png & image4.png  in Attachments

 

I'd be very appreciated it if anyone can help me...


> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors(resources) , and I don't know why 
> they haven't been killed or stopped after its jobs failed.
>  
> More info :
> image2.png & image3.png in Attachments
>  
> I'd be very appreciated it if anyone can help me...






[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Attachment: image3.png
image2.png

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png, image2.png, image3.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info :
> image2.png & image3.png & image4.png  in Attachments
>  
> I'd really appreciate it if anyone could help me...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Description: 
Issues:

 

There are active tasks while no job is running.

These active tasks occupy the executors (resources), and I don't know why they 
haven't been killed or stopped after their jobs failed.

 

More info :

image2.png & image3.png & image4.png  in Attachments

 

I'd really appreciate it if anyone could help me...

  was:
Issues:

 

There are active tasks while no job is running.

These active tasks occupy the executors (resources), and I don't know why they 
haven't been killed or stopped after their jobs failed.

 

!image-2018-06-28-15-18-50-877.png!

 

!image-2018-06-28-15-25-50-812.png!

 

!image-2018-06-28-15-26-54-721.png!


> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> More info :
> image2.png & image3.png & image4.png  in Attachments
>  
> I'd really appreciate it if anyone could help me...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Environment: 
hadoop 2.6.0

spark 2.2.1

CDH 5.8.3

java 1.8.0

 

More info :

image1.png in Attachments

  was:
hadoop 2.6.0

spark 2.2.1

CDH 5.8.3

java 1.8.0

 

image1.png


> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> More info :
> image1.png in Attachments
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> !image-2018-06-28-15-18-50-877.png!
>  
> !image-2018-06-28-15-25-50-812.png!
>  
> !image-2018-06-28-15-26-54-721.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Attachment: image1.png

> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> image1.png
>Reporter: Deegue
>Priority: Major
> Attachments: image1.png
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> !image-2018-06-28-15-18-50-877.png!
>  
> !image-2018-06-28-15-25-50-812.png!
>  
> !image-2018-06-28-15-26-54-721.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deegue updated SPARK-24672:
---
Environment: 
hadoop 2.6.0

spark 2.2.1

CDH 5.8.3

java 1.8.0

 

image1.png

  was:
hadoop 2.6.0

spark 2.2.1

CDH 5.8.3

java 1.8.0

 

!image-2018-06-28-15-28-11-200.png!


> No job is running but there are active tasks
> 
>
> Key: SPARK-24672
> URL: https://issues.apache.org/jira/browse/SPARK-24672
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: hadoop 2.6.0
> spark 2.2.1
> CDH 5.8.3
> java 1.8.0
>  
> image1.png
>Reporter: Deegue
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Issues:
>  
> There are active tasks while no job is running.
> These active tasks occupy the executors (resources), and I don't know why 
> they haven't been killed or stopped after their jobs failed.
>  
> !image-2018-06-28-15-18-50-877.png!
>  
> !image-2018-06-28-15-25-50-812.png!
>  
> !image-2018-06-28-15-26-54-721.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24672) No job is running but there are active tasks

2018-06-28 Thread Deegue (JIRA)
Deegue created SPARK-24672:
--

 Summary: No job is running but there are active tasks
 Key: SPARK-24672
 URL: https://issues.apache.org/jira/browse/SPARK-24672
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Spark Core, SQL
Affects Versions: 2.2.1
 Environment: hadoop 2.6.0

spark 2.2.1

CDH 5.8.3

java 1.8.0

 

!image-2018-06-28-15-28-11-200.png!
Reporter: Deegue


Issues:

 

There are active tasks while no job is running.

These active tasks occupy the executors (resources), and I don't know why they 
haven't been killed or stopped after their jobs failed.

 

!image-2018-06-28-15-18-50-877.png!

 

!image-2018-06-28-15-25-50-812.png!

 

!image-2018-06-28-15-26-54-721.png!
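
As a hedged diagnostic aid rather than a confirmed fix for this ticket: Spark 
2.x ships a task reaper that monitors tasks the scheduler has already asked to 
kill, takes thread dumps, and can bring down the executor JVM after a timeout. 
It does not explain why a kill was never issued, but it can keep such tasks 
from holding executors indefinitely. A minimal sketch of enabling it when 
building the SparkConf, with illustrative values (not tuned recommendations):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values only -- these are standard Spark 2.x settings, but
// whether the reaper applies to this particular case is an assumption.
val conf = new SparkConf()
  // Monitor tasks after the scheduler has asked them to die.
  .set("spark.task.reaper.enabled", "true")
  // How often the reaper checks on killed-but-still-running tasks.
  .set("spark.task.reaper.pollingInterval", "10s")
  // Log a thread dump for tasks that refuse to exit.
  .set("spark.task.reaper.threadDump", "true")
  // Give up and kill the executor JVM after this long.
  .set("spark.task.reaper.killTimeout", "120s")
{code}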



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org