[jira] [Updated] (SPARK-46198) Unexpected Shuffle Read when using cached DataFrame

2023-12-01 Thread Vitaliy Savkin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaliy Savkin updated SPARK-46198:
---
Description: 
When a computation is based on a cached DataFrame, I expect to see no Shuffle
Reads, but they happen under certain circumstances.

*Reproduction*
{code:scala}
val ctx: SQLContext = // init context
val root = "s3a://af-data-eu-west-1-stg-parquet/vitalii-test-coalesce"

def populateAndRead(tag: String): DataFrame = {
  val path = s"$root/numbers_$tag"
  // One-time data population; mode("ignore") skips the write if the path already exists.
  //  import ctx.implicits._
  //  import org.apache.spark.sql.functions.lit
  //  (0 to 10 * 1000 * 1000)
  //    .toDF("id")
  //    .withColumn(tag, lit(tag.toUpperCase))
  //    .repartition(100)
  //    .write
  //    .option("header", "true")
  //    .mode("ignore")
  //    .csv(path)

  ctx.read.option("header", "true").csv(path).withColumnRenamed("id", tag + "_id")
}

val dfa = populateAndRead("a1")
val dfb = populateAndRead("b1")
val res =
  dfa.join(dfb, dfa("a1_id") === dfb("b1_id"))
    .unionByName(dfa.join(dfb, dfa("a1") === dfb("b1")))
    .cache()

println(res.count())
res.coalesce(1).write.mode("overwrite").csv(s"$root/numbers")
{code}
Relevant configs:
{code:scala}
spark.executor.instances=10
spark.executor.cores=7
spark.executor.memory=40g
spark.executor.memoryOverhead=5g

spark.shuffle.service.enabled=true
spark.sql.adaptive.enabled=false
spark.sql.autoBroadcastJoinThreshold=-1
{code}
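
For illustration, the same settings applied programmatically when building a session (a sketch, not our exact launcher setup; executor resources normally have to be fixed at submit time):
{code:scala}
// Sketch only: equivalent settings via the SparkSession builder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cached-df-shuffle-repro") // illustrative name
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "7")
  .config("spark.executor.memory", "40g")
  .config("spark.executor.memoryOverhead", "5g")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.sql.adaptive.enabled", "false")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
{code}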
The physical plan shows that the cache is used:
{code:scala}
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (27)
+- Coalesce (26)
   +- InMemoryTableScan (1)
         +- InMemoryRelation (2)
               +- Union (25)
                  :- * SortMergeJoin Inner (13)
                  :  :- * Sort (7)
                  :  :  +- Exchange (6)
                  :  :     +- * Project (5)
                  :  :        +- * Filter (4)
                  :  :           +- Scan csv  (3)
                  :  +- * Sort (12)
                  :     +- Exchange (11)
                  :        +- * Project (10)
                  :           +- * Filter (9)
                  :              +- Scan csv  (8)
                  +- * SortMergeJoin Inner (24)
                     :- * Sort (18)
                     :  +- Exchange (17)
                     :     +- * Project (16)
                     :        +- * Filter (15)
                     :           +- Scan csv  (14)
                     +- * Sort (23)
                        +- Exchange (22)
                           +- * Project (21)
                              +- * Filter (20)
                                 +- Scan csv  (19)
{code}
But when running on YARN, the CSV write job performs Shuffle Reads.

!shuffle.png!

*Additional info*
 - I was unable to reproduce it with Spark in local mode.
 - If {{.withColumnRenamed("id", tag + "_id")}} is dropped and the join conditions are changed to just {{"id"}}, the issue disappears (see the sketch after this list).
 - This behaviour is stable; it is not caused by failed instances.
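
For clarity, a minimal sketch of that variant as we read it ({{read}} and the join shape are illustrative, not our exact code):
{code:scala}
// Hypothetical variant: the "id" column is left unrenamed and both joins use it.
def read(tag: String): DataFrame =
  ctx.read.option("header", "true").csv(s"$root/numbers_$tag")

val dfa = read("a1")
val dfb = read("b1")
val res =
  dfa.join(dfb, Seq("id"))                 // using-column join on the shared "id"
    .unionByName(dfa.join(dfb, Seq("id")))
    .cache()
{code}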

*Production impact*

Without the cache, saving data in production takes much longer (30 seconds vs. 18
seconds). To avoid the shuffle reads, we had to add a {{repartition}} step before
{{cache}} as a workaround, which reduced the write time from 18 seconds to 10.
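
A minimal sketch of that workaround (the partition count shown is illustrative):
{code:scala}
val res =
  dfa.join(dfb, dfa("a1_id") === dfb("b1_id"))
    .unionByName(dfa.join(dfb, dfa("a1") === dfb("b1")))
    .repartition(100) // added before cache(); the write then reuses the cached partitioning
    .cache()
{code}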

[jira] [Created] (SPARK-46198) Unexpected Shuffle Read when using cached DataFrame

2023-11-30 Thread Vitaliy Savkin (Jira)
Vitaliy Savkin created SPARK-46198:
--

 Summary: Unexpected Shuffle Read when using cached DataFrame
 Key: SPARK-46198
 URL: https://issues.apache.org/jira/browse/SPARK-46198
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Vitaliy Savkin
 Attachments: shuffle.png


[jira] [Updated] (SPARK-46198) Unexpected Shuffle Read when using cached DataFrame

2023-11-30 Thread Vitaliy Savkin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaliy Savkin updated SPARK-46198:
---
Attachment: shuffle.png



[jira] [Updated] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-01-10 Thread Vitaliy Savkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaliy Savkin updated SPARK-26587:
---
Description: 
One time in a month (~1000 runs) one of our spark applications freezes at 
startup. jstack says that there is a deadlock. Please see locks 
0x802c00c0 and 0x8271bb98 in stacktraces below.
{noformat}
"Driver":
at java.lang.Package.getSystemPackage(Package.java:540)
- waiting to lock <0x802c00c0> (a java.util.HashMap)
at java.lang.ClassLoader.getPackage(ClassLoader.java:1625)
at java.net.URLClassLoader.getAndVerifyPackage(URLClassLoader.java:394)
at java.net.URLClassLoader.definePackageInternal(URLClassLoader.java:420)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:452)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x82789598> (a org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:221)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:294)
at java.security.AccessController.doPrivileged(Native Method)
at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
at javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:120)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2516)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
- locked <0x8271bb98> (a org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2189)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2702)
at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
at java.net.URL.getURLStreamHandler(URL.java:1142)
at java.net.URL.<init>(URL.java:599)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at java.net.JarURLConnection.parseSpecs(JarURLConnection.java:175)
at java.net.JarURLConnection.<init>(JarURLConnection.java:158)
at sun.net.www.protocol.jar.JarURLConnection.<init>(JarURLConnection.java:81)
at sun.net.www.protocol.jar.Handler.openConnection(Handler.java:41)
at java.net.URL.openConnection(URL.java:979)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:238)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:216)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:210)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
- locked <0x82789540> (a org.apache.spark.sql.internal.NonClosableMutableURLClassLoader)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:262)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:362)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:266)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
- locked <0x8302a120> (a org.apache.spark.sql.hive.HiveExternalCatalog)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
-
{noformat}
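
For readers less familiar with this failure mode, a minimal self-contained sketch of the inverse lock-ordering (ABBA) pattern the traces imply; this is illustrative code, not Spark's:
{code:scala}
object DeadlockSketch {
  // lockA stands in for the JDK's system-package map (0x802c00c0),
  // lockB for the Hadoop Configuration instance (0x8271bb98).
  private val lockA = new Object
  private val lockB = new Object

  private def spawn(name: String)(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body }, name)
    t.start(); t
  }

  def main(args: Array[String]): Unit = {
    val driver = spawn("Driver")  { lockB.synchronized { Thread.sleep(100); lockA.synchronized {} } }
    val ui     = spawn("SparkUI") { lockA.synchronized { Thread.sleep(100); lockB.synchronized {} } }
    driver.join(); ui.join() // never returns: each thread waits for the lock the other holds
  }
}
{code}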

[jira] [Created] (SPARK-26587) Deadlock between SparkUI thread and Driver thread

2019-01-10 Thread Vitaliy Savkin (JIRA)
Vitaliy Savkin created SPARK-26587:
--

 Summary: Deadlock between SparkUI thread and Driver thread  
 Key: SPARK-26587
 URL: https://issues.apache.org/jira/browse/SPARK-26587
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: EMR 5.9.0
Reporter: Vitaliy Savkin

