Done. https://issues.apache.org/jira/browse/SPARK-32130
On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hello Sanjeev,
>
> It is hard to troubleshoot the issue without the input files. Could you
> open a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and
> attach the JSON files there (or samples, or code that generates the JSON
> files)?
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> wrote:
>
>> It has read everything. As you can see, the count timing is still
>> smaller in Spark 2.4.
>>
>> Spark 2.4:
>>
>> scala> spark.time(spark.read.json("/data/20200528"))
>> Time taken: 19691 ms
>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>>
>> scala> spark.time(res61.count())
>> Time taken: 7113 ms
>> res64: Long = 2605349
>>
>> Spark 3.0:
>>
>> scala> spark.time(spark.read.json("/data/20200528"))
>> 20/06/29 08:06:53 WARN package: Truncated the string representation of a
>> plan since it was too large. This behavior can be adjusted by setting
>> 'spark.sql.debug.maxToStringFields'.
>> Time taken: 849652 ms
>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>>
>> scala> spark.time(res0.count())
>> Time taken: 8201 ms
>> res2: Long = 2605349
>>
>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com> wrote:
>>
>>> Could you share your code? Are you sure your Spark 2.4 cluster had
>>> indeed read anything? It looks like the Input Size field is empty
>>> under 2.4.
>>>
>>> -- ND
>>>
>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>
>>> I have a large number of JSON files that Spark 2.4 can read in 36
>>> seconds, but Spark 3.0 takes almost 33 minutes to read the same data.
>>> On closer analysis, it looks like Spark 3.0 is choosing a different
>>> DAG than Spark 2.4. Does anyone have any idea what is going on? Is
>>> there a configuration problem with Spark 3.0?
>>>
>>> Here are the details:
>>>
>>> *Spark 2.4*
>>>
>>> Summary Metrics for 2203 Completed Tasks
>>>
>>> Metric     Min     25th percentile  Median  75th percentile  Max
>>> Duration   0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
>>> GC Time    0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms
>>>
>>> Aggregated Metrics by Executor
>>>
>>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted
>>> driver       10.0.0.8:49159  36 s       2203         0             0             2203             false
>>>
>>> *Spark 3.0*
>>>
>>> Summary Metrics for 8 Completed Tasks
>>>
>>> Metric                Min               25th percentile   Median            75th percentile   Max
>>> Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
>>> GC Time               3 s               3 s               3 s               4 s               4 s
>>> Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624
>>>
>>> Aggregated Metrics by Executor
>>>
>>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted  Input Size / Records
>>> driver       10.0.0.8:50224  33 min     8            0             0             8                false        136.1 MiB / 451999
>>>
>>> The DAG is also different.
>>>
>>> Spark 2.4 DAG:
>>> [image: Screenshot 2020-06-27 16.30.26.png]
>>>
>>> Spark 3.0 DAG:
>>> [image: Screenshot 2020-06-27 16.32.32.png]
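
For anyone who lands on this thread with the same symptom: a plausible
mitigation, assuming the extra time is going into JSON schema inference
rather than the read itself, is to supply an explicit schema so that
spark.read.json skips the inference scan entirely. The sketch below is
illustrative only: the schema fields beyond "created" and "id" are
assumptions based on the DataFrame output quoted above ("... 5 more
fields"), and the final line uses the JSON reader option inferTimestamp,
which controls whether Spark 3.x tries to detect timestamp columns while
inferring the schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("json-read").getOrCreate()

    // Hypothetical schema: only "created" and "id" are visible in the
    // quoted output; the remaining fields would come from your own data.
    val schema = StructType(Seq(
      StructField("created", LongType),
      StructField("id", StringType)
      // ... add the remaining five fields here
    ))

    // With an explicit schema, the inference pass is skipped and the
    // JSON files are scanned only once, by the actual read.
    val df = spark.time(spark.read.schema(schema).json("/data/20200528"))

    // Alternatively, keep inference but turn off timestamp detection
    // during the inference scan.
    val df2 = spark.read.option("inferTimestamp", "false").json("/data/20200528")

The trade-off is the usual one: an explicit schema is fastest and makes the
column types deterministic, but it must be kept in sync with the data, while
disabling timestamp inference keeps the convenience of inferred schemas at
the cost of one full inference scan.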