Done. https://issues.apache.org/jira/browse/SPARK-32130
On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hello Sanjeev,
>
> It is hard to troubleshoot the issue without the input files. Could you
> open a JIRA ticket at https://issues.apache.org/jira/projects/SPARK and
> attach the JSON files there (or samples, or code that generates the JSON
> files)?
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <sanjeev.mis...@gmail.com> wrote:
>
>> It has read everything. As you can see, the count timing is still
>> smaller in Spark 2.4.
>>
>> Spark 2.4:
>>
>> scala> spark.time(spark.read.json("/data/20200528"))
>> Time taken: 19691 ms
>> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>>
>> scala> spark.time(res61.count())
>> Time taken: 7113 ms
>> res64: Long = 2605349
>>
>> Spark 3.0:
>>
>> scala> spark.time(spark.read.json("/data/20200528"))
>> 20/06/29 08:06:53 WARN package: Truncated the string representation of a
>> plan since it was too large. This behavior can be adjusted by setting
>> 'spark.sql.debug.maxToStringFields'.
>> Time taken: 849652 ms
>> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
>>
>> scala> spark.time(res0.count())
>> Time taken: 8201 ms
>> res2: Long = 2605349
>>
>> On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <arte...@dtechspace.com> wrote:
>>
>>> Could you share your code? Are you sure your Spark 2.4 cluster had
>>> indeed read anything? It looks like the Input Size field is empty
>>> under 2.4.
>>>
>>> -- ND
>>>
>>> On 6/27/20 7:58 PM, Sanjeev Mishra wrote:
>>>
>>> I have a large number of JSON files that Spark 2.4 can read in 36
>>> seconds, but Spark 3.0 takes almost 33 minutes to read the same data.
>>> On closer analysis, it looks like Spark 3.0 is choosing a different
>>> DAG than Spark 2.4. Does anyone have any idea what is going on? Is
>>> there a configuration problem with Spark 3.0?
>>>
>>> Here are the details:
>>>
>>> *Spark 2.4*
>>>
>>> Summary Metrics for 2203 Completed Tasks
>>>
>>> Metric     Min     25th percentile  Median  75th percentile  Max
>>> Duration   0.0 ms  0.0 ms           0.0 ms  1.0 ms           62.0 ms
>>> GC Time    0.0 ms  0.0 ms           0.0 ms  0.0 ms           11.0 ms
>>>
>>> Aggregated Metrics by Executor
>>>
>>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted
>>> driver       10.0.0.8:49159  36 s       2203         0             0             2203             false
>>>
>>> *Spark 3.0*
>>>
>>> Summary Metrics for 8 Completed Tasks
>>>
>>> Metric                Min               25th percentile   Median            75th percentile   Max
>>> Duration              3.8 min           4.0 min           4.1 min           4.4 min           5.0 min
>>> GC Time               3 s               3 s               3 s               4 s               4 s
>>> Input Size / Records  15.6 MiB / 51028  16.2 MiB / 53303  16.8 MiB / 55259  17.8 MiB / 58148  20.2 MiB / 71624
>>>
>>> Aggregated Metrics by Executor
>>>
>>> Executor ID  Address         Task Time  Total Tasks  Failed Tasks  Killed Tasks  Succeeded Tasks  Blacklisted  Input Size / Records
>>> driver       10.0.0.8:50224  33 min     8            0             0             8                false        136.1 MiB / 451999
>>>
>>> The DAG is also different.
>>>
>>> Spark 2.4 DAG:
>>> [image: Screenshot 2020-06-27 16.30.26.png]
>>>
>>> Spark 3.0 DAG:
>>> [image: Screenshot 2020-06-27 16.32.32.png]
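
For anyone who lands on this thread with the same symptom: a plausible
mitigation, assuming the extra time is going into JSON schema inference
rather than the read itself, is to supply an explicit schema so that
spark.read.json skips the inference scan entirely. The sketch below is
illustrative only: the schema fields beyond "created" and "id" are
assumptions based on the DataFrame output quoted above ("... 5 more
fields"), and the final line uses the JSON reader option inferTimestamp,
which controls whether Spark 3.x tries to detect timestamp columns while
inferring the schema.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("json-read").getOrCreate()

    // Hypothetical schema: only "created" and "id" are visible in the
    // quoted output; the remaining fields would come from your own data.
    val schema = StructType(Seq(
      StructField("created", LongType),
      StructField("id", StringType)
      // ... add the remaining five fields here
    ))

    // With an explicit schema, the inference pass is skipped and the
    // JSON files are scanned only once, by the actual read.
    val df = spark.time(spark.read.schema(schema).json("/data/20200528"))

    // Alternatively, keep inference but turn off timestamp detection
    // during the inference scan.
    val df2 = spark.read.option("inferTimestamp", "false").json("/data/20200528")

The trade-off is the usual one: an explicit schema is fastest and makes the
column types deterministic, but it must be kept in sync with the data, while
disabling timestamp inference keeps the convenience of inferred schemas at
the cost of one full inference scan.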