[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-07-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32130:
--
Target Version/s: 3.0.1

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
> To reproduce the issue please untar the attachment - it will have multiple 
> .json.gz files whose contents will look similar to following
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Sanjeev Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjeev Mishra updated SPARK-32130:
---
Description: 
We are planning to move to Spark 3 but the read performance of our json files 
is unacceptable. Following is the performance numbers when compared to Spark 2.4

 
 Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
 Time taken: {color:#ff}19691 ms{color}
 res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
more fields]
 scala> spark.time(res61.count())
 Time taken: {color:#ff}7113 ms{color}
 res64: Long = 2605349
 Spark 3.0
 scala> spark.time(spark.read.json("/data/20200528"))
 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
 Time taken: {color:#ff}849652 ms{color}
 res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more 
fields]
 scala> spark.time(res0.count())
 Time taken: {color:#ff}8201 ms{color}
 res2: Long = 2605349
  
  
 I am attaching a sample data (please delete is once you are able to reproduce 
the issue) that is much smaller than the actual size but the performance 
comparison can still be verified.

The sample tar contains bunch of json.gz files, each line of the file is self 
contained json doc as shown below

To reproduce the issue please untar the attachment - it will have multiple 
.json.gz files whose contents will look similar to following

 
{quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
 Arris FastPath Speed 
Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 
EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
 Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color}
{quote}
{quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
 Arris FastPath Speed 
Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 
EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","first

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: SPARK 32130 - replication and findings.ipynb

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\{"0":{"Enable":"1

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: (was: SPARK 32130 - replication and findings.ipynb)

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertise

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-30 Thread Gourav (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gourav updated SPARK-32130:
---
Attachment: SPARK 32130 - replication and findings.ipynb

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: SPARK 32130 - replication and findings.ipynb, 
> small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
>  Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
>  Time taken: {color:#ff}19691 ms{color}
>  res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res61.count())
>  Time taken: {color:#ff}7113 ms{color}
>  res64: Long = 2605349
>  Spark 3.0
>  scala> spark.time(spark.read.json("/data/20200528"))
>  20/06/29 08:06:53 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  Time taken: {color:#ff}849652 ms{color}
>  res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
>  scala> spark.time(res0.count())
>  Time taken: {color:#ff}8201 ms{color}
>  res2: Long = 2605349
>   
>   
>  I am attaching a sample data (please delete is once you are able to 
> reproduce the issue) that is much smaller than the actual size but the 
> performance comparison can still be verified.
> The sample tar contains bunch of json.gz files, each line of the file is self 
> contained json doc as shown below
>  
> {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
>  HNC IGD","Annex F 
> Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
>  Arris FastPath Speed 
> Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
>  HNC IGD 
> EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
>  Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
> Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\{"0":{"Enable":"1

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-29 Thread Sanjeev Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjeev Mishra updated SPARK-32130:
---
Description: 
We are planning to move to Spark 3 but the read performance of our json files 
is unacceptable. Following is the performance numbers when compared to Spark 2.4

 
 Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
 Time taken: {color:#ff}19691 ms{color}
 res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
more fields]
 scala> spark.time(res61.count())
 Time taken: {color:#ff}7113 ms{color}
 res64: Long = 2605349
 Spark 3.0
 scala> spark.time(spark.read.json("/data/20200528"))
 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.sql.debug.maxToStringFields'.
 Time taken: {color:#ff}849652 ms{color}
 res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more 
fields]
 scala> spark.time(res0.count())
 Time taken: {color:#ff}8201 ms{color}
 res2: Long = 2605349
  
  
 I am attaching a sample data (please delete is once you are able to reproduce 
the issue) that is much smaller than the actual size but the performance 
comparison can still be verified.

The sample tar contains bunch of json.gz files, each line of the file is self 
contained json doc as shown below

 
{quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports
 Arris FastPath Speed 
Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 
EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service
 Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First 
Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}{color}{quote}
{quote}{color:#741b47}{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS
 HNC IGD","Annex F 
Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports
 Arris FastPath Speed 
Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS
 HNC IGD 
EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total
 Control","GPON_100M_100M","Self-Service 
Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","G

[jira] [Updated] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

2020-06-29 Thread Sanjeev Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanjeev Mishra updated SPARK-32130:
---
Attachment: small-anon.tar

> Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
> --
>
> Key: SPARK-32130
> URL: https://issues.apache.org/jira/browse/SPARK-32130
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
> Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, 
> sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 
> 10.0.0.8 instead (on interface en0)
> 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> Spark context Web UI available at http://10.0.0.8:4041
> Spark context available as 'sc' (master = local[*], app id = 
> local-1593442346864).
> Spark session available as 'spark'.
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0
>  /_/
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_251)
> Type in expressions to have them evaluated.
> Type :help for more information.
>Reporter: Sanjeev Mishra
>Priority: Critical
> Attachments: small-anon.tar
>
>
> We are planning to move to Spark 3 but the read performance of our json files 
> is unacceptable. Following is the performance numbers when compared to Spark 
> 2.4
>  
> Spark 2.4
> scala> spark.time(spark.read.json("/data/20200528"))
> Time taken: {color:#ff}19691 ms{color}
> res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
> scala> spark.time(res61.count())
> Time taken: {color:#ff}7113 ms{color}
> res64: Long = 2605349
> Spark 3.0
> scala> spark.time(spark.read.json("/data/20200528"))
> 20/06/29 08:06:53 WARN package: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> Time taken: {color:#ff}849652 ms{color}
> res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 
> more fields]
> scala> spark.time(res0.count())
> Time taken: {color:#ff}8201 ms{color}
> res2: Long = 2605349
>  
>  
> I am attaching a sample data (please delete is once you are able to reproduce 
> the issue) that is much smaller than the actual size but the performance 
> comparison can still be verified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org