Bruce Robbins created SPARK-26711:
-------------------------------------

             Summary: JSON Schema inference takes 15 times longer
                 Key: SPARK-26711
                 URL: https://issues.apache.org/jira/browse/SPARK-26711
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Bruce Robbins
I noticed that the first benchmark/case of JSONBenchmark ("JSON schema inferring", "No encoding") was taking an hour to run, when it used to run in 4-5 minutes. The culprit seems to be this commit: [https://github.com/apache/spark/commit/d72571e51d]

A quick look with a profiler shows it spending 99% of its time in some kind of exception handling in JsonInferSchema.scala.

You can reproduce this in the spark-shell by recreating the data used by the benchmark:

{noformat}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val rowsNum = 100 * 1000 * 1000
spark.sparkContext.range(0, rowsNum, 1)
  .map(_ => "a")
  .toDF("fieldA")
  .write
  .option("encoding", "UTF-8")
  .json("utf8.json")

// Exiting paste mode, now interpreting.

rowsNum: Int = 100000000

scala> {noformat}

Then you can run the test by hand, starting spark-shell as follows (emulating SqlBasedBenchmark):

{noformat}
bin/spark-shell --driver-memory 8g \
  --conf "spark.sql.autoBroadcastJoinThreshold=1" \
  --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
{noformat}

On commit d72571e51d:

{noformat}
scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
start: Long = 1548297682225
res0: Long = 815978  <== 13.6 minutes

scala> {noformat}

On the previous commit (86100df54b):

{noformat}
scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
start: Long = 1548298927151
res0: Long = 50087  <== 50 seconds

scala> {noformat}

I also tried {{spark.read.option("inferTimestamp", false).json("utf8.json")}}, but that option did not seem to make a difference in the run time. Maybe I am using it incorrectly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org