Hi, Gavin,

Shuffling is exactly the same in both requests and is minimal. Both requests 
produce a single shuffle task. Running time is the only difference I can see 
in the metrics:

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
0.713730096817
  {
    "id" : 368,
    "name" : "duration total (min, med, max)",
    "value" : "524"
  }, {
    "id" : 375,
    "name" : "internal.metrics.executorRunTime",
    "value" : "527"
  }, {
    "id" : 391,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "244495"
  }

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
2.97951102257

  {
    "id" : 469,
    "name" : "duration total (min, med, max)",
    "value" : "2654"
  }, {
    "id" : 476,
    "name" : "internal.metrics.executorRunTime",
    "value" : "2661"
  }, {
    "id" : 492,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "371883"
  }
Full metrics are in the attachment, and are also pasted at the bottom of this 
mail.
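
If you want to diff the two dumps yourself, a minimal sketch like this works 
(assuming the stage dumps below are saved locally as stage57.json and 
stage58.json; the file names are just for illustration):

import json

# stage57.json / stage58.json are hypothetical local copies of the two
# stage dumps pasted at the bottom of this mail.
with open('stage57.json') as f:
    fast = json.load(f)[0]  # first (only) stage attempt
with open('stage58.json') as f:
    slow = json.load(f)[0]

# Print every numeric stage-level metric that differs between the runs.
for key in sorted(fast):
    a, b = fast[key], slow[key]
    if isinstance(a, (int, float)) and a != b:
        print key, a, b  # e.g. executorRunTime 527 2661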
>Saturday, September 3, 2016, 19:53 +03:00 from Gavin Yue <yue.yuany...@gmail.com>:
>
>Any shuffling? 
>
>
>On Sep 3, 2016, at 5:50 AM, Сергей Романов < romano...@inbox.ru.INVALID > 
>wrote:
>
>>The same problem happens with a CSV data file, so it's not Parquet-related either.
>>
>>Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>      /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
>>>>> import timeit
>>>>> from pyspark.sql.types import *
>>>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>>>> for x in range(50, 70): print x, 
>>>>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
>>>>> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>50 0.372850894928
>>51 0.376906871796
>>52 0.381325960159
>>53 0.385444164276
>>54 0.386877775192
>>55 0.388918161392
>>56 0.397624969482
>>57 0.391713142395
>>58 2.62714004517
>>59 2.68421196938
>>60 2.74627685547
>>61 2.81081581116
>>62 3.43532109261
>>63 3.07742786407
>>64 3.03904604912
>>65 3.01616096497
>>66 3.06293702126
>>67 3.09386610985
>>68 3.27610206604
>>69 3.2041969299
>>
>>Saturday, September 3, 2016, 15:40 +03:00 from Сергей Романов < romano...@inbox.ru.INVALID >:
>>>
>>>Hi,
>>>
>>>I have narrowed my problem down to a very simple case. I'm sending a 27 KB 
>>>Parquet file in the attachment (file:///data/dump/test2 in the example below).
>>>Please, can you take a look at it? Why is there a performance drop after 57 
>>>sum columns?
>>>Welcome to
>>>      ____              __
>>>     / __/__  ___ _____/ /__
>>>    _\ \/ _ \/ _ `/ __/  '_/
>>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>>      /_/
>>>
>>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>>SparkSession available as 'spark'.
>>>>>> import timeit
>>>>>> for x in range(70): print x, 
>>>>>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
>>>>>>  * x) ).collect, number=1)
>>>... 
>>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>>>details.
>>>0 1.05591607094
>>>1 0.200426101685
>>>2 0.203800916672
>>>3 0.176458120346
>>>4 0.184863805771
>>>5 0.232321023941
>>>6 0.216032981873
>>>7 0.201778173447
>>>8 0.292424917221
>>>9 0.228524923325
>>>10 0.190534114838
>>>11 0.197028160095
>>>12 0.270443916321
>>>13 0.429781913757
>>>14 0.270851135254
>>>15 0.776989936829
>>>16 0.233337879181
>>>17 0.227638959885
>>>18 0.212944030762
>>>19 0.2144780159
>>>20 0.22200012207
>>>21 0.262261152267
>>>22 0.254227876663
>>>23 0.275084018707
>>>24 0.292124032974
>>>25 0.280488014221
>>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
>>>since it was too large. This behavior can be adjusted by setting 
>>>'spark.debug.maxToStringFields' in SparkEnv.conf.
>>>26 0.290093898773
>>>27 0.238478899002
>>>28 0.246420860291
>>>29 0.241401195526
>>>30 0.255286931992
>>>31 0.42702794075
>>>32 0.327946186066
>>>33 0.434395074844
>>>34 0.314198970795
>>>35 0.34576010704
>>>36 0.278323888779
>>>37 0.289474964142
>>>38 0.290827989578
>>>39 0.376291036606
>>>40 0.347742080688
>>>41 0.363158941269
>>>42 0.318687915802
>>>43 0.376327991486
>>>44 0.374994039536
>>>45 0.362971067429
>>>46 0.425967931747
>>>47 0.370860099792
>>>48 0.443903923035
>>>49 0.374128103256
>>>50 0.378985881805
>>>51 0.476850986481
>>>52 0.451028823853
>>>53 0.432540893555
>>>54 0.514838933945
>>>55 0.53990483284
>>>56 0.449142932892
>>>57 0.465240001678 // 5x slower after 57 columns
>>>58 2.40412116051
>>>59 2.41632795334
>>>60 2.41812801361
>>>61 2.55726218224
>>>62 2.55484509468
>>>63 2.56128406525
>>>64 2.54642391205
>>>65 2.56381797791
>>>66 2.56871509552
>>>67 2.66187620163
>>>68 2.63496208191
>>>69 2.81545996666
>>>
>>>Sergei Romanov
>>>
>>Sergei Romanov
>><bad.csv.tgz>
>>

timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 57) ).collect, number=1)
0.713730096817

{
  "jobId" : 4,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "submissionTime" : "2016-09-05T10:51:45.764GMT",
  "completionTime" : "2016-09-05T10:51:46.327GMT",
  "stageIds" : [ 9, 8 ],
  "status" : "SUCCEEDED",
  "numTasks" : 2,
  "numActiveTasks" : 0,
  "numCompletedTasks" : 2,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numActiveStages" : 0,
  "numCompletedStages" : 2,
  "numSkippedStages" : 0,
  "numFailedStages" : 0
}
[ {
  "status" : "COMPLETE",
  "stageId" : 8,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 527,
  "submissionTime" : "2016-09-05T10:51:45.770GMT",
  "firstTaskLaunchedTime" : "2016-09-05T10:51:45.770GMT",
  "completionTime" : "2016-09-05T10:51:46.311GMT",
  "inputBytes" : 1538820,
  "inputRecords" : 769163,
  "outputBytes" : 0,
  "outputRecords" : 0,
  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 68,
  "shuffleWriteRecords" : 1,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "details" : 
"org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2512)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:211)\njava.lang.Thread.run(Thread.java:745)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ {
    "id" : 391,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "244495"
  }, {
    "id" : 373,
    "name" : "number of output rows",
    "value" : "769163"
  }, {
    "id" : 376,
    "name" : "internal.metrics.resultSize",
    "value" : "1871"
  }, {
    "id" : 369,
    "name" : "number of output rows",
    "value" : "1"
  }, {
    "id" : 390,
    "name" : "internal.metrics.shuffle.write.recordsWritten",
    "value" : "1"
  }, {
    "id" : 372,
    "name" : "aggregate time total (min, med, max)",
    "value" : "524"
  }, {
    "id" : 375,
    "name" : "internal.metrics.executorRunTime",
    "value" : "527"
  }, {
    "id" : 393,
    "name" : "internal.metrics.input.recordsRead",
    "value" : "769163"
  }, {
    "id" : 392,
    "name" : "internal.metrics.input.bytesRead",
    "value" : "1538820"
  }, {
    "id" : 377,
    "name" : "internal.metrics.jvmGCTime",
    "value" : "4"
  }, {
    "id" : 368,
    "name" : "duration total (min, med, max)",
    "value" : "524"
  }, {
    "id" : 389,
    "name" : "internal.metrics.shuffle.write.bytesWritten",
    "value" : "68"
  }, {
    "id" : 362,
    "name" : "data size total (min, med, max)",
    "value" : "462"
  }, {
    "id" : 374,
    "name" : "internal.metrics.executorDeserializeTime",
    "value" : "9"
  } ],
  "tasks" : {
    "7" : {
      "taskId" : 7,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2016-09-05T10:51:45.770GMT",
      "executorId" : "driver",
      "host" : "localhost",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 9,
        "executorRunTime" : 527,
        "resultSize" : 1871,
        "jvmGcTime" : 4,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 1538820,
          "recordsRead" : 769163
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 68,
          "writeTime" : 244495,
          "recordsWritten" : 1
        }
      }
    }
  },
  "executorSummary" : {
    "driver" : {
      "taskTime" : 540,
      "failedTasks" : 0,
      "succeededTasks" : 1,
      "inputBytes" : 1538820,
      "outputBytes" : 0,
      "shuffleRead" : 0,
      "shuffleWrite" : 68,
      "memoryBytesSpilled" : 0,
      "diskBytesSpilled" : 0
    }
  }
} ]

[ {
  "status" : "COMPLETE",
  "stageId" : 9,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 2,
  "submissionTime" : "2016-09-05T10:51:46.315GMT",
  "firstTaskLaunchedTime" : "2016-09-05T10:51:46.315GMT",
  "completionTime" : "2016-09-05T10:51:46.327GMT",
  "inputBytes" : 0,
  "inputRecords" : 0,
  "outputBytes" : 0,
  "outputRecords" : 0,
  "shuffleReadBytes" : 68,
  "shuffleReadRecords" : 1,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "details" : 
"org.apache.spark.rdd.RDD.collect(RDD.scala:892)\norg.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2513)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2513)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2513)\norg.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)\norg.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)\norg.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2512)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:211)\njava.lang.Thread.run(Thread.java:745)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ {
    "id" : 427,
    "name" : "internal.metrics.shuffle.read.remoteBlocksFetched",
    "value" : "0"
  }, {
    "id" : 364,
    "name" : "number of output rows",
    "value" : "1"
  }, {
    "id" : 418,
    "name" : "internal.metrics.executorDeserializeTime",
    "value" : "6"
  }, {
    "id" : 430,
    "name" : "internal.metrics.shuffle.read.localBytesRead",
    "value" : "68"
  }, {
    "id" : 420,
    "name" : "internal.metrics.resultSize",
    "value" : "6543"
  }, {
    "id" : 429,
    "name" : "internal.metrics.shuffle.read.remoteBytesRead",
    "value" : "0"
  }, {
    "id" : 432,
    "name" : "internal.metrics.shuffle.read.recordsRead",
    "value" : "1"
  }, {
    "id" : 363,
    "name" : "duration total (min, med, max)",
    "value" : "0"
  }, {
    "id" : 428,
    "name" : "internal.metrics.shuffle.read.localBlocksFetched",
    "value" : "1"
  }, {
    "id" : 419,
    "name" : "internal.metrics.executorRunTime",
    "value" : "2"
  }, {
    "id" : 431,
    "name" : "internal.metrics.shuffle.read.fetchWaitTime",
    "value" : "0"
  } ],
  "tasks" : {
    "8" : {
      "taskId" : 8,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2016-09-05T10:51:46.315GMT",
      "executorId" : "driver",
      "host" : "localhost",
      "taskLocality" : "ANY",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 6,
        "executorRunTime" : 2,
        "resultSize" : 6543,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 1,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "localBytesRead" : 68,
          "recordsRead" : 1
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      }
    }
  },
  "executorSummary" : {
    "driver" : {
      "taskTime" : 11,
      "failedTasks" : 0,
      "succeededTasks" : 1,
      "inputBytes" : 0,
      "outputBytes" : 0,
      "shuffleRead" : 68,
      "shuffleWrite" : 0,
      "memoryBytesSpilled" : 0,
      "diskBytesSpilled" : 0
    }
  }
} ]


(full metrics for the 58-column run below)


timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
schema=schema).groupBy().sum(*(['dd_convs'] * 58) ).collect, number=1)
2.97951102257

{
  "jobId" : 5,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "submissionTime" : "2016-09-05T10:51:46.670GMT",
  "completionTime" : "2016-09-05T10:51:49.372GMT",
  "stageIds" : [ 10, 11 ],
  "status" : "SUCCEEDED",
  "numTasks" : 2,
  "numActiveTasks" : 0,
  "numCompletedTasks" : 2,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numActiveStages" : 0,
  "numCompletedStages" : 2,
  "numSkippedStages" : 0,
  "numFailedStages" : 0
}
[ {
  "status" : "COMPLETE",
  "stageId" : 10,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 2661,
  "submissionTime" : "2016-09-05T10:51:46.677GMT",
  "firstTaskLaunchedTime" : "2016-09-05T10:51:46.677GMT",
  "completionTime" : "2016-09-05T10:51:49.351GMT",
  "inputBytes" : 1538820,
  "inputRecords" : 769163,
  "outputBytes" : 0,
  "outputRecords" : 0,
  "shuffleReadBytes" : 0,
  "shuffleReadRecords" : 0,
  "shuffleWriteBytes" : 68,
  "shuffleWriteRecords" : 1,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "details" : 
"org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2512)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:211)\njava.lang.Thread.run(Thread.java:745)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ {
    "id" : 478,
    "name" : "internal.metrics.jvmGCTime",
    "value" : "5"
  }, {
    "id" : 469,
    "name" : "duration total (min, med, max)",
    "value" : "2654"
  }, {
    "id" : 463,
    "name" : "data size total (min, med, max)",
    "value" : "470"
  }, {
    "id" : 490,
    "name" : "internal.metrics.shuffle.write.bytesWritten",
    "value" : "68"
  }, {
    "id" : 492,
    "name" : "internal.metrics.shuffle.write.writeTime",
    "value" : "371883"
  }, {
    "id" : 474,
    "name" : "number of output rows",
    "value" : "769163"
  }, {
    "id" : 477,
    "name" : "internal.metrics.resultSize",
    "value" : "1871"
  }, {
    "id" : 470,
    "name" : "number of output rows",
    "value" : "1"
  }, {
    "id" : 473,
    "name" : "aggregate time total (min, med, max)",
    "value" : "2654"
  }, {
    "id" : 491,
    "name" : "internal.metrics.shuffle.write.recordsWritten",
    "value" : "1"
  }, {
    "id" : 476,
    "name" : "internal.metrics.executorRunTime",
    "value" : "2661"
  }, {
    "id" : 494,
    "name" : "internal.metrics.input.recordsRead",
    "value" : "769163"
  }, {
    "id" : 493,
    "name" : "internal.metrics.input.bytesRead",
    "value" : "1538820"
  }, {
    "id" : 475,
    "name" : "internal.metrics.executorDeserializeTime",
    "value" : "9"
  } ],
  "tasks" : {
    "9" : {
      "taskId" : 9,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2016-09-05T10:51:46.677GMT",
      "executorId" : "driver",
      "host" : "localhost",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 9,
        "executorRunTime" : 2661,
        "resultSize" : 1871,
        "jvmGcTime" : 5,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 1538820,
          "recordsRead" : 769163
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 68,
          "writeTime" : 371883,
          "recordsWritten" : 1
        }
      }
    }
  },
  "executorSummary" : {
    "driver" : {
      "taskTime" : 2674,
      "failedTasks" : 0,
      "succeededTasks" : 1,
      "inputBytes" : 1538820,
      "outputBytes" : 0,
      "shuffleRead" : 0,
      "shuffleWrite" : 68,
      "memoryBytesSpilled" : 0,
      "diskBytesSpilled" : 0
    }
  }
} ]

[ {
  "status" : "COMPLETE",
  "stageId" : 11,
  "attemptId" : 0,
  "numActiveTasks" : 0,
  "numCompleteTasks" : 1,
  "numFailedTasks" : 0,
  "executorRunTime" : 8,
  "submissionTime" : "2016-09-05T10:51:49.355GMT",
  "firstTaskLaunchedTime" : "2016-09-05T10:51:49.355GMT",
  "completionTime" : "2016-09-05T10:51:49.372GMT",
  "inputBytes" : 0,
  "inputRecords" : 0,
  "outputBytes" : 0,
  "outputRecords" : 0,
  "shuffleReadBytes" : 68,
  "shuffleReadRecords" : 1,
  "shuffleWriteBytes" : 0,
  "shuffleWriteRecords" : 0,
  "memoryBytesSpilled" : 0,
  "diskBytesSpilled" : 0,
  "name" : "collect at /usr/lib/python2.7/timeit.py:100",
  "details" : 
"org.apache.spark.rdd.RDD.collect(RDD.scala:892)\norg.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2513)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2513)\norg.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2513)\norg.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)\norg.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)\norg.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2512)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:211)\njava.lang.Thread.run(Thread.java:745)",
  "schedulingPool" : "default",
  "accumulatorUpdates" : [ {
    "id" : 532,
    "name" : "internal.metrics.shuffle.read.fetchWaitTime",
    "value" : "0"
  }, {
    "id" : 528,
    "name" : "internal.metrics.shuffle.read.remoteBlocksFetched",
    "value" : "0"
  }, {
    "id" : 465,
    "name" : "number of output rows",
    "value" : "1"
  }, {
    "id" : 519,
    "name" : "internal.metrics.executorDeserializeTime",
    "value" : "6"
  }, {
    "id" : 531,
    "name" : "internal.metrics.shuffle.read.localBytesRead",
    "value" : "68"
  }, {
    "id" : 521,
    "name" : "internal.metrics.resultSize",
    "value" : "6623"
  }, {
    "id" : 530,
    "name" : "internal.metrics.shuffle.read.remoteBytesRead",
    "value" : "0"
  }, {
    "id" : 533,
    "name" : "internal.metrics.shuffle.read.recordsRead",
    "value" : "1"
  }, {
    "id" : 464,
    "name" : "duration total (min, med, max)",
    "value" : "0"
  }, {
    "id" : 520,
    "name" : "internal.metrics.executorRunTime",
    "value" : "8"
  }, {
    "id" : 529,
    "name" : "internal.metrics.shuffle.read.localBlocksFetched",
    "value" : "1"
  } ],
  "tasks" : {
    "10" : {
      "taskId" : 10,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2016-09-05T10:51:49.355GMT",
      "executorId" : "driver",
      "host" : "localhost",
      "taskLocality" : "ANY",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 6,
        "executorRunTime" : 8,
        "resultSize" : 6623,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 1,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "localBytesRead" : 68,
          "recordsRead" : 1
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      }
    }
  },
  "executorSummary" : {
    "driver" : {
      "taskTime" : 16,
      "failedTasks" : 0,
      "succeededTasks" : 1,
      "inputBytes" : 0,
      "outputBytes" : 0,
      "shuffleRead" : 68,
      "shuffleWrite" : 0,
      "memoryBytesSpilled" : 0,
      "diskBytesSpilled" : 0
    }
  }
} ]
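
By the way, these dumps come from the REST monitoring API on the driver UI; a 
sketch like this pulls the same numbers (assuming a local driver on the 
default port 4040):

import json
import urllib2  # Python 2.7, same interpreter as the session above

BASE = 'http://localhost:4040/api/v1'

def get(path):
    return json.load(urllib2.urlopen(BASE + path))

# Grab the running application's id, then walk the stages of one job
# (taking the first job in the list is an assumption about ordering).
app_id = get('/applications')[0]['id']
job = get('/applications/%s/jobs' % app_id)[0]
for stage_id in job['stageIds']:
    # each stage endpoint returns a list of attempts
    for stage in get('/applications/%s/stages/%d' % (app_id, stage_id)):
        print stage['stageId'], stage['executorRunTime'], stage['shuffleWriteBytes']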
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
