[jira] [Commented] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525144#comment-15525144 ] Gang Wu commented on SPARK-17672: - They are similar but different. This JIRA deals with the approach to getting one specific appId from the whole list (returned from the map). SPARK-17671 deals with the number of app infos to fetch from the map. > Spark 2.0 history server web UI takes too long for a single application > --- > > Key: SPARK-17672 > URL: https://issues.apache.org/jira/browse/SPARK-17672 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When there are 10K application histories in the history server back end, it can > take a very long time to even get a single application history page. After > some investigation, I found the root cause was the following piece of code: > {code:title=OneApplicationResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class OneApplicationResource(uiRoot: UIRoot) { > @GET > def getApp(@PathParam("appId") appId: String): ApplicationInfo = { > val apps = uiRoot.getApplicationInfoList.find { _.id == appId } > apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) > } > } > {code} > Although all application history infos are stored in a LinkedHashMap, here the > code transforms the map to an iterator and then uses the find() API, which is > O(n), instead of an O(1) map.get() operation.
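A minimal sketch of the O(1) direction described above, assuming a hypothetical {{UIRoot.getApplicationInfo(appId)}} accessor that delegates straight to the underlying LinkedHashMap; the accessor name is illustrative, not the actual Spark API.

{code:title=OneApplicationResource.scala (sketch)|borderStyle=solid}
import javax.ws.rs.{GET, PathParam, Produces}
import javax.ws.rs.core.MediaType

@Produces(Array(MediaType.APPLICATION_JSON))
private[v1] class OneApplicationResource(uiRoot: UIRoot) {

  @GET
  def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
    // Hypothetical accessor backed by the LinkedHashMap: an O(1) map.get
    // instead of materializing the whole list and scanning it with find().
    uiRoot.getApplicationInfo(appId)
      .getOrElse(throw new NotFoundException("unknown app: " + appId))
  }
}
{code}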
[jira] [Commented] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524593#comment-15524593 ] Gang Wu commented on SPARK-17672: - Hi [~ajbozarth], can you take a look at the PR? Thanks! > Spark 2.0 history server web UI takes too long for a single application > --- > > Key: SPARK-17672 > URL: https://issues.apache.org/jira/browse/SPARK-17672 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When there are 10K application histories in the history server back end, it can > take a very long time to even get a single application history page. After > some investigation, I found the root cause was the following piece of code: > {code:title=OneApplicationResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class OneApplicationResource(uiRoot: UIRoot) { > @GET > def getApp(@PathParam("appId") appId: String): ApplicationInfo = { > val apps = uiRoot.getApplicationInfoList.find { _.id == appId } > apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) > } > } > {code} > Although all application history infos are stored in a LinkedHashMap, here the > code transforms the map to an iterator and then uses the find() API, which is > O(n), instead of an O(1) map.get() operation.
[jira] [Commented] (SPARK-17671) Spark 2.0 history server summary page is slow even with spark.history.ui.maxApplications set
[ https://issues.apache.org/jira/browse/SPARK-17671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524595#comment-15524595 ] Gang Wu commented on SPARK-17671: - Hi [~ajbozarth], can you take a look at the PR? Thanks! > Spark 2.0 history server summary page is slow even with > spark.history.ui.maxApplications set > --- > > Key: SPARK-17671 > URL: https://issues.apache.org/jira/browse/SPARK-17671 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > This is a follow-up task to > [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the > fix of SPARK-17243 (limit the number of applications in the JSON string > transferred from the history server backend to the web UI frontend), the history > server does display the target number of history summaries. > However, when there are more than 10k application histories, it still gets > slower and slower. The problem is in the following code: > {code:title=ApplicationListResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class ApplicationListResource(uiRoot: UIRoot) { > @GET > def appList( > @QueryParam("status") status: JList[ApplicationStatus], > @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: > SimpleDateParam, > @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: > SimpleDateParam, > @QueryParam("limit") limit: Integer) > : Iterator[ApplicationInfo] = { > // although there is a limit operation at the end > // the following line still does a transformation for all history > // in the list > val allApps = uiRoot.getApplicationInfoList > > // ... > // irrelevant code is omitted > // ... > if (limit != null) { > appList.take(limit) > } else { > appList > } > } > } > {code} > What the code {{uiRoot.getApplicationInfoList}} does is transform every > application history from class ApplicationHistoryInfo to class > ApplicationInfo. So if there are 10k applications, 10k transformations will > be done even though we have limited it to 5000 jobs here.
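A sketch of the direction implied above: defer the per-application conversion until after the limit is applied, so at most {{limit}} conversions actually run. Here {{toApplicationInfo}} is an illustrative name for the ApplicationHistoryInfo to ApplicationInfo conversion, not an actual Spark function.

{code:title=LimitBeforeConvert.scala (sketch)|borderStyle=solid}
// Scala Iterators are lazy: map() does no work until elements are pulled,
// so take(limit) bounds the number of conversions actually performed.
def appList(
    histories: Seq[ApplicationHistoryInfo],
    limit: Integer): Iterator[ApplicationInfo] = {
  val converted = histories.iterator.map(toApplicationInfo) // nothing converted yet
  if (limit != null) converted.take(limit) else converted
}
{code}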
[jira] [Commented] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524424#comment-15524424 ] Gang Wu commented on SPARK-17672: - I'm working on a fix and will send a PR soon. > Spark 2.0 history server web UI takes too long for a single application > --- > > Key: SPARK-17672 > URL: https://issues.apache.org/jira/browse/SPARK-17672 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When there are 10K application histories in the history server back end, it can > take a very long time to even get a single application history page. After > some investigation, I found the root cause was the following piece of code: > {code:title=OneApplicationResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class OneApplicationResource(uiRoot: UIRoot) { > @GET > def getApp(@PathParam("appId") appId: String): ApplicationInfo = { > val apps = uiRoot.getApplicationInfoList.find { _.id == appId } > apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) > } > } > {code} > Although all application history infos are stored in a LinkedHashMap, here the > code transforms the map to an iterator and then uses the find() API, which is > O(n), instead of an O(1) map.get() operation.
[jira] [Updated] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
[ https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17672: Description: When there are 10K application histories in the history server back end, it can take a very long time to even get a single application history page. After some investigation, I found the root cause was the following piece of code: {code:title=OneApplicationResource.scala|borderStyle=solid} @Produces(Array(MediaType.APPLICATION_JSON)) private[v1] class OneApplicationResource(uiRoot: UIRoot) { @GET def getApp(@PathParam("appId") appId: String): ApplicationInfo = { val apps = uiRoot.getApplicationInfoList.find { _.id == appId } apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) } } {code} Although all application history infos are stored in a LinkedHashMap, here the code transforms the map to an iterator and then uses the find() API, which is O(n), instead of an O(1) map.get() operation. was: When there are 10K application histories in the history server back end, it can take a very long time to even get a single application history page. After some investigation, I found the root cause was the following piece of code: {code:title=OneApplicationResource.scala|borderStyle=solid} @Produces(Array(MediaType.APPLICATION_JSON)) private[v1] class OneApplicationResource(uiRoot: UIRoot) { @GET def getApp(@PathParam("appId") appId: String): ApplicationInfo = { val apps = uiRoot.getApplicationInfoList.find { _.id == appId } apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) } } {code} Although all application history infos are stored in a LinkedHashMap, here the code transforms the map to an iterator and then uses the find() API, which is O(n), instead of an O(1) map.get() operation. > Spark 2.0 history server web UI takes too long for a single application > --- > > Key: SPARK-17672 > URL: https://issues.apache.org/jira/browse/SPARK-17672 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When there are 10K application histories in the history server back end, it can > take a very long time to even get a single application history page. After > some investigation, I found the root cause was the following piece of code: > {code:title=OneApplicationResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class OneApplicationResource(uiRoot: UIRoot) { > @GET > def getApp(@PathParam("appId") appId: String): ApplicationInfo = { > val apps = uiRoot.getApplicationInfoList.find { _.id == appId } > apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) > } > } > {code} > Although all application history infos are stored in a LinkedHashMap, here the > code transforms the map to an iterator and then uses the find() API, which is > O(n), instead of an O(1) map.get() operation.
[jira] [Created] (SPARK-17672) Spark 2.0 history server web UI takes too long for a single application
Gang Wu created SPARK-17672: --- Summary: Spark 2.0 history server web UI takes too long for a single application Key: SPARK-17672 URL: https://issues.apache.org/jira/browse/SPARK-17672 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: Gang Wu When there are 10K application histories in the history server back end, it can take a very long time to even get a single application history page. After some investigation, I found the root cause was the following piece of code: {code:title=OneApplicationResource.scala|borderStyle=solid} @Produces(Array(MediaType.APPLICATION_JSON)) private[v1] class OneApplicationResource(uiRoot: UIRoot) { @GET def getApp(@PathParam("appId") appId: String): ApplicationInfo = { val apps = uiRoot.getApplicationInfoList.find { _.id == appId } apps.getOrElse(throw new NotFoundException("unknown app: " + appId)) } } {code} Although all application history infos are stored in a LinkedHashMap, here the code transforms the map to an iterator and then uses the find() API, which is O(n), instead of an O(1) map.get() operation.
[jira] [Created] (SPARK-17671) Spark 2.0 history server summary page is slow even with spark.history.ui.maxApplications set
Gang Wu created SPARK-17671: --- Summary: Spark 2.0 history server summary page is slow even with spark.history.ui.maxApplications set Key: SPARK-17671 URL: https://issues.apache.org/jira/browse/SPARK-17671 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Reporter: Gang Wu This is a follow-up task to [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the fix of SPARK-17243 (limit the number of applications in the JSON string transferred from the history server backend to the web UI frontend), the history server does display the target number of history summaries. However, when there are more than 10k application histories, it still gets slower and slower. The problem is in the following code: {code:title=ApplicationListResource.scala|borderStyle=solid} @Produces(Array(MediaType.APPLICATION_JSON)) private[v1] class ApplicationListResource(uiRoot: UIRoot) { @GET def appList( @QueryParam("status") status: JList[ApplicationStatus], @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: SimpleDateParam, @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: SimpleDateParam, @QueryParam("limit") limit: Integer) : Iterator[ApplicationInfo] = { // although there is a limit operation at the end // the following line still does a transformation for all history // in the list val allApps = uiRoot.getApplicationInfoList // ... // irrelevant code is omitted // ... if (limit != null) { appList.take(limit) } else { appList } } } {code} What the code {{uiRoot.getApplicationInfoList}} does is transform every application history from class ApplicationHistoryInfo to class ApplicationInfo. So if there are 10k applications, 10k transformations will be done even though we have limited it to 5000 jobs here.
[jira] [Commented] (SPARK-17671) Spark 2.0 history server summary page is slow even with spark.history.ui.maxApplications set
[ https://issues.apache.org/jira/browse/SPARK-17671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524173#comment-15524173 ] Gang Wu commented on SPARK-17671: - I'm working on this and will send a pull request soon. > Spark 2.0 history server summary page is slow even with > spark.history.ui.maxApplications set > --- > > Key: SPARK-17671 > URL: https://issues.apache.org/jira/browse/SPARK-17671 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Gang Wu > > This is a follow-up task to > [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the > fix of SPARK-17243 (limit the number of applications in the JSON string > transferred from the history server backend to the web UI frontend), the history > server does display the target number of history summaries. > However, when there are more than 10k application histories, it still gets > slower and slower. The problem is in the following code: > {code:title=ApplicationListResource.scala|borderStyle=solid} > @Produces(Array(MediaType.APPLICATION_JSON)) > private[v1] class ApplicationListResource(uiRoot: UIRoot) { > @GET > def appList( > @QueryParam("status") status: JList[ApplicationStatus], > @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: > SimpleDateParam, > @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: > SimpleDateParam, > @QueryParam("limit") limit: Integer) > : Iterator[ApplicationInfo] = { > // although there is a limit operation at the end > // the following line still does a transformation for all history > // in the list > val allApps = uiRoot.getApplicationInfoList > > // ... > // irrelevant code is omitted > // ... > if (limit != null) { > appList.take(limit) > } else { > appList > } > } > } > {code} > What the code {{uiRoot.getApplicationInfoList}} does is transform every > application history from class ApplicationHistoryInfo to class > ApplicationInfo. So if there are 10k applications, 10k transformations will > be done even though we have limited it to 5000 jobs here.
[jira] [Commented] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long
[ https://issues.apache.org/jira/browse/SPARK-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514019#comment-15514019 ] Gang Wu commented on SPARK-17601: - [~hyukjin.kwon] Yes I agree. I just created these JIRAs for issues we met in production. I think there can definitely be more issues for ORC, Parquet, etc. Schema evolution is always painful to tackle. It seems that you are working on this. Do you mind telling me a little bit more about your plan there? I'd like to know. Thanks! > SparkSQL vectorization cannot handle schema evolution for parquet tables when > parquet files use Int whereas DataFrame uses Long > --- > > Key: SPARK-17601 > URL: https://issues.apache.org/jira/browse/SPARK-17601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > This is a JIRA related to SPARK-17477. > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). If we > use vectorization in SparkSQL then we will get the following exception: > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189) > at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562) > at org.apache.spark.sql.Dataset.head(Dataset.scala:1924) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2139) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) > at org.apache.spark.sql.Dataset.show(Dataset.scala:526) > at org.apache.spark.sql.Dataset.show(Dataset.scala:486) > at org.apache.spark.sql.Dataset.show(Dataset.scala:495) > ... 48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) >
[jira] [Created] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long
Gang Wu created SPARK-17601: --- Summary: SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long Key: SPARK-17601 URL: https://issues.apache.org/jira/browse/SPARK-17601 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Gang Wu This is a JIRA related to SPARK-17477. When using SparkSession to read a Hive table which is stored as parquet files, if there has been a schema evolution from int to long of a column, there may be some old parquet files that use int for the column while some new parquet files use long. In the Hive metastore, the type is long (bigint). If we use vectorization in SparkSQL then we will get the following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562) at org.apache.spark.sql.Dataset.head(Dataset.scala:1924) at org.apache.spark.sql.Dataset.take(Dataset.scala:2139) at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at org.apache.spark.sql.Dataset.show(Dataset.scala:526) at org.apache.spark.sql.Dataset.show(Dataset.scala:486) at org.apache.spark.sql.Dataset.show(Dataset.scala:495) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at
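To isolate which read path is failing, one can toggle the vectorized parquet reader; the config key below exists in Spark 2.0, and a SparkSession {{spark}} is assumed to be in scope. Note that, per SPARK-17477 below, the row-based path hits a ClassCastException on the same int-vs-long mismatch, so toggling changes the failure mode rather than avoiding it; the table name is hypothetical.

{code:title=ToggleVectorizedReader.scala (sketch)|borderStyle=solid}
// Turn the vectorized parquet reader off to compare failure modes:
// NPE in OnHeapColumnVector.getLong (vectorized path, SPARK-17601) vs
// ClassCastException in ParquetRowConverter (row-based path, SPARK-17477).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from evolved_table").show() // hypothetical table name
{code}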
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494681#comment-15494681 ] Gang Wu commented on SPARK-17477: - Just confirmed that this also doesn't work with the vectorized reader. What I did is as follows: 1. Created a flat hive table with schema "name: String, id: Long", but the parquet file, which contains 100 rows, uses "name: String, id: Int". 2. Then just did a query "select * from table" and showed the result. It works fine with DataFrame.count and DataFrame.printSchema(). Got the following exception: Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562) at org.apache.spark.sql.Dataset.head(Dataset.scala:1924) at org.apache.spark.sql.Dataset.take(Dataset.scala:2139) at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at org.apache.spark.sql.Dataset.show(Dataset.scala:526) at org.apache.spark.sql.Dataset.show(Dataset.scala:486) at org.apache.spark.sql.Dataset.show(Dataset.scala:495) ... 
48 elided Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int
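A hedged, self-contained approximation of the reproduction described above: write a parquet file whose {{id}} column is int, then read it back while declaring {{id}} as long, mimicking a metastore that says bigint. The local path and column names are illustrative, and this mimics rather than exactly replays the Hive-table setup.

{code:title=IntVsLongRepro.scala (sketch)|borderStyle=solid}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

val path = "/tmp/int_vs_long_parquet"
// 100 rows whose id column is written to parquet as Int
(1 to 100).map(i => ("name" + i, i)).toDF("name", "id")
  .write.mode("overwrite").parquet(path)

// Declare id as Long, the way a bigint column in the metastore would
val declared = StructType(Seq(
  StructField("name", StringType),
  StructField("id", LongType)))

val df = spark.read.schema(declared).parquet(path)
df.printSchema()    // fine, as reported above
println(df.count()) // fine, as reported above
df.show()           // expected to fail with the NullPointerException above
{code}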
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484629#comment-15484629 ] Gang Wu commented on SPARK-17477: - [~hyukjin.kwon] I agree with you. But both issues are targeting parquet data sources; I think it applies to all data sources. > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as their type while the Hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). > Therefore, when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > 
org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at >
[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17477: Target Version/s: (was: 2.1.0) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as their type while the Hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). > Therefore, when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) > ... 22 more > {quote} > But this kind of schema evolution (int =>
[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17477: Shepherd: (was: Gang Wu) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as their type while the Hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). > Therefore, when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) > ... 22 more > {quote} > But this kind of schema evolution (int => long)
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477945#comment-15477945 ] Gang Wu commented on SPARK-17477: - I'm working on a fix for this issue. Will send a pull request soon. > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as their type while the Hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). > Therefore, when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) >
[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17477: Shepherd: Gang Wu > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as their type while the Hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files, if there has been a schema evolution from int to long of a column, > there may be some old parquet files that use int for the column while some new > parquet files use long. In the Hive metastore, the type is long (bigint). > Therefore, when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) > ... 22 more > {quote} > But this kind of schema evolution (int => long) is valid
[jira] [Created] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type
Gang Wu created SPARK-17477: --- Summary: SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as their type while the Hive metastore has Long as its type Key: SPARK-17477 URL: https://issues.apache.org/jira/browse/SPARK-17477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Gang Wu When using SparkSession to read a Hive table which is stored as parquet files, if there has been a schema evolution from int to long of a column, there may be some old parquet files that use int for the column while some new parquet files use long. In the Hive metastore, the type is long (bigint). Therefore, when I use the following: {quote} sparkSession.sql("select * from table").show() {quote} I got the following exception: {quote} 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in file hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) ... 22 more {quote} But this kind of schema evolution (int => long) is valid in Hive and Presto.
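Since int => long is accepted as a widening by Hive and Presto, one hedged workaround on the Spark side is to bypass the metastore schema: read old and new files with their own file schemas and upcast before unioning. This sketch assumes a SparkSession {{spark}} in scope and that the old (int) and new (long) files can be separated by path; the paths and the column name are illustrative.

{code:title=WidenThenUnion.scala (sketch)|borderStyle=solid}
import org.apache.spark.sql.functions.col

// Old files wrote id as int; cast to long to match the newer files,
// mirroring the int => long widening that Hive and Presto apply.
val oldPart = spark.read.parquet("/data/table/old")
  .withColumn("id", col("id").cast("long"))
val newPart = spark.read.parquet("/data/table/new")
val all = oldPart.union(newPart)
all.show()
{code}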
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447285#comment-15447285 ]

Gang Wu commented on SPARK-17243:
-

Yup, you're right. I finally found some app_ids that were not on the summary page but whose URLs can still be accessed. Our cluster has 100K+ app_ids, so it took me a long time to figure that out. Thanks for your help!

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447172#comment-15447172 ]

Gang Wu commented on SPARK-17243:
-

I pulled in the latest change. I can get the full application list from the REST endpoint /api/v1/applications (without the limit parameter). However, the web UI reports that the app_id is not found when I go to a specific app_id directly, while the Spark 1.5 history server can still serve it.
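A quick way to exercise the endpoint directly (a sketch; the host and port are placeholders):

{code:title=RestCheck.scala (sketch)|borderStyle=solid}
// Fetch the application list straight from the history server REST API.
// Without ?limit= the endpoint returns every application it knows about;
// with ?limit=N it returns only the first N.
val all  = scala.io.Source.fromURL("http://historyserver:18080/api/v1/applications").mkString
val topN = scala.io.Source.fromURL("http://historyserver:18080/api/v1/applications?limit=10").mkString
println(topN)
{code}

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org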
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447099#comment-15447099 ]

Gang Wu commented on SPARK-17243:
-

I've tested this PR. It does reduce the length of the application metadata list. I think it is meant to restrict only the summary page; applications dropped from the summary web UI should still be reachable through their URLs, e.g. http://x.x.x.x:18080/history/application_id/jobs. However, those dropped applications cannot be accessed at all, which may heavily reduce the usability of the history server.
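One way to check this behavior (a sketch; host, port, and application_id are the placeholders from the comment above):

{code:title=ProbeDroppedApp.scala (sketch)|borderStyle=solid}
import java.net.{HttpURLConnection, URL}

// Request the per-application UI of an app that was dropped from the summary page.
val conn = new URL("http://x.x.x.x:18080/history/application_id/jobs")
  .openConnection().asInstanceOf[HttpURLConnection]
// 200 means the UI is still served; 404 is the usability regression described here.
println(conn.getResponseCode)
conn.disconnect()
{code}

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org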
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439517#comment-15439517 ]

Gang Wu commented on SPARK-17243:
-

Thanks [~ajbozarth]! Let me know when it is done.

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437691#comment-15437691 ]

Gang Wu commented on SPARK-17243:
-

This doesn't work. That setting controls the cache of rendered web UIs, not the application metadata. Its default value of 50 is already small enough.
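The setting under discussion is presumably spark.history.retainedApplications, whose default of 50 matches the comment; a sketch of where it would be set:

{code:title=spark-defaults.conf (sketch)|borderStyle=solid}
# Presumed setting under discussion: caps how many reconstructed application
# web UIs the history server keeps in memory. It does not limit the application
# metadata list behind the summary page.
spark.history.retainedApplications  50
{code}

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org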
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437688#comment-15437688 ]

Gang Wu commented on SPARK-17243:
-

Hi Alex, I think the Spark 1.5 history server obtains all application summary metadata directly from the FsHistoryProvider class; you can check HistoryPage.scala. In Spark 2.0 the page instead works with a JSON string (in historypage.js), which is MUCH slower than before. Would it make sense to go back to the old way?
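To make the contrast concrete, a self-contained sketch of the two rendering paths; this is not Spark source, and the names are illustrative:

{code:title=RenderingPaths.scala (sketch)|borderStyle=solid}
case class AppInfo(id: String, name: String)

// Spark 1.5 style: the server renders only the rows the page needs,
// iterating the provider's listing directly.
def renderServerSide(apps: Iterator[AppInfo], pageSize: Int): String =
  apps.take(pageSize)
    .map(a => s"<tr><td>${a.id}</td><td>${a.name}</td></tr>")
    .mkString("\n")

// Spark 2.0 style: every record is serialized to JSON and shipped to the
// browser, where historypage.js must parse and render all of it.
def renderClientSide(apps: Iterator[AppInfo]): String =
  apps.map(a => s"""{"id":"${a.id}","name":"${a.name}"}""")
    .mkString("[", ",", "]")
{code}

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org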
[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Wu updated SPARK-17243:

Summary: Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history (was: Spark history server summary page gets stuck at "loading history summary" with 10K+ application history)

> Spark 2.0 history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
> Priority: Blocker
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history
Gang Wu created SPARK-17243:
---

Summary: Spark history server summary page gets stuck at "loading history summary" with 10K+ application history
Key: SPARK-17243
URL: https://issues.apache.org/jira/browse/SPARK-17243
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 2.0.0
Environment: Linux
Reporter: Gang Wu
Priority: Blocker

The summary page of the Spark history server web UI keeps displaying "Loading history summary..." indefinitely and crashes the browser when there are more than 10K application history event logs on HDFS.

I did some investigation: the "historypage.js" file sends a REST request to the /api/v1/applications endpoint of the history server and gets back a JSON response. When there are more than 10K applications in the event log directory, it takes forever to parse them and render the page. With only hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gang Wu updated SPARK-17243:

Description:
The summary page of the Spark 2.0 history server web UI keeps displaying "Loading history summary..." indefinitely and crashes the browser when there are more than 10K application history event logs on HDFS.
I did some investigation: the "historypage.js" file sends a REST request to the /api/v1/applications endpoint of the history server and gets back a JSON response. When there are more than 10K applications in the event log directory, it takes forever to parse them and render the page. With only hundreds or thousands of application histories it runs fine.

was:
The summary page of the Spark history server web UI keeps displaying "Loading history summary..." indefinitely and crashes the browser when there are more than 10K application history event logs on HDFS.
I did some investigation: the "historypage.js" file sends a REST request to the /api/v1/applications endpoint of the history server and gets back a JSON response. When there are more than 10K applications in the event log directory, it takes forever to parse them and render the page. With only hundreds or thousands of application histories it runs fine.

> Spark history server summary page gets stuck at "loading history summary"
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.0.0
> Environment: Linux
> Reporter: Gang Wu
> Priority: Blocker
>
> The summary page of the Spark 2.0 history server web UI keeps displaying
> "Loading history summary..." indefinitely and crashes the browser when there
> are more than 10K application history event logs on HDFS.
> I did some investigation: the "historypage.js" file sends a REST request to
> the /api/v1/applications endpoint of the history server and gets back a JSON
> response. When there are more than 10K applications in the event log
> directory, it takes forever to parse them and render the page. With only
> hundreds or thousands of application histories it runs fine.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14959) Problem Reading partitioned ORC or Parquet files
[ https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267410#comment-15267410 ]

Gang Wu commented on SPARK-14959:
-

[~syepes] I hit the same exception when I try to query a partitioned table on HDFS, using the latest commit on the master branch.

> Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
> Reporter: Sebastian YEPES FERNANDEZ
>
> Hello,
> I have noticed that in the past days there is an issue when trying to read
> partitioned files from HDFS.
> I am running on Spark master branch #c544356
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, "world"), Data(1, "there")) )
> scala> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> java.io.FileNotFoundException: Path is not a file: /user/spark/test.parquet/id=0
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
> at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
> at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
> at org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
> at org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372)
> at org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
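For completeness, the failing read is presumably the plain load of the partitioned directory; a sketch using the same path as the reproduction above, not part of the original report:

{code:title=ReadSide.scala (sketch)|borderStyle=solid}
// Presumed read path: loading the partitioned dataset root makes Spark list
// /user/spark/test.parquet, and the listing treats the partition directory
// id=0 as a file, raising the FileNotFoundException shown above.
val back = spark.read.parquet("/user/spark/test.parquet")
back.show()
{code}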