[jira] [Commented] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525144#comment-15525144
 ] 

Gang Wu commented on SPARK-17672:
-

They are similar but different.
This JIRA deals with how one specific appId is looked up in the whole list 
(returned from the map). SPARK-17671 deals with how many app infos are fetched 
from the map.

> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When there are 10K application histories in the history server back end, it 
> can take a very long time even to load a single application history page. 
> After some investigation, I found that the root cause was the following piece 
> of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, here the 
> code transforms the map into an iterator and then uses the find() API, which is 
> O(n), instead of the O(1) map.get() operation.
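
As a small, self-contained illustration of the complexity point above (this is 
not Spark code; AppInfo and the sample map are made up for the example), a keyed 
map lookup avoids the linear scan that find() performs:

{code:title=LookupSketch.scala|borderStyle=solid}
import scala.collection.mutable

// Toy stand-in for the history server's application records.
case class AppInfo(id: String, name: String)

object LookupSketch extends App {
  // Stand-in for the LinkedHashMap of application infos kept by the backend.
  val apps = mutable.LinkedHashMap(
    "app-001" -> AppInfo("app-001", "job one"),
    "app-002" -> AppInfo("app-002", "job two"))

  // What the code above effectively does: walk the values until the id matches, O(n).
  val viaFind = apps.values.find(_.id == "app-002")

  // What a keyed lookup would do instead: fetch directly by key, O(1).
  val viaGet = apps.get("app-002")

  println(viaFind == viaGet)  // same result, but the costs diverge once there are 10K+ entries
}
{code}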






[jira] [Commented] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524593#comment-15524593
 ] 

Gang Wu commented on SPARK-17672:
-

Hi [~ajbozarth], can you take a look at the PR? Thanks!

> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When there are 10K application histories in the history server back end, it 
> can take a very long time even to load a single application history page. 
> After some investigation, I found that the root cause was the following piece 
> of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, here the 
> code transforms the map into an iterator and then uses the find() API, which is 
> O(n), instead of the O(1) map.get() operation.






[jira] [Commented] (SPARK-17671) Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications

2016-09-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524595#comment-15524595
 ] 

Gang Wu commented on SPARK-17671:
-

Hi [~ajbozarth], can you take a look at the PR? Thanks!

> Spark 2.0 history server summary page is slow even set 
> spark.history.ui.maxApplications
> ---
>
> Key: SPARK-17671
> URL: https://issues.apache.org/jira/browse/SPARK-17671
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> This is a subsequent task of 
> [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the 
> fix of SPARK-17243 (limit the number of applications in the JSON string 
> transferred from history server backend to web UI frontend), the history 
> server does display the target number of history summaries. 
> However, when there are more than 10k application histories, it still gets 
> slower and slower. The problem is in the following code:
> {code:title=ApplicationListResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class ApplicationListResource(uiRoot: UIRoot) {
>   @GET
>   def appList(
>   @QueryParam("status") status: JList[ApplicationStatus],
>   @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: 
> SimpleDateParam,
>   @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: 
> SimpleDateParam,
>   @QueryParam("limit") limit: Integer)
>   : Iterator[ApplicationInfo] = {
> // although there is a limit operation in the end
> // the following line still does a transformation for all history 
> // in the list
> val allApps = uiRoot.getApplicationInfoList
> 
> // ...
> // irrelevant code is omitted 
> // ...
> if (limit != null) {
>   appList.take(limit)
> } else {
>   appList
> }
>   }
> }
> {code}
> What **uiRoot.getApplicationInfoList** does is transform every application 
> history from class ApplicationHistoryInfo to class ApplicationInfo. So if 
> there are 10k applications, 10k transformations are done even though we have 
> limited the result to 5000 entries here.
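
To make the cost concrete, here is a small, self-contained Scala illustration 
(not Spark code; the class and field names are made up) of the difference 
between transforming everything eagerly before take() and transforming lazily 
through an iterator:

{code:title=LazyLimitSketch.scala|borderStyle=solid}
object LazyLimitSketch extends App {
  // Toy stand-ins for ApplicationHistoryInfo and ApplicationInfo.
  case class HistoryInfo(id: Int)
  case class Info(id: Int)

  val histories = (1 to 10000).map(i => HistoryInfo(i)).toList
  val limit = 5000
  var conversions = 0

  def convert(h: HistoryInfo): Info = { conversions += 1; Info(h.id) }

  // Eager: every one of the 10k entries is converted, then half are thrown away.
  conversions = 0
  histories.map(convert).take(limit)
  println(s"eager conversions: $conversions")   // 10000

  // Lazy: only the entries actually pulled by take() are converted.
  conversions = 0
  histories.iterator.map(convert).take(limit).toList
  println(s"lazy conversions:  $conversions")   // 5000
}
{code}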






[jira] [Commented] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524424#comment-15524424
 ] 

Gang Wu commented on SPARK-17672:
-

I'm working on a fix and will send a PR soon.

> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When there are 10K application histories in the history server back end, it 
> can take a very long time even to load a single application history page. 
> After some investigation, I found that the root cause was the following piece 
> of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, here the 
> code transforms the map into an iterator and then uses the find() API, which is 
> O(n), instead of the O(1) map.get() operation.






[jira] [Updated] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-26 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17672:

Description: 
When there are 10K application histories in the history server back end, it can 
take a very long time even to load a single application history page. After some 
investigation, I found that the root cause was the following piece of code: 

{code:title=OneApplicationResource.scala|borderStyle=solid}
@Produces(Array(MediaType.APPLICATION_JSON))
private[v1] class OneApplicationResource(uiRoot: UIRoot) {

  @GET
  def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
  }

}
{code}

Although all application history infos are stored in a LinkedHashMap, here the 
code transforms the map into an iterator and then uses the find() API, which is 
O(n), instead of the O(1) map.get() operation.

  was:
When there are 10K application histories in the history server back end, it can 
take a very long time even to load a single application history page. After some 
investigation, I found that the root cause was the following piece of code: 

{code:title=OneApplicationResource.scala|borderStyle=solid}
@Produces(Array(MediaType.APPLICATION_JSON))
private[v1] class OneApplicationResource(uiRoot: UIRoot) {

  @GET
  def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
  }

}
{code}

Although all application history infos are stored in a LinkedHashMap, here the 
code transforms the map into an iterator and then uses the find() API, which is 
O(n), instead of the O(1) map.get() operation.


> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When there are 10K application histories in the history server back end, it 
> can take a very long time even to load a single application history page. 
> After some investigation, I found that the root cause was the following piece 
> of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, here the 
> code transforms the map into an iterator and then uses the find() API, which is 
> O(n), instead of the O(1) map.get() operation.






[jira] [Created] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-26 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17672:
---

 Summary: Spark 2.0 history server web Ui takes too long for a 
single application
 Key: SPARK-17672
 URL: https://issues.apache.org/jira/browse/SPARK-17672
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Gang Wu


When there are 10K application histories in the history server back end, it can 
take a very long time even to load a single application history page. After some 
investigation, I found that the root cause was the following piece of code: 

{code:title=OneApplicationResource.scala|borderStyle=solid}
@Produces(Array(MediaType.APPLICATION_JSON))
private[v1] class OneApplicationResource(uiRoot: UIRoot) {

  @GET
  def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
  }

}
{code}

Although all application history infos are stored in a LinkedHashMap, here the 
code transforms the map into an iterator and then uses the find() API, which is 
O(n), instead of the O(1) map.get() operation.






[jira] [Created] (SPARK-17671) Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications

2016-09-26 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17671:
---

 Summary: Spark 2.0 history server summary page is slow even set 
spark.history.ui.maxApplications
 Key: SPARK-17671
 URL: https://issues.apache.org/jira/browse/SPARK-17671
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
Reporter: Gang Wu


This is a subsequent task of 
[SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the fix 
of SPARK-17243 (limit the number of applications in the JSON string transferred 
from history server backend to web UI frontend), the history server does 
display the target number of history summaries. 

However, when there are more than 10k application histories, it still gets 
slower and slower. The problem is in the following code:

{code:title=ApplicationListResource.scala|borderStyle=solid}
@Produces(Array(MediaType.APPLICATION_JSON))
private[v1] class ApplicationListResource(uiRoot: UIRoot) {

  @GET
  def appList(
  @QueryParam("status") status: JList[ApplicationStatus],
  @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: 
SimpleDateParam,
  @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: 
SimpleDateParam,
  @QueryParam("limit") limit: Integer)
  : Iterator[ApplicationInfo] = {
// although there is a limit operation in the end
// the following line still does a transformation for all history 
// in the list
val allApps = uiRoot.getApplicationInfoList

// ...
// irrelevant code is omitted 
// ...

if (limit != null) {
  appList.take(limit)
} else {
  appList
}
  }
}
{code}

What **uiRoot.getApplicationInfoList** does is transform every application 
history from class ApplicationHistoryInfo to class ApplicationInfo. So if there 
are 10k applications, 10k transformations are done even though we have limited 
the result to 5000 entries here.









[jira] [Commented] (SPARK-17671) Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications

2016-09-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524173#comment-15524173
 ] 

Gang Wu commented on SPARK-17671:
-

I'm working on this and will send a pull request soon.

> Spark 2.0 history server summary page is slow even set 
> spark.history.ui.maxApplications
> ---
>
> Key: SPARK-17671
> URL: https://issues.apache.org/jira/browse/SPARK-17671
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> This is a subsequent task of 
> [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the 
> fix of SPARK-17243 (limit the number of applications in the JSON string 
> transferred from history server backend to web UI frontend), the history 
> server does display the target number of history summaries. 
> However, when there are more than 10k application histories, it still gets 
> slower and slower. The problem is in the following code:
> {code:title=ApplicationListResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class ApplicationListResource(uiRoot: UIRoot) {
>   @GET
>   def appList(
>   @QueryParam("status") status: JList[ApplicationStatus],
>   @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: 
> SimpleDateParam,
>   @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: 
> SimpleDateParam,
>   @QueryParam("limit") limit: Integer)
>   : Iterator[ApplicationInfo] = {
> // although there is a limit operation in the end
> // the following line still does a transformation for all history 
> // in the list
> val allApps = uiRoot.getApplicationInfoList
> 
> // ...
> // irrelevant code is omitted 
> // ...
> if (limit != null) {
>   appList.take(limit)
> } else {
>   appList
> }
>   }
> }
> {code}
> What **uiRoot.getApplicationInfoList** does is transform every application 
> history from class ApplicationHistoryInfo to class ApplicationInfo. So if 
> there are 10k applications, 10k transformations are done even though we have 
> limited the result to 5000 entries here.






[jira] [Commented] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long

2016-09-22 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514019#comment-15514019
 ] 

Gang Wu commented on SPARK-17601:
-

[~hyukjin.kwon] Yes, I agree. I just created these JIRAs for issues we met in 
production; there can definitely be more issues for ORC, Parquet, etc. Schema 
evolution is always painful to tackle. It seems you are working on this. Would 
you mind sharing a bit more about your plan there? I'd like to know. Thanks!

> SparkSQL vectorization cannot handle schema evolution for parquet tables when 
> parquet files use Int whereas DataFrame uses Long
> ---
>
> Key: SPARK-17601
> URL: https://issues.apache.org/jira/browse/SPARK-17601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> This is a JIRA related to SPARK-17477.
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint). If we use vectorization in 
> SparkSQL, we get the following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   

[jira] [Created] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long

2016-09-19 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17601:
---

 Summary: SparkSQL vectorization cannot handle schema evolution for 
parquet tables when parquet files use Int whereas DataFrame uses Long
 Key: SPARK-17601
 URL: https://issues.apache.org/jira/browse/SPARK-17601
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Gang Wu


This is a JIRA related to SPARK-17477.

When using SparkSession to read a Hive table that is stored as parquet files, a 
column may have undergone a schema evolution from int to long: some old parquet 
files use int for the column while some new parquet files use long. In the Hive 
metastore, the type is long (bigint). If we use vectorization in SparkSQL, we 
get the following exception:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at 

[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-15 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494681#comment-15494681
 ] 

Gang Wu commented on SPARK-17477:
-

Just confirmed that this also doesn't work with the vectorized reader. What I 
did is as follows:

1. Created a flat Hive table with schema "name: String, id: Long", but the 
parquet file backing it, which contains 100 rows, uses "name: String, id: Int".
2. Then ran the query "select * from table" and tried to show the result. It 
works fine with DataFrame.count() and DataFrame.printSchema(), but showing the 
rows fails (a rough sketch of this setup follows the stack trace below).

Got the following exception:

Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
  ... 48 elided
Caused by: java.lang.NullPointerException
  at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
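
The reproduction described above, expressed as a hedged sketch (the table name 
and setup are hypothetical; only the int-vs-long column mismatch matters):

{code:title=ReproSketch.scala|borderStyle=solid}
import org.apache.spark.sql.SparkSession

// Assumes a Hive table `demo_table` declared as (name STRING, id BIGINT) whose
// underlying parquet file was written with `id` as a 32-bit int.
object ReproSketch extends App {
  val spark = SparkSession.builder()
    .appName("int-to-long-schema-evolution-repro")
    .enableHiveSupport()
    .getOrCreate()

  val df = spark.sql("select * from demo_table")

  // Per the report, these succeed, presumably because they never need to decode
  // the mismatched column values:
  df.printSchema()
  println(df.count())

  // This fails (NullPointerException / ClassCastException as in the stack traces
  // above), since materializing rows forces the reader to decode int-encoded
  // values as long:
  df.show()
}
{code}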

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int 

[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-12 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484629#comment-15484629
 ] 

Gang Wu commented on SPARK-17477:
-

[~hyukjin.kwon] I agree with you. Both issues target parquet data sources, but I 
think the problem applies to all data sources.

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int as its type while hive metastore has Long as its type
> --
>
> Key: SPARK-17477
> URL: https://issues.apache.org/jira/browse/SPARK-17477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint).
> Therefore, when I run the following:
> {quote}
> sparkSession.sql("select * from table").show()
> {quote}
> I got the following exception:
> {quote}
> 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
> (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 0 in block 0 in file 
> hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at 
> 

[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-12 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17477:

Target Version/s:   (was: 2.1.0)

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int as its type while hive metastore has Long as its type
> --
>
> Key: SPARK-17477
> URL: https://issues.apache.org/jira/browse/SPARK-17477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint).
> Therefore, when I run the following:
> {quote}
> sparkSession.sql("select * from table").show()
> {quote}
> I got the following exception:
> {quote}
> 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
> (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 0 in block 0 in file 
> hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
>   ... 22 more
> {quote}
> But this kind of schema evolution (int => 

[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-09 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17477:

Shepherd:   (was: Gang Wu)

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int as its type while hive metastore has Long as its type
> --
>
> Key: SPARK-17477
> URL: https://issues.apache.org/jira/browse/SPARK-17477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint).
> Therefore, when I run the following:
> {quote}
> sparkSession.sql("select * from table").show()
> {quote}
> I got the following exception:
> {quote}
> 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
> (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 0 in block 0 in file 
> hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
>   ... 22 more
> {quote}
> But this kind of schema evolution (int => long) 

[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-09 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477945#comment-15477945
 ] 

Gang Wu commented on SPARK-17477:
-

I'm working on a fix for this issue. Will send pull request soon.

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int as its type while hive metastore has Long as its type
> --
>
> Key: SPARK-17477
> URL: https://issues.apache.org/jira/browse/SPARK-17477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint).
> Therefore, when I run the following:
> {quote}
> sparkSession.sql("select * from table").show()
> {quote}
> I got the following exception:
> {quote}
> 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
> (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 0 in block 0 in file 
> hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
> 

[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-09 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17477:

Shepherd: Gang Wu

> SparkSQL cannot handle schema evolution from Int -> Long when parquet files 
> have Int as its type while hive metastore has Long as its type
> --
>
> Key: SPARK-17477
> URL: https://issues.apache.org/jira/browse/SPARK-17477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> When using SparkSession to read a Hive table that is stored as parquet files, 
> a column may have undergone a schema evolution from int to long: some old 
> parquet files use int for the column while some new parquet files use long. 
> In the Hive metastore, the type is long (bigint).
> Therefore, when I run the following:
> {quote}
> sparkSession.sql("select * from table").show()
> {quote}
> I got the following exception:
> {quote}
> 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
> (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 0 in block 0 in file 
> hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
>   at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
>   at 
> org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
>   ... 22 more
> {quote}
> But this kind of schema evolution (int => long) is valid 

[jira] [Created] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type

2016-09-09 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17477:
---

 Summary: SparkSQL cannot handle schema evolution from Int -> Long 
when parquet files have Int as its type while hive metastore has Long as its 
type
 Key: SPARK-17477
 URL: https://issues.apache.org/jira/browse/SPARK-17477
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Gang Wu


When using SparkSession to read a Hive table that is stored as parquet files, a 
column may have undergone a schema evolution from int to long: some old parquet 
files use int for the column while some new parquet files use long. In the Hive 
metastore, the type is long (bigint).

Therefore, when I run the following:
{quote}
sparkSession.sql("select * from table").show()
{quote}

I got the following exception:
{quote}
16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
(TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 0 in block 0 in file 
hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 22 more
{quote}

But this kind of schema evolution (int => long) is valid in Hive and Presto.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447285#comment-15447285
 ] 

Gang Wu commented on SPARK-17243:
-

Yup, you're right. I finally found some app_ids that are not in the summary 
page but whose URLs can still be accessed. Our cluster has 100K+ app_ids, so it 
took me a long time to figure it out. Thanks for your help!

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447172#comment-15447172
 ] 

Gang Wu commented on SPARK-17243:
-

I pulled in the latest change. I can get the full application list from the 
REST endpoint /api/v1/applications (without the limit parameter). However, the 
web UI says the app_id is not found when I go to a specific app_id. I can get 
it with the Spark 1.5 history server.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-29 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447099#comment-15447099
 ] 

Gang Wu commented on SPARK-17243:
-

I've tested this PR. It indeed reduces the size of the application metadata 
list. I think it is intended to restrict only the summary page; applications 
dropped from the summary web UI should still be available via their URLs, like 
http://x.x.x.x:18080/history/application_id/jobs. However, those dropped ones 
cannot be accessed. This may heavily decrease the usability of the history 
server.
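
A quick sketch of how to probe this from outside the browser (host, port, and 
application id below are placeholders):

{code:title=Sketch: probing a dropped application's UI URL|borderStyle=solid}
import java.net.{HttpURLConnection, URL}

// Placeholders: point these at your history server and at a dropped application.
val base  = "http://x.x.x.x:18080"
val appId = "application_1470000000000_0001"

def status(path: String): Int = {
  val conn = new URL(base + path).openConnection().asInstanceOf[HttpURLConnection]
  try conn.getResponseCode finally conn.disconnect()
}

// 200 would mean the per-application UI is still served even though the app is
// missing from the summary list; a 404 here matches what I observed.
println(status(s"/history/$appId/jobs"))
{code}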

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-26 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439517#comment-15439517
 ] 

Gang Wu commented on SPARK-17243:
-

Thanks [~ajbozarth]! Let me know when it is done.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437691#comment-15437691
 ] 

Gang Wu commented on SPARK-17243:
-

This doesn't work. That setting is for the cache of web UIs, not for the 
application metadata. The default value is 50, which is already small enough.

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437688#comment-15437688
 ] 

Gang Wu commented on SPARK-17243:
-

Hi Alex, I think in Spark 1.5 the history server obtains all application 
summary metadata directly from the FsHistoryProvider class; you can check 
HistoryPage.scala. In Spark 2.0 it instead parses a JSON string (in 
historypage.js), which is MUCH slower than before. Would it make sense to go 
back to the old way?

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17243:

Summary: Spark 2.0 history server summary page gets stuck at "loading 
history summary" with 10K+ application history  (was: Spark history server 
summary page gets stuck at "loading history summary" with 10K+ application 
history)

> Spark 2.0 history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Created] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)
Gang Wu created SPARK-17243:
---

 Summary: Spark history server summary page gets stuck at "loading 
history summary" with 10K+ application history
 Key: SPARK-17243
 URL: https://issues.apache.org/jira/browse/SPARK-17243
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.0
 Environment: Linux
Reporter: Gang Wu
Priority: Blocker


The summary page of the Spark history server web UI keeps displaying "Loading 
history summary..." indefinitely and crashes the browser when there are more 
than 10K application history event logs on HDFS.

I did some investigation: the "historypage.js" file sends a REST request to the 
history server's /api/v1/applications endpoint and gets back a JSON response. 
When there are more than 10K applications inside the event log directory, it 
takes forever to parse them and render the page. With only hundreds or 
thousands of applications it runs fine.
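
For reference, a rough way to see how much JSON the summary page has to pull 
down and parse (the history server host and port below are assumptions):

{code:title=Sketch: sizing the /api/v1/applications payload|borderStyle=solid}
import scala.io.Source

// Placeholder host/port for the history server REST API.
val url  = "http://localhost:18080/api/v1/applications"
val body = Source.fromURL(url).mkString

// Very rough proxy for the number of applications returned: each entry in the
// JSON array carries exactly one "id" field.
val appCount = "\"id\"".r.findAllIn(body).length

println(s"payload: ${body.length} characters, roughly $appCount applications")
{code}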






[jira] [Updated] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history

2016-08-25 Thread Gang Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated SPARK-17243:

Description: 
The summary page of the Spark 2.0 history server web UI keeps displaying 
"Loading history summary..." indefinitely and crashes the browser when there 
are more than 10K application history event logs on HDFS.

I did some investigation: the "historypage.js" file sends a REST request to the 
history server's /api/v1/applications endpoint and gets back a JSON response. 
When there are more than 10K applications inside the event log directory, it 
takes forever to parse them and render the page. With only hundreds or 
thousands of applications it runs fine.

  was:
The summary page of the Spark history server web UI keeps displaying "Loading 
history summary..." indefinitely and crashes the browser when there are more 
than 10K application history event logs on HDFS.

I did some investigation: the "historypage.js" file sends a REST request to the 
history server's /api/v1/applications endpoint and gets back a JSON response. 
When there are more than 10K applications inside the event log directory, it 
takes forever to parse them and render the page. With only hundreds or 
thousands of applications it runs fine.


> Spark history server summary page gets stuck at "loading history summary" 
> with 10K+ application history
> ---
>
> Key: SPARK-17243
> URL: https://issues.apache.org/jira/browse/SPARK-17243
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Gang Wu
>Priority: Blocker
>
> The summary page of the Spark 2.0 history server web UI keeps displaying 
> "Loading history summary..." indefinitely and crashes the browser when there 
> are more than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to 
> the history server's /api/v1/applications endpoint and gets back a JSON 
> response. When there are more than 10K applications inside the event log 
> directory, it takes forever to parse them and render the page. With only 
> hundreds or thousands of applications it runs fine.






[jira] [Commented] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-05-02 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267410#comment-15267410
 ] 

Gang Wu commented on SPARK-14959:
-

[~syepes] I faced the same exception when I tried to query a partitioned table 
on HDFS, using the latest commit on the master branch.
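
For concreteness, a sketch of the kind of query that hit it for me (the table 
name and partition column are made up):

{code:title=Sketch: querying a partitioned parquet table|borderStyle=solid}
import org.apache.spark.sql.SparkSession

// Assumed: a parquet-backed Hive table partitioned by dt, stored on HDFS.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("SELECT * FROM events WHERE dt = '2016-05-01' LIMIT 10").show()
{code}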

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>
> Hello,
> I have noticed that in the past few days there has been an issue when trying 
> to read partitioned files from HDFS.
> I am running Spark master branch at commit #c544356.
> The write actually works but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
>