[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-22 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515491#comment-15515491
 ] 

DB Tsai commented on SPARK-17134:
-

I did the benchmark again. In the old implementation, one iteration takes 1.3 hrs; 
in the new implementation, one iteration takes 3.5 hrs. I ran both experiments in 
the same Spark job for fairness, since they then get the same number of executors. 
I suspect the old implementation caches the standardized dataset, which results in 
better performance.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.
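
As a rough illustration of what "level 2 BLAS" means here (matrix-vector operations), the per-example margins for all classes can be computed with a single {{dgemv}} call instead of one dot product per class. This is only a sketch against the netlib-java API that Spark uses, not the LogisticAggregator code; all names below are illustrative.

{code}
import com.github.fommil.netlib.BLAS.{getInstance => blas}

// Illustrative sketch only: margins = B * x for all numClasses classes in one
// level-2 BLAS call (dgemv). `coefficients` is assumed to be a column-major
// (numClasses x numFeatures) array; none of these names come from LogisticAggregator.
def margins(coefficients: Array[Double], numClasses: Int, numFeatures: Int,
            features: Array[Double]): Array[Double] = {
  val result = new Array[Double](numClasses)
  blas.dgemv("N", numClasses, numFeatures, 1.0, coefficients, numClasses,
    features, 1, 0.0, result, 1)
  result
}
{code}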






[jira] [Resolved] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-16240.
---
   Resolution: Fixed
Fix Version/s: 2.0.1

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.0.1, 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models because one of the parameter names was changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
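
A minimal sketch of the kind of special-case loading logic this could mean. The parameter key names below are purely hypothetical placeholders; the actual renamed parameter is not spelled out in this ticket.

{code}
// Hypothetical sketch: when loading 1.6 metadata, rewrite the old parameter key
// to its new name before applying the params. Key names are placeholders.
def migrateParamKeys(params: Map[String, String]): Map[String, String] = {
  val renamedKeys = Map("oldParamName" -> "newParamName")
  params.map { case (key, value) => (renamedKeys.getOrElse(key, key), value) }
}
{code}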






[jira] [Updated] (SPARK-17642) support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2016-09-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17642:

Issue Type: Improvement  (was: Task)

> support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics including 
> histograms are supported.
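
For reference, a hypothetical invocation of the proposed command from Scala. The syntax is taken from the issue title; the command and its output format are not implemented at the time of this ticket, and the table and column names are placeholders.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("desc-column").getOrCreate()
// Proposed syntax from the issue title; "my_table" and "my_column" are placeholders.
spark.sql("DESC FORMATTED my_table my_column").show(truncate = false)
{code}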






[jira] [Updated] (SPARK-17642) support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2016-09-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17642:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-16026

> support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics including 
> histograms are supported.






[jira] [Updated] (SPARK-17642) support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2016-09-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17642:

Affects Version/s: (was: 2.0.0)
   2.1.0

> support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics including 
> histograms are supported.






[jira] [Resolved] (SPARK-16719) RandomForest: communicate fewer trees on each iteration

2016-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-16719.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14359
[https://github.com/apache/spark/pull/14359]

> RandomForest: communicate fewer trees on each iteration
> ---
>
> Key: SPARK-16719
> URL: https://issues.apache.org/jira/browse/SPARK-16719
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.1.0
>
>
> RandomForest currently sends the entire forest to each worker on each 
> iteration.  This is because (a) the node queue is FIFO and (b) the closure 
> references the entire array of trees ({{topNodes}}).  (a) causes RFs to 
> handle splits in many trees, especially early on in learning.  (b) sends all 
> trees explicitly.
> Proposal:
> (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a 
> few trees before focusing on others.
> (b) Change topNodes to pass only the trees required on that iteration.
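
A toy sketch of proposal (a), using plain Scala collections rather than the actual RandomForest node-queue types: with a stack (LIFO), the most recently pushed nodes, which tend to belong to the same tree, are split first, so each iteration touches only one or a few trees.

{code}
import scala.collection.mutable

// Toy illustration of FIFO vs. LIFO ordering of (treeIndex, nodeId) work items.
// This case class is a placeholder, not the RandomForest internals.
case class NodeToSplit(treeIndex: Int, nodeId: Int)

val fifoQueue = mutable.Queue[NodeToSplit]()  // current behavior: rotates through all trees
val lifoStack = mutable.Stack[NodeToSplit]()  // proposal (a): finishes one tree's nodes first

Seq(NodeToSplit(0, 1), NodeToSplit(1, 1), NodeToSplit(0, 2)).foreach { n =>
  fifoQueue.enqueue(n)
  lifoStack.push(n)
}
// fifoQueue.dequeue() returns NodeToSplit(0, 1); lifoStack.pop() returns NodeToSplit(0, 2)
{code}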






[jira] [Resolved] (SPARK-17639) Need to add jce.jar to bootclasspath

2016-09-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17639.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Need to add jce.jar to bootclasspath
> 
>
> Key: SPARK-17639
> URL: https://issues.apache.org/jira/browse/SPARK-17639
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The following PR is failing because jce.jar is missing from the classpath 
> when compiling the code:
> https://github.com/apache/spark/pull/15172






[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515247#comment-15515247
 ] 

Zhan Zhang commented on SPARK-17637:


cc [~rxin]
A quick prototype shows that, for a tested pipeline, the job can save around 45% 
of the reserved CPU and memory when dynamic allocation is enabled.

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler assigns tasks to executors in round-robin fashion. 
> This is great because it distributes the load evenly across the cluster, but it 
> leads to significant resource waste in some cases, especially when dynamic 
> allocation is enabled.
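
A rough sketch of what "packed" scheduling could mean, compared with round-robin. This is a standalone illustration with made-up types, not the Spark scheduler API.

{code}
// Placeholder type; not Spark's scheduler internals.
case class ExecutorSlot(id: String, var freeCores: Int)

// Round-robin spreads tasks evenly; a "packed" assignment picks the executor with
// the fewest free cores that can still take a task, so other executors go idle and
// can be released under dynamic allocation.
def pickPacked(executors: Seq[ExecutorSlot]): Option[ExecutorSlot] =
  executors.filter(_.freeCores > 0).sortBy(_.freeCores).headOption
{code}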






[jira] [Commented] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515232#comment-15515232
 ] 

Apache Spark commented on SPARK-17604:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/15210

> Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-17604
> URL: https://issues.apache.org/jira/browse/SPARK-17604
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
> periodically (every 10 batches by default), which means a compacted batch file 
> contains all file entries processed so far. As time passes, the compacted batch 
> file accumulates into a relatively large file. 
> With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, but 
> the log still keeps the full records. This is unnecessary and quite 
> time-consuming during recovery, so this issue proposes adding the ability to 
> purge aged file entries from the {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.
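
A minimal sketch of the proposed purging, with an assumed entry type (the real FileEntry and compaction code are not shown here): when writing a compacted batch, drop entries older than a retention threshold instead of carrying the full history forward.

{code}
// Placeholder entry type with an assumed timestamp field.
case class FileEntry(path: String, timestamp: Long)

// Keep only entries younger than maxAgeMs when compacting.
def compactWithPurge(entries: Seq[FileEntry], nowMs: Long, maxAgeMs: Long): Seq[FileEntry] =
  entries.filter(e => nowMs - e.timestamp <= maxAgeMs)
{code}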






[jira] [Assigned] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17604:


Assignee: (was: Apache Spark)

> Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-17604
> URL: https://issues.apache.org/jira/browse/SPARK-17604
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
> periodically (every 10 batches by default), which means a compacted batch file 
> contains all file entries processed so far. As time passes, the compacted batch 
> file accumulates into a relatively large file. 
> With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, but 
> the log still keeps the full records. This is unnecessary and quite 
> time-consuming during recovery, so this issue proposes adding the ability to 
> purge aged file entries from the {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.






[jira] [Assigned] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17604:


Assignee: Apache Spark

> Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-17604
> URL: https://issues.apache.org/jira/browse/SPARK-17604
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
> periodically (every 10 batches by default), which means a compacted batch file 
> contains all file entries processed so far. As time passes, the compacted batch 
> file accumulates into a relatively large file. 
> With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, but 
> the log still keeps the full records. This is unnecessary and quite 
> time-consuming during recovery, so this issue proposes adding the ability to 
> purge aged file entries from the {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.






[jira] [Commented] (SPARK-17634) Spark job hangs when using dapply

2016-09-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515176#comment-15515176
 ] 

Felix Cheung commented on SPARK-17634:
--

How long have you let it run?

> Spark job hangs when using dapply
> -
>
> Key: SPARK-17634
> URL: https://issues.apache.org/jira/browse/SPARK-17634
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>Priority: Critical
>
> I'm running into an issue when using dapply on YARN. I have a data frame backed 
> by around 200 Parquet files totaling roughly 2GB. When I load it with the new 
> partition coalescing, it ends up with around 20 partitions of roughly 100MB each. 
> The data frame itself has 4 columns of integers and doubles. If I run a count 
> over it, things work fine.
> However, if I add a {{dapply}} between the read and the {{count}} that just 
> applies an identity function, the tasks hang and make no progress. Both the R 
> and Java processes are running on the Spark nodes and are listening on the 
> {{SPARKR_WORKER_PORT}}.
> {{result <- dapply(df, function(x){x}, SparkR::schema(df))}}
> I took a jstack of the Java process and see that it is just listening on the 
> socket and never seems to make any progress. It is harder to tell what the R 
> process is doing.
> {code}
> Thread 112823: (state = IN_NATIVE)
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], 
> int, int, int) @bci=0 (Interpreted frame)
>  - java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, 
> int, int) @bci=8, line=116 (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=170 
> (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 
> (Interpreted frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Interpreted frame)
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>  - org.apache.spark.api.r.RRunner.org$apache$spark$api$r$RRunner$$read() 
> @bci=4, line=212 (Interpreted frame)
>  - 
> org.apache.spark.api.r.RRunner$$anon$1.(org.apache.spark.api.r.RRunner) 
> @bci=25, line=96 (Interpreted frame)
>  - org.apache.spark.api.r.RRunner.compute(scala.collection.Iterator, int) 
> @bci=109, line=87 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(scala.collection.Iterator)
>  @bci=322, line=59 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(java.lang.Object)
>  @bci=5, line=29 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(scala.collection.Iterator)
>  @bci=59, line=178 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(java.lang.Object)
>  @bci=5, line=175 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(org.apache.spark.TaskContext,
>  int, scala.collection.Iterator) @bci=8, line=784 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(java.lang.Object,
>  java.lang.Object, java.lang.Object) @bci=13, line=784 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=27, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
>  @bci=168, line=79 (Interpreted frame)
>  - 
> 

[jira] [Updated] (SPARK-17642) support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2016-09-22 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17642:
-
Description: 
Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
We should resolve this jira after column-level statistics including histograms 
are supported.


  was:
Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.



> support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics including 
> histograms are supported.






[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views

2016-09-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17502:

Fix Version/s: (was: 2.0.1)
   2.0.2

> Multiple Bugs in DDL Statements on Temporary Views 
> ---
>
> Key: SPARK-17502
> URL: https://issues.apache.org/jira/browse/SPARK-17502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.2, 2.1.0
>
>
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error should be `NoSuchTableException` for partition-related 
> ALTER TABLE commands. However, it always reports a confusing error message. 
> For example, 
> {noformat}
> Partition spec is invalid. The spec (a, b) must match the partition spec () 
> defined in table '`testview`';
> {noformat}
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error should be `NoSuchTableException` for `ALTER TABLE ... 
> UNSET TBLPROPERTIES`. However, it instead reports a missing table property. 
> For example, 
> {noformat}
> Attempted to unset non-existent property 'p' in table '`testView`';
> {noformat}
> - When `ANALYZE TABLE` is called on a view or a temporary view, we should 
> issue an error message. However, it reports a strange error:
> {noformat}
> ANALYZE TABLE is not supported for Project
> {noformat}
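
As an illustration of the expected behavior for the first two bullets, a self-contained sketch; the exception class and helper below are simplified stand-ins, not Spark's SessionCatalog API.

{code}
// Simplified stand-ins for illustration only.
class NoSuchTableException(table: String)
  extends Exception(s"Table or view '$table' not found")

def requirePersistentTable(name: String,
                           persistentTables: Set[String],
                           tempViews: Set[String]): Unit = {
  if (!persistentTables.contains(name)) {
    // Even when tempViews.contains(name), partition DDL and UNSET TBLPROPERTIES
    // should still report that the persistent table does not exist.
    throw new NoSuchTableException(name)
  }
}
{code}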






[jira] [Updated] (SPARK-17609) SessionCatalog.tableExists should not check temp view

2016-09-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17609:

Target Version/s: 2.0.2, 2.1.0  (was: 2.1.0)

> SessionCatalog.tableExists should not check temp view
> -
>
> Key: SPARK-17609
> URL: https://issues.apache.org/jira/browse/SPARK-17609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>







[jira] [Updated] (SPARK-17609) SessionCatalog.tableExists should not check temp view

2016-09-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17609:

Fix Version/s: 2.0.1

> SessionCatalog.tableExists should not check temp view
> -
>
> Key: SPARK-17609
> URL: https://issues.apache.org/jira/browse/SPARK-17609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>







[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views

2016-09-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17502:

Fix Version/s: 2.0.1

> Multiple Bugs in DDL Statements on Temporary Views 
> ---
>
> Key: SPARK-17502
> URL: https://issues.apache.org/jira/browse/SPARK-17502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.2, 2.1.0
>
>
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error should be `NoSuchTableException` for partition-related 
> ALTER TABLE commands. However, it always reports a confusing error message. 
> For example, 
> {noformat}
> Partition spec is invalid. The spec (a, b) must match the partition spec () 
> defined in table '`testview`';
> {noformat}
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error should be `NoSuchTableException` for `ALTER TABLE ... 
> UNSET TBLPROPERTIES`. However, it instead reports a missing table property. 
> For example, 
> {noformat}
> Attempted to unset non-existent property 'p' in table '`testView`';
> {noformat}
> - When `ANALYZE TABLE` is called on a view or a temporary view, we should 
> issue an error message. However, it reports a strange error:
> {noformat}
> ANALYZE TABLE is not supported for Project
> {noformat}






[jira] [Updated] (SPARK-17609) SessionCatalog.tableExists should not check temp view

2016-09-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17609:

Fix Version/s: (was: 2.0.1)
   2.0.2

> SessionCatalog.tableExists should not check temp view
> -
>
> Key: SPARK-17609
> URL: https://issues.apache.org/jira/browse/SPARK-17609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>







[jira] [Assigned] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17641:


Assignee: Apache Spark

> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> select collect_set(null) from (select 1) tmp;
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
>   at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
>   at 
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
>   at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
> {code}
> cc: [~yhuai]
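
A minimal sketch of the desired behavior, not the actual Collect implementation: skip null inputs during the update step so the backing flat hash set never sees a null element.

{code}
import scala.collection.mutable

// Simplified buffer update; the real aggregate operates on Catalyst rows.
val buffer = mutable.HashSet.empty[Any]
def update(input: Any): Unit = if (input != null) buffer += input
{code}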






[jira] [Assigned] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17641:


Assignee: (was: Apache Spark)

> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> select collect_set(null) from (select 1) tmp;
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
>   at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
>   at 
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
>   at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
> {code}
> cc: [~yhuai]






[jira] [Assigned] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17643:


Assignee: Michael Armbrust  (was: Apache Spark)

> Remove comparable requirement from Offset
> -
>
> Key: SPARK-17643
> URL: https://issues.apache.org/jira/browse/SPARK-17643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> For some sources, it can be hard to define a strict ordering that is based 
> only on the data in the offset.  Since we don't really utilize this 
> comparison, let's remove it.






[jira] [Assigned] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17643:


Assignee: Apache Spark  (was: Michael Armbrust)

> Remove comparable requirement from Offset
> -
>
> Key: SPARK-17643
> URL: https://issues.apache.org/jira/browse/SPARK-17643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> For some sources, it can be hard to define a strict ordering that is based 
> only on the data in the offset.  Since we don't really utilize this 
> comparison, let's remove it.






[jira] [Commented] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515058#comment-15515058
 ] 

Apache Spark commented on SPARK-17643:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/15207

> Remove comparable requirement from Offset
> -
>
> Key: SPARK-17643
> URL: https://issues.apache.org/jira/browse/SPARK-17643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> For some sources, it can be hard to define a strict ordering that is based 
> only on the data in the offset.  Since we don't really utilize this 
> comparison, let's remove it.






[jira] [Created] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-17643:


 Summary: Remove comparable requirement from Offset
 Key: SPARK-17643
 URL: https://issues.apache.org/jira/browse/SPARK-17643
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Michael Armbrust


For some sources, it can be hard to define a strict ordering that is based only 
on the data in the offset. Since we don't really utilize this comparison, let's 
remove it.
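
A rough sketch of that direction, simplified and with an assumed serialized representation, not the exact Offset interface after the change: a source only needs to tell whether two offsets are equal (to detect "no new data"), not how they are ordered.

{code}
// Simplified stand-in for illustration: equality by serialized form, no ordering.
trait Offset {
  def json: String

  final override def equals(other: Any): Boolean = other match {
    case o: Offset => json == o.json
    case _         => false
  }
  final override def hashCode(): Int = json.hashCode
}

// Example concrete offset; the name mirrors a simple long-based offset.
class LongOffset(val value: Long) extends Offset {
  override def json: String = value.toString
}
{code}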






[jira] [Assigned] (SPARK-17643) Remove comparable requirement from Offset

2016-09-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-17643:


Assignee: Michael Armbrust

> Remove comparable requirement from Offset
> -
>
> Key: SPARK-17643
> URL: https://issues.apache.org/jira/browse/SPARK-17643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> For some sources, it can be hard to define a strict ordering that is based 
> only on the data in the offset.  Since we don't really utilize this 
> comparison, let's remove it.






[jira] [Commented] (SPARK-6235) Address various 2G limits

2016-09-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15515042#comment-15515042
 ] 

Guoqiang Li commented on SPARK-6235:


ping [~rxin]

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Created] (SPARK-17642) support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2016-09-22 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-17642:


 Summary: support DESC FORMATTED TABLE COLUMN command to show 
column-level statistics
 Key: SPARK-17642
 URL: https://issues.apache.org/jira/browse/SPARK-17642
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Zhenhua Wang


Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.







[jira] [Resolved] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17616.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Resolving as fixed by Herman's PR.

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once. I still 
> don't understand what I'm doing incorrectly.






[jira] [Updated] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17635:

Affects Version/s: (was: 2.0.0)

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: yucai
>Assignee: yucai
>  Labels: correctness
> Fix For: 2.1.0
>
>
> "agg_plan" is hardcoded in HashAggregateExec, which has potential issue.
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I faced this issue while working on sort agg's codegen support. The following 
> change triggers the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}
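
An illustrative sketch of the fix direction, as simplified string building rather than HashAggregateExec's actual codegen: derive the plan variable name from the operator's prefix instead of hardcoding "agg_plan", so the generated code stays valid when the prefix changes (e.g. to "hagg" or "sagg"). All names here are placeholders.

{code}
// Simplified illustration; variable and class names are placeholders.
def planTerm(variablePrefix: String): String = s"${variablePrefix}_plan"

def hashMapInit(variablePrefix: String,
                fastHashMapTerm: String,
                fastHashMapClassName: String): String =
  s"$fastHashMapTerm = new $fastHashMapClassName(" +
    s"${planTerm(variablePrefix)}.getTaskMemoryManager(), " +
    s"${planTerm(variablePrefix)}.getEmptyAggregationBuffer());"
{code}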






[jira] [Resolved] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17635.
-
   Resolution: Fixed
 Assignee: yucai
Fix Version/s: 2.1.0

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: yucai
>Assignee: yucai
>  Labels: correctness
> Fix For: 2.1.0
>
>
> "agg_plan" is hardcoded in HashAggregateExec, which has potential issue.
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I faced this issue while working on sort agg's codegen support. The following 
> change triggers the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}






[jira] [Updated] (SPARK-17599) Folder deletion after globbing may fail StructuredStreaming jobs

2016-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17599:

Fix Version/s: 2.0.1

> Folder deletion after globbing may fail StructuredStreaming jobs
> 
>
> Key: SPARK-17599
> URL: https://issues.apache.org/jira/browse/SPARK-17599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.1, 2.1.0
>
>
> The FileStreamSource used by StructuredStreaming first resolves globs, and 
> then creates a ListingFileCatalog which listFiles with the resolved glob 
> patterns. If a folder is deleted after glob resolution but before the 
> ListingFileCatalog can list the files, we can run into a 
> 'FileNotFoundException'.
> This should not be a fatal exception for a streaming job; however, we should 
> log a warning message.
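
A minimal sketch of that tolerance under assumed context (a plain Hadoop FileSystem listing, not the ListingFileCatalog code): if a globbed directory disappears before it can be listed, warn and skip it instead of failing the streaming query.

{code}
import java.io.FileNotFoundException

import org.apache.hadoop.fs.{FileSystem, Path}

// List a directory, treating a post-glob deletion as a non-fatal, warned-about event.
def listOrSkip(fs: FileSystem, dir: Path): Seq[Path] =
  try {
    fs.listStatus(dir).map(_.getPath).toSeq
  } catch {
    case e: FileNotFoundException =>
      Console.err.println(s"WARN: skipping deleted path $dir: ${e.getMessage}")
      Seq.empty
  }
{code}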






[jira] [Updated] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17569:

Fix Version/s: 2.0.1

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.1, 2.1.0
>
>
> Structured Streaming's FileSource lists files and classifies them as Offsets. 
> Once this file list is committed to a metadata log for a batch, the file list is 
> turned into a "Batch FileSource" Relation, which acts as the source for the 
> incremental execution.
> When this "Batch FileSource" Relation is resolved, we re-check on the Driver that 
> every single file exists. This takes a horrible amount of time and is a total 
> waste; we can simply skip the file existence check during execution.






[jira] [Updated] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17641:
-
Target Version/s: 2.0.1, 2.1.0

> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> select collect_set(null) from (select 1) tmp;
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
>   at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
>   at 
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
>   at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
> {code}
> cc: [~yhuai]






[jira] [Updated] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17641:
-
Target Version/s: 2.1.0  (was: 2.0.1, 2.1.0)

> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> select collect_set(null) from (select 1) tmp;
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
>   at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
>   at 
> scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
>   at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
>   at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
> {code}
> cc: [~yhuai]






[jira] [Updated] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-17641:
--
Description: 
`collect_set` throws the following exception when there are null values. It 
should ignore null values to be consistent with other aggregation methods.

{code}
select collect_set(null) from (select 1) tmp;

java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
{code}

cc: [~yhuai]

  was:
`collect_set` throws the following exception when there are null values. It 
should ignore null values to be consistent with other aggregation methods.

{code}
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
{code}

cc: [~yhuai]


> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> select collect_set(null) from (select 1) tmp;
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> 

[jira] [Updated] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-17641:
--
Description: 
`collect_set` throws the following exception when there are null values. It 
should ignore null values to be consistent with other aggregation methods.

{code}
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
{code}

cc: [~yhuai]

  was:
`collect_set` throws the following exception when there are null values. It 
should ignore null values to be consistent with other aggregation methods.

{code}
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
{code}


> collect_set should ignore null values
> -
>
> Key: SPARK-17641
> URL: https://issues.apache.org/jira/browse/SPARK-17641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> `collect_set` throws the following exception when there are null values. It 
> should ignore null values to be consistent with other aggregation methods.
> {code}
> java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
> elements.
>   at 
> scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
>   at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
>   at 
> 

[jira] [Created] (SPARK-17641) collect_set should ignore null values

2016-09-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-17641:
-

 Summary: collect_set should ignore null values
 Key: SPARK-17641
 URL: https://issues.apache.org/jira/browse/SPARK-17641
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


`collect_set` throws the following exception when there are null values. It 
should ignore null values to be consistent with other aggregation methods.

{code}
java.lang.IllegalArgumentException: Flat hash tables cannot contain null 
elements.
at 
scala.collection.mutable.FlatHashTable$HashUtils$class.elemHashCode(FlatHashTable.scala:390)
at scala.collection.mutable.HashSet.elemHashCode(HashSet.scala:41)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:136)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:41)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:60)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Collect.update(collect.scala:64)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:170)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:186)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:180)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:115)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:150)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:225)
{code}
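
Until this is fixed, a minimal workaround sketch (my own illustration, not part of the ticket) is to drop nulls before aggregating, which gives the null-ignoring behavior requested here:

{code}
// Workaround sketch (illustrative only): filter out nulls before calling collect_set,
// which avoids the "Flat hash tables cannot contain null elements" error above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

val spark = SparkSession.builder().master("local[*]").appName("collect-set-nulls").getOrCreate()
import spark.implicits._

val df = Seq(Some(1), None, Some(2), None).toDF("x")
df.filter($"x".isNotNull).agg(collect_set($"x")).show()  // distinct non-null values, no exception
spark.stop()
{code}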



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17640:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Avoid using -1 as the default batchId for FileStreamSource.FileEntry
> 
>
> Key: SPARK-17640
> URL: https://issues.apache.org/jira/browse/SPARK-17640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17640:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Avoid using -1 as the default batchId for FileStreamSource.FileEntry
> 
>
> Key: SPARK-17640
> URL: https://issues.apache.org/jira/browse/SPARK-17640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514857#comment-15514857
 ] 

Apache Spark commented on SPARK-17640:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15206

> Avoid using -1 as the default batchId for FileStreamSource.FileEntry
> 
>
> Key: SPARK-17640
> URL: https://issues.apache.org/jira/browse/SPARK-17640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514853#comment-15514853
 ] 

Hyukjin Kwon commented on SPARK-17636:
--

Yea, the Spark implementation currently does not push down filters for nested 
fields, and I remember this was confirmed by committers. I will leave another 
comment after searching for related JIRAs and GitHub comments as references to 
make sure.

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* entry for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a limitation of Parquet itself, or 
> only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {quote} 
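
For reference, a quick way to check this on your own data (a sketch; the path and schema below are placeholders, and {{spark}} is the usual shell SparkSession) is to compare the physical plans and look for the PushedFilters entry:

{code}
// Placeholders: a Parquet file at /tmp/sales.parquet with sale_id: bigint and
// day_timestamp: struct<timestamp: bigint>; "spark" is the shell's SparkSession.
val df = spark.read.parquet("/tmp/sales.parquet")

df.filter("sale_id > 4").explain()                  // expect PushedFilters: [GreaterThan(sale_id,4)]
df.filter("day_timestamp.timestamp > 4").explain()  // nested field: no PushedFilters entry today
{code}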



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry

2016-09-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17640:


 Summary: Avoid using -1 as the default batchId for 
FileStreamSource.FileEntry
 Key: SPARK-17640
 URL: https://issues.apache.org/jira/browse/SPARK-17640
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514850#comment-15514850
 ] 

Josh Rosen commented on SPARK-12945:


In general, the {{JobProgressListener}} should not be throwing exceptions. Even 
if those exceptions don't cause job failures or errors, it's possible that they 
indicate a bug which could impact the consistency of the metrics displayed on 
the Spark UI.

It would be helpful to know if there's an easy way to reliably reproduce this 
exception so that we can fix it.

> ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> --
>
> Key: SPARK-12945
> URL: https://issues.apache.org/jira/browse/SPARK-12945
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Linux, yarn-client
>Reporter: Tristan
>Priority: Minor
>
> Seeing this a lot; not sure if it is a problem or a spurious error (I recall 
> this was an ignorable issue in a previous version). The UI seems to be working 
> fine:
> ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> java.lang.NullPointerException
> at 
> org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
> at 
> org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at 
> org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
> at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-09-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514829#comment-15514829
 ] 

Joseph K. Bradley commented on SPARK-16240:
---

I merged this for 2.1 and sent a backport for 2.0: 
[https://github.com/apache/spark/pull/15205]

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models as one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-09-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16240:
--
Fix Version/s: 2.1.0

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models as one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514828#comment-15514828
 ] 

Apache Spark commented on SPARK-16240:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/15205

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 
> models as one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17639) Need to add jce.jar to bootclasspath

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17639:


Assignee: (was: Apache Spark)

> Need to add jce.jar to bootclasspath
> 
>
> Key: SPARK-17639
> URL: https://issues.apache.org/jira/browse/SPARK-17639
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The following PR is failing because jce.jar is missing from the classpath 
> when compiling the code:
> https://github.com/apache/spark/pull/15172



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17639) Need to add jce.jar to bootclasspath

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514719#comment-15514719
 ] 

Apache Spark commented on SPARK-17639:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15204

> Need to add jce.jar to bootclasspath
> 
>
> Key: SPARK-17639
> URL: https://issues.apache.org/jira/browse/SPARK-17639
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> The following PR is failing because jce.jar is missing from the classpath 
> when compiling the code:
> https://github.com/apache/spark/pull/15172



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17639) Need to add jce.jar to bootclasspath

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17639:


Assignee: Apache Spark

> Need to add jce.jar to bootclasspath
> 
>
> Key: SPARK-17639
> URL: https://issues.apache.org/jira/browse/SPARK-17639
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> The following PR is failing because jce.jar is missing from the classpath 
> when compiling the code:
> https://github.com/apache/spark/pull/15172



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17639) Need to add jce.jar to bootclasspath

2016-09-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-17639:
--

 Summary: Need to add jce.jar to bootclasspath
 Key: SPARK-17639
 URL: https://issues.apache.org/jira/browse/SPARK-17639
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin
Priority: Minor


The following PR is failing because jce.jar is missing from the classpath when 
compiling the code:

https://github.com/apache/spark/pull/15172



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514584#comment-15514584
 ] 

Apache Spark commented on SPARK-17569:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15203

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> Structured Streaming's FileSource lists files to classify them as Offsets. 
> Once this file list is committed to a metadata log for a batch, it is turned 
> into a "Batch FileSource" Relation which acts as the source for the 
> incremental execution.
> While this "Batch FileSource" Relation is resolved, we re-check on the Driver 
> that every single file exists. This takes a horrible amount of time and is a 
> total waste. We can simply skip the file existence check during execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17638) Stop JVM StreamingContext when the Python process is dead

2016-09-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17638.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Stop JVM StreamingContext when the Python process is dead
> -
>
> Key: SPARK-17638
> URL: https://issues.apache.org/jira/browse/SPARK-17638
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When the Python process is dead, the JVM StreamingContext is still running. 
> Hence we will see a lot of Py4jException before the JVM process exits. It's 
> better to stop the JVM StreamingContext to avoid those annoying logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17569) Don't recheck existence of files when generating File Relation resolution in StructuredStreaming

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514512#comment-15514512
 ] 

Apache Spark commented on SPARK-17569:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15202

> Don't recheck existence of files when generating File Relation resolution in 
> StructuredStreaming
> 
>
> Key: SPARK-17569
> URL: https://issues.apache.org/jira/browse/SPARK-17569
> Project: Spark
>  Issue Type: Improvement
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> Structured Streaming's FileSource lists files to classify them as Offsets. 
> Once this file list is committed to a metadata log for a batch, it is turned 
> into a "Batch FileSource" Relation which acts as the source for the 
> incremental execution.
> While this "Batch FileSource" Relation is resolved, we re-check on the Driver 
> that every single file exists. This takes a horrible amount of time and is a 
> total waste. We can simply skip the file existence check during execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17599) Folder deletion after globbing may fail StructuredStreaming jobs

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514511#comment-15514511
 ] 

Apache Spark commented on SPARK-17599:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15202

> Folder deletion after globbing may fail StructuredStreaming jobs
> 
>
> Key: SPARK-17599
> URL: https://issues.apache.org/jira/browse/SPARK-17599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> The FileStreamSource used by StructuredStreaming first resolves globs, and 
> then creates a ListingFileCatalog which lists files with the resolved glob 
> patterns. If a folder is deleted after glob resolution but before the 
> ListingFileCatalog can list the files, we can run into a 
> 'FileNotFoundException'.
> This should not be a fatal exception for a streaming job. However, we should 
> include a warning message.
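
A minimal sketch of the proposed behavior (a hypothetical helper, not the actual FileStreamSource change): treat a directory that disappears between glob resolution and listing as a warning rather than a fatal error.

{code}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// List a resolved directory, but tolerate it having been deleted since glob resolution.
def listOrWarn(fs: FileSystem, dir: Path): Seq[FileStatus] =
  try {
    fs.listStatus(dir).toSeq
  } catch {
    case e: FileNotFoundException =>
      println(s"WARN: $dir was deleted after glob resolution; skipping. ($e)")
      Seq.empty
  }
{code}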



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Priority: Minor  (was: Major)

> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>Priority: Minor
>
> There's a *PushedFilters* entry for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a limitation of Parquet itself, or 
> only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17635:

Labels: correctness  (was: )

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: yucai
>  Labels: correctness
>
> "agg_plan" is hardcoded in HashAggregateExec, which has potential issue.
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I faced this issue while working on sort aggregate's codegen support. The code 
> below will trigger the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17613) PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at parent

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17613:
---
Fix Version/s: (was: 2.2.0)
   2.1.0

> PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at 
> parent
> -
>
> Key: SPARK-17613
> URL: https://issues.apache.org/jira/browse/SPARK-17613
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.1, 2.1.0
>
>
> Consider you have a bucket as 
> {code}
> s3a://some-bucket
> {code}
> and under it you have files:
> {code}
> s3a://some-bucket/file1.parquet
> s3a://some-bucket/file2.parquet
> {code}
> Getting the parent path of {code}s3a://some-bucket/file1.parquet{code}
> yields
> {code}s3a://some-bucket/{code}
> and the ListingFileCatalog uses this as the key in the hash map.
> When catalog.allFiles is called, we use {code}s3a://some-bucket{code} (no 
> slash at the end) to get the list of files, and we're left with an empty list!
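
The mismatch is easy to demonstrate in isolation; a hypothetical normalization helper (illustration only, not the actual fix) just needs to make the lookup insensitive to the trailing slash:

{code}
// Illustration only: normalize path strings before using them as keys, so that
// "s3a://some-bucket" and "s3a://some-bucket/" resolve to the same map entry.
def normalizeKey(path: String): String = path.stripSuffix("/")

assert(normalizeKey("s3a://some-bucket/") == normalizeKey("s3a://some-bucket"))
{code}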



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17613) PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at parent

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17613.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.1

Issue resolved by pull request 15169
[https://github.com/apache/spark/pull/15169]

> PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at 
> parent
> -
>
> Key: SPARK-17613
> URL: https://issues.apache.org/jira/browse/SPARK-17613
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.1, 2.2.0
>
>
> Consider you have a bucket as 
> {code}
> s3a://some-bucket
> {code}
> and under it you have files:
> {code}
> s3a://some-bucket/file1.parquet
> s3a://some-bucket/file2.parquet
> {code}
> Getting the parent path of {code}s3a://some-bucket/file1.parquet{code}
> yields
> {code}s3a://some-bucket/{code}
> and the ListingFileCatalog uses this as the key in the hash map.
> When catalog.allFiles is called, we use {code}s3a://some-bucket{code} (no 
> slash at the end) to get the list of files, and we're left with an empty list!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17613) PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at parent

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17613:
---
Assignee: Burak Yavuz

> PartitioningAwareFileCatalog.allFiles doesn't handle URI specified path at 
> parent
> -
>
> Key: SPARK-17613
> URL: https://issues.apache.org/jira/browse/SPARK-17613
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> Consider you have a bucket as 
> {code}
> s3a://some-bucket
> {code}
> and under it you have files:
> {code}
> s3a://some-bucket/file1.parquet
> s3a://some-bucket/file2.parquet
> {code}
> Getting the parent path of {code}s3a://some-bucket/file1.parquet{code}
> yields
> {code}s3a://some-bucket/{code}
> and the ListingFileCatalog uses this as the key in the hash map.
> When catalog.allFiles is called, we use {code}s3a://some-bucket{code} (no 
> slash at the end) to get the list of files, and we're left with an empty list!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2539) ConnectionManager should handle Uncaught Exception

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2539.
---
Resolution: Won't Fix

Resolving as "won't fix" since the ConnectionManager has been removed in Spark 
1.6.x and 2.x.

> ConnectionManager should handle Uncaught Exception
> --
>
> Key: SPARK-2539
> URL: https://issues.apache.org/jira/browse/SPARK-2539
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Kousuke Saruta
>
> ConnectionManager uses a ThreadPool, and worker threads run on threads spawned 
> by that pool.
> In the current implementation, if an uncaught exception is thrown from one of 
> those threads, nobody handles it.
> If the exception is fatal, it should be handled and the executor should exit.
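
Although ConnectionManager itself is gone in newer versions, the pattern the ticket asks for looks roughly like the generic sketch below (illustration only, not Spark code): install an UncaughtExceptionHandler on the pool's threads so fatal errors are escalated instead of silently dropped.

{code}
import java.util.concurrent.{Executors, ThreadFactory}

// Generic sketch: a thread factory whose threads escalate fatal uncaught errors.
val handlerFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r)
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(th: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in ${th.getName}: $e")
        if (e.isInstanceOf[Error]) System.exit(1)  // fatal: shut the process down
      }
    })
    t
  }
}
val pool = Executors.newFixedThreadPool(4, handlerFactory)
{code}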



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16830) Executors Keep Trying to Fetch Blocks from a Bad Host

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514303#comment-15514303
 ] 

Josh Rosen commented on SPARK-16830:


Do you have stacktraces from the failed block fetches? I'd like to see whether 
this may be fixed by a recent patch of mine which helps to avoid failures if 
all locations of non-shuffle blocks are lost / unavailable.

> Executors Keep Trying to Fetch Blocks from a Bad Host
> -
>
> Key: SPARK-16830
> URL: https://issues.apache.org/jira/browse/SPARK-16830
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.2
> Environment: EMR 4.7.2
>Reporter: Renxia Wang
>
> When a host becomes unreachable, the driver removes the executors and block 
> managers on that host because it no longer receives heartbeats. However, 
> executors on other hosts still keep trying to fetch blocks from the bad 
> host. 
> I am running a Spark Streaming job to consume data from Kinesis. As a result 
> of this block fetch retrying and failing, I started seeing 
> ProvisionedThroughputExceededException on shards, AmazonHttpClient (to 
> Kinesis) SocketException, Kinesis ExpiredIteratorException etc. 
> This issue also exposes a potential memory leak. Starting from the time that 
> the bad host became unreachable, the physical memory usage of the executors 
> that kept trying to fetch blocks from the bad host kept increasing until it 
> hit the physical memory limit and the executors were killed by YARN. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17638) Stop JVM StreamingContext when the Python process is dead

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17638:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Stop JVM StreamingContext when the Python process is dead
> -
>
> Key: SPARK-17638
> URL: https://issues.apache.org/jira/browse/SPARK-17638
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> When the Python process is dead, the JVM StreamingContext is still running. 
> Hence we will see a lot of Py4jException before the JVM process exits. It's 
> better to stop the JVM StreamingContext to avoid those annoying logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17638) Stop JVM StreamingContext when the Python process is dead

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17638:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Stop JVM StreamingContext when the Python process is dead
> -
>
> Key: SPARK-17638
> URL: https://issues.apache.org/jira/browse/SPARK-17638
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> When the Python process is dead, the JVM StreamingContext is still running. 
> Hence we will see a lot of Py4jException before the JVM process exits. It's 
> better to stop the JVM StreamingContext to avoid those annoying logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17638) Stop JVM StreamingContext when the Python process is dead

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514280#comment-15514280
 ] 

Apache Spark commented on SPARK-17638:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15201

> Stop JVM StreamingContext when the Python process is dead
> -
>
> Key: SPARK-17638
> URL: https://issues.apache.org/jira/browse/SPARK-17638
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> When the Python process is dead, the JVM StreamingContext is still running. 
> Hence we will see a lot of Py4jException before the JVM process exits. It's 
> better to stop the JVM StreamingContext to avoid those annoying logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17638) Stop JVM StreamingContext when the Python process is dead

2016-09-22 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-17638:


 Summary: Stop JVM StreamingContext when the Python process is dead
 Key: SPARK-17638
 URL: https://issues.apache.org/jira/browse/SPARK-17638
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor


When the Python process is dead, the JVM StreamingContext is still running. 
Hence we will see a lot of Py4jException before the JVM process exits. It's 
better to stop the JVM StreamingContext to avoid those annoying logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-22 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514246#comment-15514246
 ] 

Suresh Thalamati commented on SPARK-14536:
--

Yes. SPARK-10186 already added array support for Postgres, and the PR 
(https://github.com/apache/spark/pull/15192) I submitted in this JIRA will 
address the NPE issue for null values.

SPARK-8500, going by its title (Support for array types in JDBCRDD), sounds more 
generic than specific to Postgres, although the repro given is for Postgres.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return an non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.
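
A rough sketch of the null check being proposed (an illustrative helper, not the actual JDBCRDD/JdbcUtils code): guard against drivers that return null from {{ResultSet.getArray}}.

{code}
import java.sql.ResultSet

// Read an array column defensively: PostgreSQL returns null from getArray for SQL NULL.
def readArrayColumn(rs: ResultSet, pos: Int): Option[Array[AnyRef]] = {
  val arr = rs.getArray(pos)
  if (arr == null || rs.wasNull()) None
  else Option(arr.getArray.asInstanceOf[Array[AnyRef]])
}
{code}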



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8790) BlockManager.reregister cause OOM

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8790.
---
Resolution: Cannot Reproduce

I'm going to resolve this issue as "Cannot Reproduce" since it's over a year 
old and appears to only have been reported for Spark 1.2.x. If this is still an 
issue in newer Spark versions then please file a new JIRA ticket. Thanks!

> BlockManager.reregister cause OOM
> -
>
> Key: SPARK-8790
> URL: https://issues.apache.org/jira/browse/SPARK-8790
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Patrick Liu
> Attachments: driver.log, executor.log, webui-executor.png, 
> webui-slow-task.png
>
>
> We run Spark SQL 1.2.1 on YARN.
> A SQL query consists of 100 tasks; most of them finish in < 10s, but one lasts 
> for 16m.
> The web UI shows that the executor had been running GC for 15m before the OOM.
> The log shows that the executor first tries to connect to the master to report 
> a broadcast value, but the network is not available, so the executor loses its 
> heartbeat to the master. 
> Then the master requires the executor to reregister. While the executor is 
> reporting all blocks (reportAllBlocks) to the master, the network is still 
> unstable and the reports sometimes time out.
> Finally, the executor OOMs.
> Please take a look.
> Attached is the detailed log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514134#comment-15514134
 ] 

Sean Owen commented on SPARK-14536:
---

Yes, sounds like we're both saying this is tied up in the resolution to 
SPARK-8500, as it pertains to array types in Postgres. Is there a separate 
resolution / change you could imagine for the two?

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return an non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-22 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514129#comment-15514129
 ] 

Suresh Thalamati commented on SPARK-14536:
--

[~sowen] I am not sure why this issue got closed as a duplicate after I 
reopened it. Based on the test case I tried on master, this issue does not look 
like a duplicate to me, as I mentioned in my previous comment when I reopened 
the issue. The array data type is supported for Postgres.

Repro :
On postgresdb :
create table spark_array(a int , b text[])
insert into spark_array values(1 , null)
insert into spark_array values(1 , '{"AA", "BB"}')

val psqlProps = new java.util.Properties()
psqlProps.setProperty("user" , "user")
psqlProps.setProperty("password" , "password")

-- works fine
spark.read.jdbc("jdbc:postgresql://localhost:5432/pdb", "(select * from spark_array where b is not null) as a ", psqlProps).show()

-- fails with the error below
spark.read.jdbc("jdbc:postgresql://localhost:5432/pdb", "spark_array", psqlProps).show()

Stack :
16/09/21 11:49:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost): java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:442)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:440)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:301)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:283)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)


> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return an non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17453) Broadcast block already exists in MemoryStore

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514123#comment-15514123
 ] 

Josh Rosen commented on SPARK-17453:


Actually, upon closer inspection I do not believe that this is due to a race 
because all TorrentBroadcast fetches are synchronized on the TorrentBroadcast 
object (so only one TorrentBroadcast instance can be in the middle of a read at 
any given moment).

Instead, I think that this might be a symptom of error-cleanup logic from an 
earlier block manager put failure, which is why I think we'll need more logs to 
debug.

Once we release 2.0.1, it would be great to confirm whether this issue has been 
fixed. Is this deterministically reproducible?

> Broadcast block already exists in MemoryStore
> -
>
> Key: SPARK-17453
> URL: https://issues.apache.org/jira/browse/SPARK-17453
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0
>Reporter: Chris Bannister
>
> Whilst doing a broadcast join we reliably hit this exception; the code worked 
> earlier on the 2.0.0 branch before release, and in 1.6. The data for the join 
> comes from another RDD which is collected to a Set and then broadcast. 
> This is run in a Mesos cluster.
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: java.lang.IllegalArgumentException: 
> requirement failed: Block broadcast_17_piece0 is already present in the 
> MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
> at 
> 

[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514105#comment-15514105
 ] 

Zhan Zhang commented on SPARK-17637:


The plan is to introduce a new configuration so that different scheduling 
algorithms can be used for the task scheduling.

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks 
> across executors. This is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-09-22 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-17637:
--

 Summary: Packed scheduling for Spark tasks across executors
 Key: SPARK-17637
 URL: https://issues.apache.org/jira/browse/SPARK-17637
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Zhan Zhang
Priority: Minor


Currently the Spark scheduler implements round-robin scheduling of tasks across 
executors. This is great as it distributes the load evenly across the cluster, 
but it leads to significant resource waste in some cases, especially when 
dynamic allocation is enabled.
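
To make the contrast concrete, a toy sketch of the two placement strategies (names and types are illustrative, not Spark's scheduler API): round-robin spreads tasks across all executors, while a packed strategy fills one executor before touching the next, leaving whole executors idle so dynamic allocation can release them.

{code}
case class Exec(id: String, freeCores: Int)

// Packed placement: fill each executor to capacity before moving on.
def packed(numTasks: Int, execs: Seq[Exec]): Map[String, Int] = {
  var remaining = numTasks
  execs.map { e =>
    val take = math.min(e.freeCores, remaining)
    remaining -= take
    e.id -> take
  }.toMap
}

// Round-robin placement: hand out tasks one at a time across executors.
def roundRobin(numTasks: Int, execs: Seq[Exec]): Map[String, Int] =
  (0 until numTasks).map(i => execs(i % execs.size).id)
    .groupBy(identity).mapValues(_.size).toMap

// packed(4, Seq(Exec("e1", 4), Exec("e2", 4)))      => Map(e1 -> 4, e2 -> 0)
// roundRobin(4, Seq(Exec("e1", 4), Exec("e2", 4)))  => Map(e1 -> 2, e2 -> 2)
{code}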



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17453) Broadcast block already exists in MemoryStore

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514091#comment-15514091
 ] 

Josh Rosen commented on SPARK-17453:


There might be a race-condition here which could occur if multiple tasks 
running on the same executor try to fetch the same broadcast variable piece. Do 
you happen to have any executor logs from the executor with the {{ Block 
broadcast_17_piece0 is already present in the MemoryStore}} error?

> Broadcast block already exists in MemoryStore
> -
>
> Key: SPARK-17453
> URL: https://issues.apache.org/jira/browse/SPARK-17453
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0
>Reporter: Chris Bannister
>
> Whilst doing a broadcast join we reliably hit this exception; the code worked 
> earlier on the 2.0.0 branch before release, and in 1.6. The data for the join 
> comes from another RDD which is collected to a Set and then broadcast. 
> This is run in a Mesos cluster.
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.io.IOException: java.lang.IllegalArgumentException: 
> requirement failed: Block broadcast_17_piece0 is already present in the 
> MemoryStore
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1260)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:72)
> at 
> 

[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)

2016-09-22 Thread Aris Vlasakakis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514085#comment-15514085
 ] 

Aris Vlasakakis commented on SPARK-17131:
-

Hi there,

I discovered a bug that also pertains to code generation with many columns -- 
although in my case the Janino code generation failures in Catalyst start after 
several hundred columns. Are these somehow related?

My bug report was merged into this one: 
[https://issues.apache.org/jira/browse/SPARK-16845]
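
In case it helps, this is roughly how I stress it on a synthetic wide DataFrame 
(the column count and the session setup here are just assumptions for the 
sketch, not the exact production job):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("wide-df-codegen").getOrCreate()

// Build a DataFrame with a few thousand numeric columns.
val numCols = 2000
val base = spark.range(100).toDF("id")
val df = base.select((1 to numCols).map(i => (col("id") * i).as(s"c$i")): _*)

// A plain select over aliased columns works fine...
val allCols = df.columns.map(c => col(c).as(c + "_alias"))
df.select(allCols: _*).show(1)

// ...whereas describe() over all columns is where codegen blows up for me
// once the width gets into the hundreds or thousands of columns.
df.describe(df.columns: _*).show(1)
{code}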

> Code generation fails when running SQL expressions against a wide dataset 
> (thousands of columns)
> 
>
> Key: SPARK-17131
> URL: https://issues.apache.org/jira/browse/SPARK-17131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to 
> generate the code with the message:
> {noformat}
> Constant pool has grown past JVM limit of 0x
> {noformat}
> When running a common select with all columns it's fine:
> {code}
>   val allCols = df.columns.map(c => col(c).as(c + "_alias"))
>   val newDf = df.select(allCols: _*)
>   newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has 
> grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
>   at 
> org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
>   at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
>   at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
>   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
>   at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14209) Application failure during preemption.

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14209.

   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
>Assignee: Josh Rosen
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14209) Application failure during preemption.

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514054#comment-15514054
 ] 

Josh Rosen edited comment on SPARK-14209 at 9/22/16 6:17 PM:
-

I have backported SPARK-17485 to Spark 1.6.x (for inclusion in Spark 1.6.3), so 
I believe that this issue should be fixed and therefore I'm going to resolve it.

I believe that this ticket may actually be discussing multiple issues that are 
related to fetch failures following executor loss but which have different 
underlying causes and fixes. I think the original issue reported in this JIRA 
relates to BlockFetchException, an error which occurs due to failed fetches of 
NON-shuffle blocks (such as broadcasts or cached RDD blocks).

If you're a user and are still experiencing Spark application failures due to 
executor pre-emption then please file a new JIRA ticket and make sure to 
include the Spark version and the portion of the driver log which contains the 
job failure log message (since that will show which exception / stack trace 
ultimately triggered the failure, allowing us to distinguish shuffle block 
fetch failures vs. other types of fetch failures). 


was (Author: joshrosen):
I have backported SPARK-17485 to Spark 1.6.x (for inclusion in Spark 1.6.3), so 
I believe that this issue should be fixed and therefore I'm going to resolve it.

I believe that this ticket may actually be discussing multiple issues that are 
related to fetch failures following executor loss but which have different 
underlying causes and fixes. Marcelo has pointed out several patches which 
affect fetching of shuffle blocks, whereas I think the original issue reported 
in this JIRA relates to BlockFetchException, an error which occurs due to 
failed fetches of NON-shuffle blocks (such as broadcasts or cached RDD blocks).

If you're a user and are still experiencing Spark application failures due to 
executor pre-emption then please file a new JIRA ticket and make sure to 
include the Spark version and the portion of the driver log which contains the 
job failure log message (since that will show which exception / stack trace 
ultimately triggered the failure, allowing us to distinguish shuffle block 
fetch failures vs. other types of fetch failures). 

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14209) Application failure during preemption.

2016-09-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514054#comment-15514054
 ] 

Josh Rosen commented on SPARK-14209:


I have backported SPARK-17485 to Spark 1.6.x (for inclusion in Spark 1.6.3), so 
I believe that this issue should be fixed and therefore I'm going to resolve it.

I believe that this ticket may actually be discussing multiple issues that are 
related to fetch failures following executor loss but which have different 
underlying causes and fixes. Marcelo has pointed out several patches which 
affect fetching of shuffle blocks, whereas I think the original issue reported 
in this JIRA relates to BlockFetchException, an error which occurs due to 
failed fetches of NON-shuffle blocks (such as broadcasts or cached RDD blocks).

If you're a user and are still experiencing Spark application failures due to 
executor pre-emption then please file a new JIRA ticket and make sure to 
include the Spark version and the portion of the driver log which contains the 
job failure log message (since that will show which exception / stack trace 
ultimately triggered the failure, allowing us to distinguish shuffle block 
fetch failures vs. other types of fetch failures). 
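
For readers following along, the fallback order that SPARK-17485 changes, as a 
rough sketch against a made-up interface (this is not Spark's BlockManager 
code): after the fix, a failed remote read of a cached block is treated as a 
miss and falls through to recomputation instead of failing the job.

{code}
// Read order for a cached block: local copy, then remote copy, then recompute.
def getOrCompute[T](readLocal: () => Option[T],
                    readRemote: () => Option[T],
                    compute: () => T): T = {
  readLocal().getOrElse {
    val remote =
      try readRemote()
      catch { case _: java.io.IOException => None } // failed fetch == cache miss
    remote.getOrElse(compute())                     // recompute rather than abort
  }
}
{code}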

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17485) Failed remote cached block reads can lead to whole job failure

2016-09-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17485.

   Resolution: Fixed
Fix Version/s: 1.6.3

Issue resolved by pull request 15186
[https://github.com/apache/spark/pull/15186]

> Failed remote cached block reads can lead to whole job failure
> --
>
> Key: SPARK-17485
> URL: https://issues.apache.org/jira/browse/SPARK-17485
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> In Spark's RDD.getOrCompute we first try to read a local copy of a cached 
> block, then a remote copy, and only fall back to recomputing the block if no 
> cached copy (local or remote) can be read. This logic works correctly in the 
> case where no remote copies of the block exist, but if there _are_ remote 
> copies but reads of those copies fail (due to network issues or internal 
> Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error 
> that fails the entire job.
> In the case of torrent broadcast we really _do_ want to fail the entire job 
> in case no remote blocks can be fetched, but this logic is inappropriate for 
> cached blocks because those can/should be recomputed.
> Therefore, I think that this exception should be thrown higher up the call 
> stack by the BlockManager client code and not the block manager itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17601) SparkSQL vectorization cannot handle schema evolution for parquet tables when parquet files use Int whereas DataFrame uses Long

2016-09-22 Thread Gang Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514019#comment-15514019
 ] 

Gang Wu commented on SPARK-17601:
-

[~hyukjin.kwon] Yes, I agree. I just created these JIRAs for issues we hit in 
production. There can definitely be more issues for ORC, Parquet, etc. Schema 
evolution is always painful to tackle. It seems you are working on this. Do you 
mind sharing a bit more about your plan there? I'd like to know. Thanks!
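
Not a fix, just an assumption-laden workaround sketch from my side: if the 
non-vectorized Parquet path in your build handles the int-to-long promotion, 
disabling the vectorized reader may unblock the query at some cost in scan 
performance. The table name below is illustrative.

{code}
// Workaround sketch: turn off the vectorized Parquet reader for this session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM evolved_table LIMIT 10").show()
{code}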

> SparkSQL vectorization cannot handle schema evolution for parquet tables when 
> parquet files use Int whereas DataFrame uses Long
> ---
>
> Key: SPARK-17601
> URL: https://issues.apache.org/jira/browse/SPARK-17601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> This is a JIRA related to SPARK-17477.
> This happens when using SparkSession to read a Hive table that is stored as 
> parquet files, after a column has gone through a schema evolution from int to 
> long. Some old parquet files use int for the column while newer parquet files 
> use long; in the Hive metastore the type is long (bigint). If we use 
> vectorization in SparkSQL then we get the following exception:
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:526)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:486)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:495)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:272)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   

[jira] [Resolved] (SPARK-17365) Kill multiple executors together to reduce lock contention

2016-09-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17365.

   Resolution: Fixed
 Assignee: Dhruve Ashar
Fix Version/s: 2.1.0

> Kill multiple executors together to reduce lock contention
> --
>
> Key: SPARK-17365
> URL: https://issues.apache.org/jira/browse/SPARK-17365
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Dhruve Ashar
>Assignee: Dhruve Ashar
> Fix For: 2.1.0
>
>
> To regulate pending and running executors, we determine the executors that 
> are eligible to be killed and then kill them one at a time in a loop. Each 
> kill issues a synchronized RPC call, leading to lock contention on the 
> SparkListenerBus. 
> Side effect - the listener bus is blocked while we iteratively remove 
> executors.
>  
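
A minimal sketch of the batched approach (the executor IDs are illustrative; 
{{SparkContext.killExecutors}} takes the whole list in one call):

{code}
// Executors selected for removal by the allocation logic (illustrative IDs).
val idle: Seq[String] = Seq("12", "17", "23")

// Before: one synchronized RPC per executor, each contending on the listener bus.
// idle.foreach(id => sc.killExecutor(id))

// After: a single request for the whole batch.
sc.killExecutors(idle)
{code}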



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Description: 
There's a `PushedFilters` for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 


  was:
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>
> There's a `PushedFilters` for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Description: 
There's a *PushedFilters* for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 


  was:
There's a `PushedFilters` for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Description: 
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3a://some/parquet/file
{quote} 


  was:
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>
> The filter gets pushed down for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17634) Spark job hangs when using dapply

2016-09-22 Thread Thomas Powell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513673#comment-15513673
 ] 

Thomas Powell edited comment on SPARK-17634 at 9/22/16 4:35 PM:


I also ran this on a row-limited version of this dataset. The first 22 tasks 
completed immediately (since there was no data to process). The final task 
recorded 13MB of shuffle read (out of ~300MB of shuffle write in the previous 
stage), at which point it stalled.


was (Author: iamthomaspowell):
I also ran this on a row-limited version of this dataset. The first 22 tasks 
completed immediately (since there was no data to process). The final task 
recorded 32MB of shuffle read (out of ~300MB of shuffle write in the previous 
stage), at which point it stalled.

> Spark job hangs when using dapply
> -
>
> Key: SPARK-17634
> URL: https://issues.apache.org/jira/browse/SPARK-17634
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>Priority: Critical
>
> I'm running into an issue when using dapply on YARN. I have a data frame 
> backed by around 200 Parquet files totalling around 2GB. When I load this 
> with the new partition coalescing it ends up with around 20 partitions, so 
> each one is roughly 100MB. The data frame itself has 4 columns of integers 
> and doubles. If I run a count over this, things work fine.
> However, if I add a {{dapply}} between the read and the {{count}} that just 
> uses an identity function, the tasks hang and make no progress. Both the R 
> and Java processes are running on the Spark nodes and are listening on the 
> {{SPARKR_WORKER_PORT}}.
> {{result <- dapply(df, function(x){x}, SparkR::schema(df))}}
> I took a jstack of the Java process and see that it is just listening on the 
> socket and never seems to make any progress. What the R process is doing is 
> harder to debug.
> {code}
> Thread 112823: (state = IN_NATIVE)
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], 
> int, int, int) @bci=0 (Interpreted frame)
>  - java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, 
> int, int) @bci=8, line=116 (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=170 
> (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 
> (Interpreted frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Interpreted frame)
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>  - org.apache.spark.api.r.RRunner.org$apache$spark$api$r$RRunner$$read() 
> @bci=4, line=212 (Interpreted frame)
>  - 
> org.apache.spark.api.r.RRunner$$anon$1.(org.apache.spark.api.r.RRunner) 
> @bci=25, line=96 (Interpreted frame)
>  - org.apache.spark.api.r.RRunner.compute(scala.collection.Iterator, int) 
> @bci=109, line=87 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(scala.collection.Iterator)
>  @bci=322, line=59 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(java.lang.Object)
>  @bci=5, line=29 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(scala.collection.Iterator)
>  @bci=59, line=178 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(java.lang.Object)
>  @bci=5, line=175 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(org.apache.spark.TaskContext,
>  int, scala.collection.Iterator) @bci=8, line=784 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(java.lang.Object,
>  java.lang.Object, java.lang.Object) @bci=13, line=784 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=27, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted 

[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Description: 
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")

res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan

res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan

res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
{quote} 


  was:
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
{quote} 



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>
> The filter gets pushed down for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
>  PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-17636:
---
Description: 
The filter gets pushed down for a simple numeric field, but not for a numeric 
field inside a struct. Not sure if this is a Spark limitation because of 
Parquet, or only a Spark limitation.

{quote} 
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
{quote} 


  was:
scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933



> Parquet filter push down doesn't handle struct fields
> -
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Mitesh
>
> The filter gets pushed down for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {quote} 
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
>  PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
> {quote} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17636) Parquet filter push down doesn't handle struct fields

2016-09-22 Thread Mitesh (JIRA)
Mitesh created SPARK-17636:
--

 Summary: Parquet filter push down doesn't handle struct fields
 Key: SPARK-17636
 URL: https://issues.apache.org/jira/browse/SPARK-17636
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.6.2
Reporter: Mitesh


scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
"sale_id")
res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
struct, sale_id: bigint]

scala> res5.filter("sale_id > 4").queryExecution.executedPlan
res9: org.apache.spark.sql.execution.SparkPlan =
Filter[23814] [args=(sale_id#86324L > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933,
 PushedFilters: [GreaterThan(sale_id,4)]

scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
Filter[23815] [args=(day_timestamp#86302.timestamp > 
4)][outPart=UnknownPartitioning(0)][outOrder=List()]
+- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
s3n://aiq.data.pipeline.prod-restart/gilt/fact_clickstream_page_views/annotated/20150917_123800634_zzz_937571e7_0161_48e4_8c8c_7ce5600ae933
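
A self-contained way to check this (a sketch only: {{spark}} is a 2.x session, 
so on 1.6.2 substitute the HiveContext used above, and the struct layout below 
is a guess for illustration, not the real schema):

{code}
// Write a tiny Parquet file with a struct column, then compare which filters
// get pushed down for a top-level field vs. a field inside the struct.
import spark.implicits._

case class TS(timestamp: Long, timezone: String)   // illustrative struct layout
case class Rec(day_timestamp: TS, sale_id: Long)

Seq(Rec(TS(5L, "UTC"), 10L)).toDF
  .write.mode("overwrite").parquet("/tmp/struct_pushdown")

val df = spark.read.parquet("/tmp/struct_pushdown")

// Top-level column: the scan shows PushedFilters: [GreaterThan(sale_id,4)].
df.filter("sale_id > 4").explain()

// Struct field: no PushedFilters entry appears for the nested column.
df.filter("day_timestamp.timestamp > 4").explain()
{code}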




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17634) Spark job hangs when using dapply

2016-09-22 Thread Thomas Powell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513673#comment-15513673
 ] 

Thomas Powell commented on SPARK-17634:
---

I also ran this on a row-limited version of this dataset. The first 22 tasks 
completed immediately (since there was no data to process). The final task 
recorded 32MB of shuffle read (out of ~300MB of shuffle write in the previous 
stage), at which point it stalled.

> Spark job hangs when using dapply
> -
>
> Key: SPARK-17634
> URL: https://issues.apache.org/jira/browse/SPARK-17634
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Thomas Powell
>Priority: Critical
>
> I'm running into an issue when using dapply on YARN. I have a data frame 
> backed by around 200 Parquet files totalling around 2GB. When I load this 
> with the new partition coalescing it ends up with around 20 partitions, so 
> each one is roughly 100MB. The data frame itself has 4 columns of integers 
> and doubles. If I run a count over this, things work fine.
> However, if I add a {{dapply}} between the read and the {{count}} that just 
> uses an identity function, the tasks hang and make no progress. Both the R 
> and Java processes are running on the Spark nodes and are listening on the 
> {{SPARKR_WORKER_PORT}}.
> {{result <- dapply(df, function(x){x}, SparkR::schema(df))}}
> I took a jstack of the Java process and see that it is just listening on the 
> socket and never seems to make any progress. What the R process is doing is 
> harder to debug.
> {code}
> Thread 112823: (state = IN_NATIVE)
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], 
> int, int, int) @bci=0 (Interpreted frame)
>  - java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, 
> int, int) @bci=8, line=116 (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=170 
> (Interpreted frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 
> (Interpreted frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Interpreted frame)
>  - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)
>  - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
>  - org.apache.spark.api.r.RRunner.org$apache$spark$api$r$RRunner$$read() 
> @bci=4, line=212 (Interpreted frame)
>  - 
> org.apache.spark.api.r.RRunner$$anon$1.(org.apache.spark.api.r.RRunner) 
> @bci=25, line=96 (Interpreted frame)
>  - org.apache.spark.api.r.RRunner.compute(scala.collection.Iterator, int) 
> @bci=109, line=87 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(scala.collection.Iterator)
>  @bci=322, line=59 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(java.lang.Object)
>  @bci=5, line=29 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(scala.collection.Iterator)
>  @bci=59, line=178 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(java.lang.Object)
>  @bci=5, line=175 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(org.apache.spark.TaskContext,
>  int, scala.collection.Iterator) @bci=8, line=784 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(java.lang.Object,
>  java.lang.Object, java.lang.Object) @bci=13, line=784 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=27, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
>  - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
>  - 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
> org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
>  - 

[jira] [Closed] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-22 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he closed SPARK-17622.
-
Resolution: Not A Bug

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
>
> Under Spark 2.0.0 on Windows, when I try to load or create data with code 
> similar to the lines below, I get an error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> #at org.apache.spark.sql.hive.HiveSharedSt
> However, under Spark 1.6.1 or 1.6.2, running the equivalent calls causes no 
> problem:
> |sc1 <- sparkR.init(master = "local", sparkEnvir = 
> list(spark.driver.memory="2g"))|
> |sqlContext <- sparkRSQL.init(sc1)|
> |df <- as.DataFrame(sqlContext, faithful)|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-22 Thread renzhi he (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513639#comment-15513639
 ] 

renzhi he commented on SPARK-17622:
---

Yes, I did not build Hive or the other Hadoop-related modules.
I am still learning how these pieces fit together.
I think this issue can be closed ;)
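
For anyone hitting the same stack trace: the failure happens while the Hive 
metastore client is created (see IsolatedClientLoader/HiveUtils in the trace 
below). A minimal sketch of the idea on the Scala side, assuming Hive support is 
not actually needed; in SparkR 2.0 the analogous switch should be the 
{{enableHiveSupport}} argument of {{sparkR.session}}.
{code}
// Sketch only (not from the report): create the session with the in-memory
// catalog so that the Hive client path shown in the stack trace is never taken.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .config("spark.sql.catalogImplementation", "in-memory")  // skip the Hive metastore client
  .getOrCreate()

// Roughly the Scala counterpart of the as.DataFrame(faithful) repro above.
val df = spark.createDataFrame(Seq((3.6, 79.0), (1.8, 54.0))).toDF("eruptions", "waiting")
df.count()
{code}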

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
>
> Under Spark 2.0.0 on Windows, when I try to load or create data with code 
> similar to the lines below, I get an error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> #at org.apache.spark.sql.hive.HiveSharedSt
> However, under Spark 1.6.1 or 1.6.2, running the equivalent calls causes no 
> problem:
> |sc1 <- sparkR.init(master = "local", sparkEnvir = 
> list(spark.driver.memory="2g"))|
> |sqlContext <- sparkRSQL.init(sc1)|
> |df <- as.DataFrame(sqlContext, faithful)|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17635:


Assignee: Apache Spark

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: yucai
>Assignee: Apache Spark
>
> "agg_plan" is hardcoded in HashAggregateExec, which is fragile: the generated 
> variable name is not guaranteed to be "agg_plan".
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I hit this issue while working on codegen support for sort-based aggregation. 
> The following change triggers the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17635:


Assignee: (was: Apache Spark)

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: yucai
>
> "agg_plan" is hardcoded in HashAggregateExec, which is fragile: the generated 
> variable name is not guaranteed to be "agg_plan".
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I hit this issue while working on codegen support for sort-based aggregation. 
> The following change triggers the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513504#comment-15513504
 ] 

Apache Spark commented on SPARK-17635:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/15199

> Remove hardcode "agg_plan" in HashAggregateExec
> ---
>
> Key: SPARK-17635
> URL: https://issues.apache.org/jira/browse/SPARK-17635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: yucai
>
> "agg_plan" is hardcoded in HashAggregateExec, which is fragile: the generated 
> variable name is not guaranteed to be "agg_plan".
> {code}
> ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
>   s"$fastHashMapTerm = new $fastHashMapClassName(" +
> s"agg_plan.getTaskMemoryManager(), 
> agg_plan.getEmptyAggregationBuffer());")
> {code}
> I hit this issue while working on codegen support for sort-based aggregation. 
> The following change triggers the bug:
> {code}
>   private def variablePrefix: String = this match {
>  -case _: HashAggregateExec => "agg"   
>  +case _: HashAggregateExec => "hagg"
>  +case _: SortAggregateExec => "sagg"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17635) Remove hardcode "agg_plan" in HashAggregateExec

2016-09-22 Thread yucai (JIRA)
yucai created SPARK-17635:
-

 Summary: Remove hardcode "agg_plan" in HashAggregateExec
 Key: SPARK-17635
 URL: https://issues.apache.org/jira/browse/SPARK-17635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: yucai


"agg_plan" is hardcoded in HashAggregateExec, which is fragile: the generated 
variable name is not guaranteed to be "agg_plan".
{code}
ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
  s"$fastHashMapTerm = new $fastHashMapClassName(" +
s"agg_plan.getTaskMemoryManager(), 
agg_plan.getEmptyAggregationBuffer());")
{code}

I hit this issue while working on codegen support for sort-based aggregation. 
The following change triggers the bug:
{code}
  private def variablePrefix: String = this match {
 -case _: HashAggregateExec => "agg" 
 +case _: HashAggregateExec => "hagg"
 +case _: SortAggregateExec => "sagg"
{code}
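
One way to remove the hardcoded name (a sketch only, reusing {{ctx}}, 
{{fastHashMapClassName}} and {{fastHashMapTerm}} from the snippet above; the 
actual fix in the pull request may differ): ask the CodegenContext for a 
reference to this plan and splice the returned variable name into the generated 
code instead of the literal "agg_plan".
{code}
// Sketch: register this plan with the codegen context and use whatever variable
// name it returns, so the generated code no longer depends on the "agg" prefix.
val thisPlan = ctx.addReferenceObj("plan", this)
ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
  s"$fastHashMapTerm = new $fastHashMapClassName(" +
    s"$thisPlan.getTaskMemoryManager(), $thisPlan.getEmptyAggregationBuffer());")
{code}
With that, renaming the prefix (e.g. to "hagg"/"sagg" as in the diff above) no 
longer breaks the fast hash map initialization.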



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17634) Spark job hangs when using dapply

2016-09-22 Thread Thomas Powell (JIRA)
Thomas Powell created SPARK-17634:
-

 Summary: Spark job hangs when using dapply
 Key: SPARK-17634
 URL: https://issues.apache.org/jira/browse/SPARK-17634
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Thomas Powell
Priority: Critical


I'm running into an issue when using dapply on YARN. I have a data frame backed 
by around 200 Parquet files totalling roughly 2GB. When I load it with the new 
partition coalescing it ends up with around 20 partitions, each roughly 100MB. 
The data frame itself has 4 columns of integers and doubles. If I run a count 
over this, things work fine.

However, if I add a {{dapply}} in between the read and the {{count}} that just 
uses an identity function, the tasks hang and make no progress. Both the R and 
Java processes are running on the Spark nodes and are listening on the 
{{SPARKR_WORKER_PORT}}.

{{result <- dapply(df, function(x){x}, SparkR::schema(df))}}

I took a jstack of the Java process and see that it is just listening on the 
socket but never seems to make any progress. It is harder to tell what the R 
process is doing.
{code}
Thread 112823: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, 
int, int) @bci=0 (Interpreted frame)
 - java.net.SocketInputStream.socketRead(java.io.FileDescriptor, byte[], int, 
int, int) @bci=8, line=116 (Interpreted frame)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=170 
(Interpreted frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=141 
(Interpreted frame)
 - java.io.BufferedInputStream.fill() @bci=214, line=246 (Interpreted frame)
 - java.io.BufferedInputStream.read() @bci=12, line=265 (Compiled frame)
 - java.io.DataInputStream.readInt() @bci=4, line=387 (Compiled frame)
 - org.apache.spark.api.r.RRunner.org$apache$spark$api$r$RRunner$$read() 
@bci=4, line=212 (Interpreted frame)
 - 
org.apache.spark.api.r.RRunner$$anon$1.<init>(org.apache.spark.api.r.RRunner) 
@bci=25, line=96 (Interpreted frame)
 - org.apache.spark.api.r.RRunner.compute(scala.collection.Iterator, int) 
@bci=109, line=87 (Interpreted frame)
 - 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(scala.collection.Iterator)
 @bci=322, line=59 (Interpreted frame)
 - 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(java.lang.Object) 
@bci=5, line=29 (Interpreted frame)
 - 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(scala.collection.Iterator)
 @bci=59, line=178 (Interpreted frame)
 - 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(java.lang.Object)
 @bci=5, line=175 (Interpreted frame)
 - 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(org.apache.spark.TaskContext,
 int, scala.collection.Iterator) @bci=8, line=784 (Interpreted frame)
 - 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(java.lang.Object,
 java.lang.Object, java.lang.Object) @bci=13, line=784 (Interpreted frame)
 - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=27, line=38 (Interpreted frame)
 - org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
 - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
 - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
 - org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
 - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
 - org.apache.spark.rdd.MapPartitionsRDD.compute(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=24, line=38 (Interpreted frame)
 - org.apache.spark.rdd.RDD.computeOrReadCheckpoint(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=26, line=319 (Interpreted frame)
 - org.apache.spark.rdd.RDD.iterator(org.apache.spark.Partition, 
org.apache.spark.TaskContext) @bci=33, line=283 (Interpreted frame)
 - 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=168, line=79 (Interpreted frame)
 - 
org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) 
@bci=2, line=47 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long, int, 
org.apache.spark.metrics.MetricsSystem) @bci=82, line=85 (Interpreted frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=374, line=274 
(Interpreted frame)
 - 

[jira] [Reopened] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-09-22 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos reopened SPARK-15390:
---

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> See [SPARK-15389] for a description of the code which produces this bug.  I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with some memory management error.  Here is the 
> stacktrace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 

[jira] [Commented] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-09-22 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513206#comment-15513206
 ] 

Iulian Dragos commented on SPARK-15390:
---

I'm still seeing a similar stack trace with the 2.0 release.

{code}
scala> res.count
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#286L])
   +- *Project
  +- *BroadcastHashJoin [userId#0L], [selUserId#169L], Inner, BuildRight
 :- *Project [userId#0L]
 :  +- *Filter isnotnull(userId#0L)
 : +- *Scan avro [userId#0L] Format: 
com.databricks.spark.avro.DefaultSource@451b7faf, InputPaths: 
file:/Users/dragos/workspace/consulting/teralytics/11-000.avro, PushedFilters: 
[IsNotNull(userId)], ReadSchema: struct
 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
true]))
+- *GlobalLimit 1
   +- Exchange SinglePartition
  +- *LocalLimit 1
 +- *Project [userId#0L AS selUserId#169L]
+- *Filter isnotnull(userId#0L)
   +- *Scan avro [userId#0L] Format: 
com.databricks.spark.avro.DefaultSource@451b7faf, InputPaths: 
file:/Users/dragos/workspace/consulting/teralytics/11-000.avro, PushedFilters: 
[IsNotNull(userId)], ReadSchema: struct

  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:138)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
  at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2217)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2216)
  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2216)
  ... 50 elided
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
  at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
  at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
  at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
  at 

[jira] [Updated] (SPARK-14909) Spark UI submitted time is wrong

2016-09-22 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S updated SPARK-14909:
---
Attachment: app3.png
app2.png
app1.png
master.png

It sounds like this was already fixed in Spark 2.0. Please check the attached 
screenshots:

Master page
!master.png!

Application 1
 !app1.png!

Application 2
 !app2.png! 

Application 3
!app3.png!

> Spark UI submitted time is wrong
> 
>
> Key: SPARK-14909
> URL: https://issues.apache.org/jira/browse/SPARK-14909
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Christophe
> Attachments: app1.png, app2.png, app3.png, master.png, 
> spark-submission.png, time-spark1.png, time-spark2.png, time-spark3.png
>
>
> There is something wrong with the "submitted time" reported on the main web 
> UI.
> For example, I have jobs submitted every 5 minutes (00, 05, 10, 15, ...).
> Under "Completed applications", I can see my jobs all with a submitted 
> timestamp of the same value: 11:04 AM 26/04/2016.
> But if I click on the individual application and look at the submitted time 
> at the top, I get the expected values, for example: Submit Date: Tue Apr 26 
> 01:05:03 UTC 2016.
> I'll try to attach some screenshots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14909) Spark UI submitted time is wrong

2016-09-22 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S updated SPARK-14909:
---
Attachment: (was: app3.png)

> Spark UI submitted time is wrong
> 
>
> Key: SPARK-14909
> URL: https://issues.apache.org/jira/browse/SPARK-14909
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Christophe
> Attachments: spark-submission.png, time-spark1.png, time-spark2.png, 
> time-spark3.png
>
>
> There is something wrong with the "submitted time" reported on the main web 
> UI.
> For example, I have jobs submitted every 5 minutes (00, 05, 10, 15, ...).
> Under "Completed applications", I can see my jobs all with a submitted 
> timestamp of the same value: 11:04 AM 26/04/2016.
> But if I click on the individual application and look at the submitted time 
> at the top, I get the expected values, for example: Submit Date: Tue Apr 26 
> 01:05:03 UTC 2016.
> I'll try to attach some screenshots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14909) Spark UI submitted time is wrong

2016-09-22 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S updated SPARK-14909:
---
Attachment: (was: master.png)

> Spark UI submitted time is wrong
> 
>
> Key: SPARK-14909
> URL: https://issues.apache.org/jira/browse/SPARK-14909
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Christophe
> Attachments: spark-submission.png, time-spark1.png, time-spark2.png, 
> time-spark3.png
>
>
> There is something wrong with the "submitted time" reported on the main web 
> UI.
> For example, I have jobs submitted every 5 minutes (00, 05, 10, 15, ...).
> Under "Completed applications", I can see my jobs all with a submitted 
> timestamp of the same value: 11:04 AM 26/04/2016.
> But if I click on the individual application and look at the submitted time 
> at the top, I get the expected values, for example: Submit Date: Tue Apr 26 
> 01:05:03 UTC 2016.
> I'll try to attach some screenshots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


