[jira] [Updated] (SPARK-19623) Take rows from DataFrame with empty first partition

2017-02-15 Thread Jaeboo Jung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaeboo Jung updated SPARK-19623:

Description: 
I use Spark 1.6.2 with 1 master and 6 workers. When the first partition of a 
DataFrame is empty, the DataFrame and its underlying RDD behave differently when 
taking rows from it: taking only 1000 rows from the DataFrame causes an OOME, 
while the same take on the RDD is fine.
In detail,
DataFrame without an empty first partition => OK
DataFrame with an empty first partition => OOME
RDD of a DataFrame with an empty first partition => OK
The code below reproduces the error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for (i <- 1 to 100) yield {
  StructField("COL" + i, IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

val df1 = sqlContext.createDataFrame(rdd, schema)
df1.take(1000)     // OK
val df2 = sqlContext.createDataFrame(rdd2, schema)
df2.rdd.take(1000) // OK
df2.take(1000)     // OOME
{code}
I tested this on Spark 1.6.2 with 2 GB of driver memory and 5 GB of executor memory.
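
A minimal workaround sketch based on the observation above (it only sidesteps the symptom by routing the take through the DataFrame's RDD; it is not a fix for the underlying issue):
{code}
// Hedged workaround sketch: take rows via the RDD instead of the DataFrame.
// Assumes the same sc, sqlContext, schema and rdd2 as in the repro above.
val df2 = sqlContext.createDataFrame(rdd2, schema)
val firstRows: Array[Row] = df2.rdd.take(1000)   // OK per the report, unlike df2.take(1000)
// If DataFrame rows are needed afterwards, rebuild a small local DataFrame:
val small = sqlContext.createDataFrame(sc.parallelize(firstRows.toSeq), schema)
{code}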

  was:
I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions 
having a empty first partition, DataFrame and its RDD have different behaviors 
during taking rows from it. If we take only 1000 rows from DataFrame, it causes 
OOME but RDD is OK.
In detail,
DataFrame without a empty first partition => OK
DataFrame with a empty first partition => OOME
RDD of DataFrame with a empty first partition => OK
Codes below reproduce this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(1 to 1,1000).map(i => 
Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for(i <- 1 to 100) yield {
StructField("COL"+i,IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) 
Iterator[Row]() else iter)
val df1 = sqlContext.createDataFrame(rdd,schema)
df1.take(1000) // OK
val df2 = sqlContext.createDataFrame(rdd2,schema)
df2.rdd.take(1000) // OK
df2.take(1000) // OOME
{code}


> Take rows from DataFrame with empty first partition
> ---
>
> Key: SPARK-19623
> URL: https://issues.apache.org/jira/browse/SPARK-19623
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jaeboo Jung
>Priority: Minor
>
> I use Spark 1.6.2 with 1 master and 6 workers. Assuming we have partitions 
> having a empty first partition, DataFrame and its RDD have different 
> behaviors during taking rows from it. If we take only 1000 rows from 
> DataFrame, it causes OOME but RDD is OK.
> In detail,
> DataFrame without a empty first partition => OK
> DataFrame with a empty first partition => OOME
> RDD of DataFrame with a empty first partition => OK
> Codes below reproduce this error.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(1 to 1,1000).map(i => 
> Row.fromSeq(Array.fill(100)(i)))
> val schema = StructType(for(i <- 1 to 100) yield {
> StructField("COL"+i,IntegerType, true)
> })
> val rdd2 = rdd.mapPartitionsWithIndex((idx,iter) => if(idx==0 || idx==1) 
> Iterator[Row]() else iter)
> val df1 = sqlContext.createDataFrame(rdd,schema)
> df1.take(1000) // OK
> val df2 = sqlContext.createDataFrame(rdd2,schema)
> df2.rdd.take(1000) // OK
> df2.take(1000) // OOME
> {code}
> I tested it on Spark 1.6.2 with 2gb of driver memory and 5gb of executor 
> memory.






[jira] [Created] (SPARK-19623) Take rows from DataFrame with empty first partition

2017-02-15 Thread Jaeboo Jung (JIRA)
Jaeboo Jung created SPARK-19623:
---

 Summary: Take rows from DataFrame with empty first partition
 Key: SPARK-19623
 URL: https://issues.apache.org/jira/browse/SPARK-19623
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.2
Reporter: Jaeboo Jung
Priority: Minor


I use Spark 1.6.2 with 1 master and 6 workers. When the first partition of a 
DataFrame is empty, the DataFrame and its underlying RDD behave differently when 
taking rows from it: taking only 1000 rows from the DataFrame causes an OOME, 
while the same take on the RDD is fine.
In detail,
DataFrame without an empty first partition => OK
DataFrame with an empty first partition => OOME
RDD of a DataFrame with an empty first partition => OK
The code below reproduces the error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for (i <- 1 to 100) yield {
  StructField("COL" + i, IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

val df1 = sqlContext.createDataFrame(rdd, schema)
df1.take(1000)     // OK
val df2 = sqlContext.createDataFrame(rdd2, schema)
df2.rdd.take(1000) // OK
df2.take(1000)     // OOME
{code}






[jira] [Assigned] (SPARK-18285) approxQuantile in R support multi-column

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18285:


Assignee: (was: Apache Spark)

> approxQuantile in R support multi-column
> 
>
> Key: SPARK-18285
> URL: https://issues.apache.org/jira/browse/SPARK-18285
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: zhengruifeng
>
> approxQuantile in R should support multi-column.






[jira] [Commented] (SPARK-18285) approxQuantile in R support multi-column

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869400#comment-15869400
 ] 

Apache Spark commented on SPARK-18285:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16951

> approxQuantile in R support multi-column
> 
>
> Key: SPARK-18285
> URL: https://issues.apache.org/jira/browse/SPARK-18285
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: zhengruifeng
>
> approxQuantile in R should support multi-column.






[jira] [Closed] (SPARK-19619) SparkR approxQuantile supports input multiple columns

2017-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-19619.
---
Resolution: Duplicate

> SparkR approxQuantile supports input multiple columns
> -
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile supports input multiple columns.






[jira] [Assigned] (SPARK-18285) approxQuantile in R support multi-column

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18285:


Assignee: Apache Spark

> approxQuantile in R support multi-column
> 
>
> Key: SPARK-18285
> URL: https://issues.apache.org/jira/browse/SPARK-18285
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> approxQuantile in R should support multi-column.






[jira] [Assigned] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19622:


Assignee: Apache Spark

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Apache Spark
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> The search function of paged table is not available because of we don't skip 
> the hash data of the reqeust path. 






[jira] [Assigned] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19622:


Assignee: (was: Apache Spark)

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> The search function of paged table is not available because of we don't skip 
> the hash data of the reqeust path. 






[jira] [Commented] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869389#comment-15869389
 ] 

Apache Spark commented on SPARK-19622:
--

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16953

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> The search function of paged table is not available because of we don't skip 
> the hash data of the reqeust path. 






[jira] [Updated] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-19622:
-
Attachment: screenshot-1.png

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> The search function of paged table is not available because of we don't skip 
> the hash data of the reqeust path. 






[jira] [Created] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-15 Thread StanZhai (JIRA)
StanZhai created SPARK-19622:


 Summary: Fix a http error in a paged table when using a `Go` 
button to search.
 Key: SPARK-19622
 URL: https://issues.apache.org/jira/browse/SPARK-19622
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Minor


The search function of the paged table is not available because we don't skip 
the hash data of the request path.
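
As a generic illustration of the fix direction described above (hypothetical helper and URL, not the actual Spark Web UI code), the paged-table handler would need to drop any hash fragment before rebuilding the request path:
{code}
// Hypothetical illustration only; the function name and URL are assumptions.
def stripHashFragment(path: String): String = {
  val idx = path.indexOf('#')
  if (idx >= 0) path.substring(0, idx) else path
}

stripHashFragment("/jobs/?completedJob.page=2#completed")  // "/jobs/?completedJob.page=2"
{code}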






[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more then one listeners exists

2017-02-15 Thread Eyal Zituny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869368#comment-15869368
 ] 

Eyal Zituny commented on SPARK-19594:
-

That will work, but I will have to remove the "final" modifier from the "postToAll" 
method, which is part of Spark core.

Another option is to change the post(event: StreamingQueryListener.Event) method:

{code}
def post(event: StreamingQueryListener.Event) {
  event match {
    case s: QueryStartedEvent =>
      activeQueryRunIds.synchronized { activeQueryRunIds += s.runId }
      sparkListenerBus.post(s)
      // post to local listeners to trigger callbacks
      postToAll(s)
    case t: QueryTerminatedEvent =>
      // run all the listeners synchronously before removing the id from the list
      postToAll(t)
      activeQueryRunIds.synchronized { activeQueryRunIds -= t.runId }
    case _ =>
      sparkListenerBus.post(event)
  }
}
{code}

> StreamingQueryListener fails to handle QueryTerminatedEvent if more then one 
> listeners exists
> -
>
> Key: SPARK-19594
> URL: https://issues.apache.org/jira/browse/SPARK-19594
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Eyal Zituny
>Priority: Minor
>
> reproduce:
> *create a spark session
> *add multiple streaming query listeners
> *create a simple query
> *stop the query
> result -> only the first listener handle the QueryTerminatedEvent
> this might happen because the query run id is being removed from 
> activeQueryRunIds once the onQueryTerminated is called 
> (StreamingQueryListenerBus:115)
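
A minimal sketch of the reproduce steps quoted above (a sketch under assumptions: local master and the rate source as a stand-in streaming source, which is only available in newer Spark versions; any streaming source would do, and all names are illustrative):
{code}
// Hedged repro sketch for the listener issue described above (illustrative only).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().master("local[2]").appName("listener-repro").getOrCreate()

def namedListener(name: String): StreamingQueryListener = new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"$name received QueryTerminatedEvent for ${event.id}")
}

spark.streams.addListener(namedListener("listener-1"))
spark.streams.addListener(namedListener("listener-2"))

val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
query.stop()  // per the report, only one of the listeners prints the termination event
{code}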






[jira] [Created] (SPARK-19621) R Windows AppVeyor test should run CRAN checks

2017-02-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19621:


 Summary: R Windows AppVeyor test should run CRAN checks
 Key: SPARK-19621
 URL: https://issues.apache.org/jira/browse/SPARK-19621
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung


We should run CRAN checks (see check-cran.sh) even on Windows, since 
cross-platform testing is part of the CRAN release requirements.

check-cran.sh, however, is currently a bash script.






[jira] [Resolved] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19618.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16948
[https://github.com/apache/spark/pull/16948]

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
> Fix For: 2.2.0
>
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}






[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19618:
---

Assignee: Tejas Patil

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.2.0
>
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}






[jira] [Commented] (SPARK-19619) SparkR approxQuantile supports input multiple columns

2017-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869337#comment-15869337
 ] 

Felix Cheung commented on SPARK-19619:
--

dup of SPARK-18285

> SparkR approxQuantile supports input multiple columns
> -
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile supports input multiple columns.






[jira] [Commented] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869336#comment-15869336
 ] 

Apache Spark commented on SPARK-19620:
--

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16952

> Incorrect exchange coordinator Id in physical plan
> --
>
> Key: SPARK-19620
> URL: https://issues.apache.org/jira/browse/SPARK-19620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Priority: Minor
>
> When adaptive execution is enabled, an exchange coordinator is used to in the 
> Exchange operators. For Join, the same exchange coordinator is used for its 
> two Exchanges. But the physical plan shows two different coordinator Ids 
> which is confusing.
> Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>:- *Sort [key1#3L ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>: +- *Project [(id#0L % 500) AS key1#3L]
>:+- *Filter isnotnull((id#0L % 500))
>:   +- *Range (0, 1000, step=1, splits=Some(10))
>+- *Sort [key2#11L ASC NULLS FIRST], false, 0
>   +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
> +- *Filter isnotnull((id#8L % 500))
>+- *Range (0, 1000, step=1, splits=Some(10))
> {code}






[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19620:


Assignee: Apache Spark

> Incorrect exchange coordinator Id in physical plan
> --
>
> Key: SPARK-19620
> URL: https://issues.apache.org/jira/browse/SPARK-19620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Assignee: Apache Spark
>Priority: Minor
>
> When adaptive execution is enabled, an exchange coordinator is used to in the 
> Exchange operators. For Join, the same exchange coordinator is used for its 
> two Exchanges. But the physical plan shows two different coordinator Ids 
> which is confusing.
> Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>:- *Sort [key1#3L ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>: +- *Project [(id#0L % 500) AS key1#3L]
>:+- *Filter isnotnull((id#0L % 500))
>:   +- *Range (0, 1000, step=1, splits=Some(10))
>+- *Sort [key2#11L ASC NULLS FIRST], false, 0
>   +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
> +- *Filter isnotnull((id#8L % 500))
>+- *Range (0, 1000, step=1, splits=Some(10))
> {code}






[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19620:


Assignee: (was: Apache Spark)

> Incorrect exchange coordinator Id in physical plan
> --
>
> Key: SPARK-19620
> URL: https://issues.apache.org/jira/browse/SPARK-19620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Priority: Minor
>
> When adaptive execution is enabled, an exchange coordinator is used to in the 
> Exchange operators. For Join, the same exchange coordinator is used for its 
> two Exchanges. But the physical plan shows two different coordinator Ids 
> which is confusing.
> Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>:- *Sort [key1#3L ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>: +- *Project [(id#0L % 500) AS key1#3L]
>:+- *Filter isnotnull((id#0L % 500))
>:   +- *Range (0, 1000, step=1, splits=Some(10))
>+- *Sort [key2#11L ASC NULLS FIRST], false, 0
>   +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
> +- *Filter isnotnull((id#8L % 500))
>+- *Range (0, 1000, step=1, splits=Some(10))
> {code}






[jira] [Created] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-02-15 Thread Carson Wang (JIRA)
Carson Wang created SPARK-19620:
---

 Summary: Incorrect exchange coordinator Id in physical plan
 Key: SPARK-19620
 URL: https://issues.apache.org/jira/browse/SPARK-19620
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Carson Wang
Priority: Minor


When adaptive execution is enabled, an exchange coordinator is used in the 
Exchange operators. For a Join, the same exchange coordinator is used for its two 
Exchanges, but the physical plan shows two different coordinator IDs, which is 
confusing.

Here is an example:
{code}
== Physical Plan ==
*Project [key1#3L, value2#12L]
+- *SortMergeJoin [key1#3L], [key2#11L], Inner
   :- *Sort [key1#3L ASC NULLS FIRST], false, 0
   :  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
coordinator[target post-shuffle partition size: 67108864]
   : +- *Project [(id#0L % 500) AS key1#3L]
   :+- *Filter isnotnull((id#0L % 500))
   :   +- *Range (0, 1000, step=1, splits=Some(10))
   +- *Sort [key2#11L ASC NULLS FIRST], false, 0
  +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
coordinator[target post-shuffle partition size: 67108864]
 +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
+- *Filter isnotnull((id#8L % 500))
   +- *Range (0, 1000, step=1, splits=Some(10))

{code}
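
A hedged sketch of how a plan like the one above can be produced (assumptions: a SparkSession named spark and the legacy adaptive-execution flag spark.sql.adaptive.enabled; the column names simply mirror the plan above):
{code}
// Illustrative sketch only; mirrors the shapes in the plan above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "10")

val df1 = spark.range(0, 1000, 1, 10).selectExpr("id % 500 AS key1")
val df2 = spark.range(0, 1000, 1, 10).selectExpr("id % 500 AS key2", "id AS value2")

// Both sides of this join share a single exchange coordinator, yet explain()
// prints two different "coordinator id" values.
df1.join(df2, df1("key1") === df2("key2"))
  .select("key1", "value2")
  .explain()
{code}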






[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869314#comment-15869314
 ] 

Tejas Patil commented on SPARK-19326:
-

> You might be able to just write an `if` case that checks whether speculation 
> is enabled and run some logic in the listener to detect speculated tasks.

For ExecutorAllocationManager to detect that speculation is needed, it would 
basically have to duplicate what TaskSetManager does to find candidates for 
speculation (unless you have a better way). That's bad because there would be 
two entities making decisions about speculation.

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If the all running executors reside over a 
> set of slow / bad hosts, they will keep the job running for long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. Since the [speculated copies of tasks should run on different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation in the job completion times and more importantly SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing the bad hosts or reason for slowness 
> doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index == 1) {
>   Thread.sleep(Long.MaxValue)  // fake long running task(s)
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}






[jira] [Updated] (SPARK-19619) SparkR approxQuantile supports input multiple columns

2017-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19619:

Description: SparkR approxQuantile supports input multiple columns.  (was: 
SparkR approxQuantile support multiple columns)

> SparkR approxQuantile supports input multiple columns
> -
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile supports input multiple columns.






[jira] [Assigned] (SPARK-19619) SparkR approxQuantile support multiple columns

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19619:


Assignee: (was: Apache Spark)

> SparkR approxQuantile support multiple columns
> --
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile support multiple columns






[jira] [Assigned] (SPARK-19619) SparkR approxQuantile support multiple columns

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19619:


Assignee: Apache Spark

> SparkR approxQuantile support multiple columns
> --
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> SparkR approxQuantile support multiple columns






[jira] [Updated] (SPARK-19619) SparkR approxQuantile supports input multiple columns

2017-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19619:

Summary: SparkR approxQuantile supports input multiple columns  (was: 
SparkR approxQuantile support multiple columns)

> SparkR approxQuantile supports input multiple columns
> -
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile support multiple columns






[jira] [Commented] (SPARK-19619) SparkR approxQuantile support multiple columns

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869283#comment-15869283
 ] 

Apache Spark commented on SPARK-19619:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16951

> SparkR approxQuantile support multiple columns
> --
>
> Key: SPARK-19619
> URL: https://issues.apache.org/jira/browse/SPARK-19619
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR approxQuantile support multiple columns






[jira] [Created] (SPARK-19619) SparkR approxQuantile support multiple columns

2017-02-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-19619:
---

 Summary: SparkR approxQuantile support multiple columns
 Key: SPARK-19619
 URL: https://issues.apache.org/jira/browse/SPARK-19619
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Yanbo Liang
Priority: Minor


SparkR approxQuantile support multiple columns






[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-15 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869244#comment-15869244
 ] 

Andrew Or commented on SPARK-19326:
---

I would say it's a bad idea to make ExecutorAllocationManager talk to the 
TaskSetManager. The existing listener interface is relatively isolated. I'm not 
sure if you need to introduce a new event to capture speculation. You might be 
able to just write an `if` case that checks whether speculation is enabled and 
run some logic in the listener to detect speculated tasks.
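
A hedged sketch of that suggestion (illustration only, not the actual ExecutorAllocationManager change; it just shows that a listener can already observe speculative attempts via TaskInfo):
{code}
// Illustrative sketch: detecting speculative task attempts from a listener.
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

class SpeculationAwareListener(conf: SparkConf) extends SparkListener {
  private val speculationEnabled = conf.getBoolean("spark.speculation", false)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    if (speculationEnabled && taskStart.taskInfo.speculative) {
      // Here the allocation logic could count this attempt as additional pending demand.
      println(s"Speculative attempt started for task ${taskStart.taskInfo.taskId}")
    }
  }
}
{code}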

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If the all running executors reside over a 
> set of slow / bad hosts, they will keep the job running for long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. Since the [speculated copies of tasks should run on different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation in the job completion times and more importantly SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing the bad hosts or reason for slowness 
> doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index == 1) {
>   Thread.sleep(Long.MaxValue)  // fake long running task(s)
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}






[jira] [Resolved] (SPARK-19603) Fix StreamingQuery explain command

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19603.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Fix StreamingQuery explain command
> --
>
> Key: SPARK-19603
> URL: https://issues.apache.org/jira/browse/SPARK-19603
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> Right now StreamingQuery.explain doesn't show the correct streaming physical 
> plan.






[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869112#comment-15869112
 ] 

Tejas Patil commented on SPARK-19326:
-

Thanks for the info!

[~andrewor14] / [~kayousterhout]: I am happy to work on this. Two approaches I 
can think of are:
- Add a listener event to inform `ExecutorAllocationManager` about speculative 
task attempts.
- Stop `ExecutorAllocationManager` from depending on the listener and use some 
other event-based mechanism for communication between 
`ExecutorAllocationManager` and `TaskSetManager`. This is the cleaner solution, 
but it would be a bigger change.

What do you think?

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If the all running executors reside over a 
> set of slow / bad hosts, they will keep the job running for long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. Since the [speculated copies of tasks should run on different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation in the job completion times and more importantly SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing the bad hosts or reason for slowness 
> doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index == 1) {
>   Thread.sleep(Long.MaxValue)  // fake long running task(s)
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}






[jira] [Commented] (SPARK-19399) R Coalesce on DataFrame and coalesce on column

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869067#comment-15869067
 ] 

Apache Spark commented on SPARK-19399:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16950

> R Coalesce on DataFrame and coalesce on column
> --
>
> Key: SPARK-19399
> URL: https://issues.apache.org/jira/browse/SPARK-19399
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> coalesce on DataFrame is different from repartition, where shuffling is 
> avoided. We should have that in SparkR.
> coalesce on Column is convenient to have in expression.






[jira] [Assigned] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16122:


Assignee: (was: Apache Spark)

> Spark History Server REST API missing an environment endpoint per application
> -
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Web UI
>Affects Versions: 1.6.1
>Reporter: Neelesh Srinivas Salian
>Priority: Minor
>  Labels: Docs, WebUI
>
> The WebUI for the Spark History Server has the Environment tab that allows 
> you to view the Environment for that job.
> With Runtime , Spark properties...etc.
> How about adding an endpoint to the REST API that looks and points to this 
> environment tab for that application?
> /applications/[app-id]/environment
> Added Docs too so that we can spawn a subsequent Documentation addition to 
> get it included in the API.
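
A hedged sketch of how the proposed endpoint could be consumed once it exists (the /environment path is the proposal above, not a current endpoint; host, port and application id are illustrative):
{code}
// Illustrative only: query the proposed endpoint via the History Server's REST base (/api/v1).
import scala.io.Source

val historyServer = "http://localhost:18080"  // default History Server port
val appId = "app-20170215120000-0001"         // hypothetical application id
val url = s"$historyServer/api/v1/applications/$appId/environment"

println(Source.fromURL(url).mkString)         // would return the environment info, e.g. as JSON
{code}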






[jira] [Assigned] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16122:


Assignee: Apache Spark

> Spark History Server REST API missing an environment endpoint per application
> -
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Web UI
>Affects Versions: 1.6.1
>Reporter: Neelesh Srinivas Salian
>Assignee: Apache Spark
>Priority: Minor
>  Labels: Docs, WebUI
>
> The WebUI for the Spark History Server has the Environment tab that allows 
> you to view the Environment for that job.
> With Runtime , Spark properties...etc.
> How about adding an endpoint to the REST API that looks and points to this 
> environment tab for that application?
> /applications/[app-id]/environment
> Added Docs too so that we can spawn a subsequent Documentation addition to 
> get it included in the API.






[jira] [Commented] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869062#comment-15869062
 ] 

Apache Spark commented on SPARK-16122:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16949

> Spark History Server REST API missing an environment endpoint per application
> -
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Web UI
>Affects Versions: 1.6.1
>Reporter: Neelesh Srinivas Salian
>Priority: Minor
>  Labels: Docs, WebUI
>
> The WebUI for the Spark History Server has the Environment tab that allows 
> you to view the Environment for that job.
> With Runtime , Spark properties...etc.
> How about adding an endpoint to the REST API that looks and points to this 
> environment tab for that application?
> /applications/[app-id]/environment
> Added Docs too so that we can spawn a subsequent Documentation addition to 
> get it included in the API.






[jira] [Updated] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19460:
-

Yes - it's better to address the root issue with the column names, but it wouldn't 
hurt to avoid confusing everyone by not using iris everywhere.



> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> Running build we have a bunch of warnings from using the `iris` dataset, for 
> example.
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These are the results of having `.` in the column name. For reference, see 
> SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't 
> support that there then we should strongly consider using other dataset 
> without `.`, eg. `cars`
> And we should update this in API doc (roxygen2 doc string), vignettes, 
> programming guide, R code example.






[jira] [Commented] (SPARK-19604) Log the start of every Python test

2017-02-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869023#comment-15869023
 ] 

Yin Huai commented on SPARK-19604:
--

It has been resolved by https://github.com/apache/spark/pull/16935. 

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.3, 2.1.1
>
>
> Right now, we only have info level log after we finish the tests of a Python 
> test file. We should also log the start of a test. So, if a test is hanging, 
> we can tell which test file is running.






[jira] [Resolved] (SPARK-19604) Log the start of every Python test

2017-02-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-19604.
--
   Resolution: Fixed
Fix Version/s: 2.1.1
   2.0.3

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.3, 2.1.1
>
>
> Right now, we only have info level log after we finish the tests of a Python 
> test file. We should also log the start of a test. So, if a test is hanging, 
> we can tell which test file is running.






[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-15 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869020#comment-15869020
 ] 

Andrew Or commented on SPARK-19326:
---

Sorry for slipping on this. When I was implementing the feature the goal was to 
get it working for normal cases first, so I wouldn't be surprised if it doesn't 
work with speculation. I don't think there's a fundamental reason why it can't 
be supported. Someone just needs to implement it.

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If the all running executors reside over a 
> set of slow / bad hosts, they will keep the job running for long time
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. Since the [speculated copies of tasks should run on different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about pending 
> speculation task attempts and thinks that all the resource demands are well 
> taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variation in the job completion times and more importantly SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing the bad hosts or reason for slowness 
> doesn't scale.
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN 
> or standalone deploy mode) along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index == 1) {
>   Thread.sleep(Long.MaxValue)  // fake long running task(s)
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}






[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-15 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867124#comment-15867124
 ] 

xukun edited comment on SPARK-18113 at 2/16/17 1:58 AM:


[~aash]

According to my scenario and the [https://github.com/palantir/spark/pull/94] code:

task 678.0
outputCommitCoordinator.canCommit will match
  CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) => CommitState(attemptNumber, System.nanoTime(), MidCommit)
outputCommitCoordinator.commitDone will match
  CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter => CommitState(attemptNumber, startTime, Committed)

task 678.1
outputCommitCoordinator.canCommit will match
  CommitState(existingCommitter, _, Committed)

If the executor is preempted after outputCommitCoordinator.commitDone, the driver still enters the task dead loop.


was (Author: xukun):
[~aash]

According my scenario and [https://github.com/palantir/spark/pull/94] code

task 678.0
outputCommitCoordinator.canCommit will match 
CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) =>  
CommitState(attemptNumber, System.nanoTime(), MidCommit)

outputCommitCoordinator.commitDone match CommitState(existingCommitter, 
startTime, MidCommit) if attemptNumber == existingCommitter =>
 CommitState(attemptNumber, startTime, Committed)

task 678.1
outputCommitCoordinator.canCommit match CommitState(existingCommitter, _, 
Committed) 

then the driver enters into a task deadloop.
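
To make the transitions above easier to follow, here is a minimal, self-contained 
Scala sketch of the state machine the comment describes. The names mirror the 
comment; this is NOT the actual Apache Spark or palantir/spark#94 code.

{code}
// Minimal sketch of the CommitState transitions described above (names from the comment).
sealed trait CommitPhase
case object Uncommitted extends CommitPhase
case object MidCommit extends CommitPhase
case object Committed extends CommitPhase

case class CommitState(committer: Int, startTimeNs: Long, phase: CommitPhase)

val NO_AUTHORIZED_COMMITTER = -1

// canCommit: task 678.0 hits the first case and becomes the authorized committer;
// task 678.1 later hits the Committed case and is denied, forever.
def canCommit(state: CommitState, attemptNumber: Int): (CommitState, Boolean) =
  state match {
    case CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) =>
      (CommitState(attemptNumber, System.nanoTime(), MidCommit), true)
    case CommitState(_, _, Committed) =>
      (state, false)
    case _ =>
      (state, false)
  }

// commitDone: only the authorized attempt (678.0) moves the state to Committed.
def commitDone(state: CommitState, attemptNumber: Int): CommitState =
  state match {
    case CommitState(existing, start, MidCommit) if attemptNumber == existing =>
      CommitState(attemptNumber, start, Committed)
    case _ => state
  }
{code}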

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, so 
> the executor retries the send. The driver receives 2 AskPermissionToCommitOutput 
> messages and handles both. But the executor ignores the first response (true) 
> and only sees the second response (false). The TaskAttemptNumber for this 
> partition in authorizedCommittersByStage is then locked forever, and the driver 
> enters an infinite loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> 

[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19618:


Assignee: (was: Apache Spark)

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868963#comment-15868963
 ] 

Apache Spark commented on SPARK-19618:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/16948

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19618:


Assignee: Apache Spark

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> 
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-19618:
---

 Summary: Inconsistency wrt max. buckets allowed from Dataframe API 
vs SQL
 Key: SPARK-19618
 URL: https://issues.apache.org/jira/browse/SPARK-19618
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Tejas Patil


High number of buckets is allowed while creating a table via SQL query:

{code}
sparkSession.sql("""
CREATE TABLE bucketed_table(col1 INT) USING parquet 
CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
""")

sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)

[Num Buckets:,147483647,]
[Bucket Columns:,[col1],]
[Sort Columns:,[col1],]

{code}

Trying the same via dataframe API does not work:

{code}
> df.write.format("orc").bucketBy(147483647, 
> "j","k").sortBy("j","k").saveAsTable("bucketed_table")

java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
greater than 0 and less than 10.
  at scala.Predef$.require(Predef.scala:224)
  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
  at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
  ... 50 elided
{code}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19617:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix a case that a query may not stop due to HADOOP-14084
> 
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Saw the following exception in some test log:
> {code}
> 17/02/14 21:20:10.987 stream execution thread for this_query [id = 
> 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = 
> a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining 
> on: Thread[Thread-48,5,main]
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
>   at org.apache.hadoop.util.Shell.run(Shell.java:479)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75)
>   at 
> org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
> {code}
> This is the cause of some test timeout failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868913#comment-15868913
 ] 

Apache Spark commented on SPARK-19617:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16947

> Fix a case that a query may not stop due to HADOOP-14084
> 
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Saw the following exception in some test log:
> {code}
> 17/02/14 21:20:10.987 stream execution thread for this_query [id = 
> 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = 
> a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining 
> on: Thread[Thread-48,5,main]
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
>   at org.apache.hadoop.util.Shell.run(Shell.java:479)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75)
>   at 
> org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
> {code}
> This is the cause of some test timeout failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-15 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868914#comment-15868914
 ] 

Miao Wang commented on SPARK-19460:
---

By the way, I remember that you had a discussion about fixing the underlying 
issue in some PR review.

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> Running the build, we get a bunch of warnings from using the `iris` dataset, 
> for example:
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These are the result of having `.` in the column names. For reference, see 
> SPARK-12191 and SPARK-11976. Since supporting `.` involves changing SQL, if we 
> can't support it there then we should strongly consider using another dataset 
> without `.`, e.g. `cars`.
> We should then update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.
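
For context, a hedged Scala illustration (not SparkR) of why `.` in a column name 
is painful downstream: Spark SQL treats the dot as struct-field access, so such 
columns have to be escaped with backticks, which is why SparkR rewrites them.

{code}
// Assumes a Spark 2.x shell session (`spark` available).
import spark.implicits._

val df = Seq((5.1, 3.5)).toDF("Sepal.Length", "Sepal.Width")

// df.select("Sepal.Length").show()   // fails: "Sepal" is resolved as a struct field access
df.select("`Sepal.Length`").show()    // backticks are required to reference the column
{code}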



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19617:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix a case that a query may not stop due to HADOOP-14084
> 
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Saw the following exception in some test log:
> {code}
> 17/02/14 21:20:10.987 stream execution thread for this_query [id = 
> 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = 
> a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining 
> on: Thread[Thread-48,5,main]
> java.lang.InterruptedException
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Thread.join(Thread.java:1249)
>   at java.lang.Thread.join(Thread.java:1323)
>   at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
>   at org.apache.hadoop.util.Shell.run(Shell.java:479)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
>   at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
>   at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
>   at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75)
>   at 
> org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36)
>   at 
> org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
> {code}
> This is the cause of some test timeout failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions

2017-02-15 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868911#comment-15868911
 ] 

Miao Wang commented on SPARK-19460:
---

Seems like a lot of work. :) I can give it a try.

> Update dataset used in R documentation, examples to reduce warning noise and 
> confusions
> ---
>
> Key: SPARK-19460
> URL: https://issues.apache.org/jira/browse/SPARK-19460
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> Running the build, we get a bunch of warnings from using the `iris` dataset, 
> for example:
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> Warning in FUN(X[[4L]], ...) :
> Use Petal_Width instead of Petal.Width as column name
> Warning in FUN(X[[1L]], ...) :
> Use Sepal_Length instead of Sepal.Length as column name
> Warning in FUN(X[[2L]], ...) :
> Use Sepal_Width instead of Sepal.Width as column name
> Warning in FUN(X[[3L]], ...) :
> Use Petal_Length instead of Petal.Length as column name
> These are the result of having `.` in the column names. For reference, see 
> SPARK-12191 and SPARK-11976. Since supporting `.` involves changing SQL, if we 
> can't support it there then we should strongly consider using another dataset 
> without `.`, e.g. `cars`.
> We should then update this in the API docs (roxygen2 doc strings), vignettes, 
> programming guide, and R code examples.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084

2017-02-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19617:


 Summary: Fix a case that a query may not stop due to HADOOP-14084
 Key: SPARK-19617
 URL: https://issues.apache.org/jira/browse/SPARK-19617
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Saw the following exception in some test log:
{code}
17/02/14 21:20:10.987 stream execution thread for this_query [id = 
09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = 
a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: 
Thread[Thread-48,5,main]
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1249)
at java.lang.Thread.join(Thread.java:1323)
at org.apache.hadoop.util.Shell.joinThread(Shell.java:626)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:577)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066)
at 
org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176)
at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385)
at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog.(HDFSMetadataLog.scala:75)
at 
org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.(CompactibleFileStreamLog.scala:46)
at 
org.apache.spark.sql.execution.streaming.FileStreamSourceLog.(FileStreamSourceLog.scala:36)
at 
org.apache.spark.sql.execution.streaming.FileStreamSource.(FileStreamSource.scala:59)
at 
org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191)
{code}

This is the cause of some test timeout failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios

2017-02-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868889#comment-15868889
 ] 

Tejas Patil commented on SPARK-19326:
-

[~andrewor14] : Ping !!

> Speculated task attempts do not get launched in few scenarios
> -
>
> Key: SPARK-19326
> URL: https://issues.apache.org/jira/browse/SPARK-19326
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>
> Speculated copies of tasks do not get launched in some cases.
> Examples:
> - All the running executors have no CPU slots left to accommodate a 
> speculated copy of the task(s). If all the running executors reside on a 
> set of slow / bad hosts, they will keep the job running for a long time.
> - `spark.task.cpus` > 1 and the running executor has not filled up all its 
> CPU slots. The free slots cannot be used for speculation, since the 
> [speculated copies of tasks should run on a different 
> host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283]
>  and not on the host where the first copy was launched.
> In both these cases, `ExecutorAllocationManager` does not know about the 
> pending speculative task attempts and thinks that all the resource demands 
> are well taken care of. ([relevant 
> code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265])
> This adds variance to job completion times and, more importantly, causes SLA 
> misses :( In prod, with a large number of jobs, I see this happening more 
> often than one would think. Chasing down the bad hosts or the reason for the 
> slowness doesn't scale.
> Here is a tiny repro. Note that you need to launch this in Mesos, YARN, or 
> standalone deploy mode, along with `--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index == 1) {
>   Thread.sleep(Long.MaxValue)  // fake long running task(s)
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-02-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868887#comment-15868887
 ] 

Yanbo Liang edited comment on SPARK-18080 at 2/16/17 12:32 AM:
---

[~josephkb] I'm sorry that I did not notice that you are shepherding this task, 
and I have committed the PR. I will take a look in advance next time. Thanks.


was (Author: yanboliang):
[~josephkb] I'm sorry that I did not notice that you are shepherding this task, 
and I have committed it. I will take a look in advance next time. Thanks.

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18080.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-02-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868887#comment-15868887
 ] 

Yanbo Liang commented on SPARK-18080:
-

[~josephkb] I'm sorry that I did not notice that you are shepherding this task, 
and I have committed it. I will take a look in advance next time. Thanks.

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19599) Clean up HDFSMetadataLog

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19599.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0
   2.1.1

> Clean up HDFSMetadataLog
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue, HADOOP-14084, that prevents us 
> from removing the workaround code. Anyway, I still did some cleanup and also 
> updated the comments to point to HADOOP-14084.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19554:


Assignee: (was: Apache Spark)

> YARN backend should use history server URL for tracking when UI is disabled
> ---
>
> Key: SPARK-19554
> URL: https://issues.apache.org/jira/browse/SPARK-19554
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, if the app has disabled its UI, Spark does not set a tracking URL 
> in YARN. The UI is still available, albeit with a lag, in the history server, 
> if it's configured. We should use that as the tracking URL in these cases, 
> instead of letting YARN show its default page for applications without a UI.
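
As a hedged illustration of the scenario (the configuration keys below are the 
standard Spark/YARN settings; the history server host is a placeholder):

{code}
// App runs with its UI disabled; the history server is configured separately.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ui-disabled-app")
  .config("spark.ui.enabled", "false")
  .config("spark.eventLog.enabled", "true")
  .config("spark.yarn.historyServer.address", "historyserver.example.com:18080")
  .getOrCreate()
{code}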



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19554:


Assignee: Apache Spark

> YARN backend should use history server URL for tracking when UI is disabled
> ---
>
> Key: SPARK-19554
> URL: https://issues.apache.org/jira/browse/SPARK-19554
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, if the app has disabled its UI, Spark does not set a tracking URL 
> in YARN. The UI is still available, albeit with a lag, in the history server, 
> if it's configured. We should use that as the tracking URL in these cases, 
> instead of letting YARN show its default page for applications without a UI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868843#comment-15868843
 ] 

Apache Spark commented on SPARK-19554:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16946

> YARN backend should use history server URL for tracking when UI is disabled
> ---
>
> Key: SPARK-19554
> URL: https://issues.apache.org/jira/browse/SPARK-19554
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, if the app has disabled its UI, Spark does not set a tracking URL 
> in YARN. The UI is still available, albeit with a lag, in the history server, 
> if it's configured. We should use that as the tracking URL in these cases, 
> instead of letting YARN show its default page for applications without a UI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19616:


Assignee: Apache Spark

> weightCol and aggregationDepth should be improved for some SparkR APIs 
> ---
>
> Key: SPARK-19616
> URL: https://issues.apache.org/jira/browse/SPARK-19616
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Miao Wang
>Assignee: Apache Spark
>Priority: Minor
>
> When doing SPARK-19456, we found that "" should be considered a NULL column 
> name and should not be set. aggregationDepth should be exposed as an expert 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868782#comment-15868782
 ] 

Apache Spark commented on SPARK-19616:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16945

> weightCol and aggregationDepth should be improved for some SparkR APIs 
> ---
>
> Key: SPARK-19616
> URL: https://issues.apache.org/jira/browse/SPARK-19616
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Miao Wang
>Priority: Minor
>
> When doing SPARK-19456, we found that "" should be considered a NULL column 
> name and should not be set. aggregationDepth should be exposed as an expert 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19616:


Assignee: (was: Apache Spark)

> weightCol and aggregationDepth should be improved for some SparkR APIs 
> ---
>
> Key: SPARK-19616
> URL: https://issues.apache.org/jira/browse/SPARK-19616
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Miao Wang
>Priority: Minor
>
> When doing SPARK-19456, we found that "" should be considered a NULL column 
> name and should not be set. aggregationDepth should be exposed as an expert 
> parameter.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs

2017-02-15 Thread Miao Wang (JIRA)
Miao Wang created SPARK-19616:
-

 Summary: weightCol and aggregationDepth should be improved for 
some SparkR APIs 
 Key: SPARK-19616
 URL: https://issues.apache.org/jira/browse/SPARK-19616
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0, 2.2.0
Reporter: Miao Wang
Priority: Minor


When doing SPARK-19456, we found that "" should be considered a NULL column name 
and should not be set. aggregationDepth should be exposed as an expert 
parameter.
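
For reference, a hedged Scala illustration of the underlying ML params that the 
SparkR wrappers would expose (the column name "weight" is a placeholder):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// weightCol: an empty string ("") means "no weight column" and should not be set blindly.
// aggregationDepth: expert parameter for treeAggregate (default 2).
val lr = new LogisticRegression()
  .setWeightCol("weight")
  .setAggregationDepth(3)
{code}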



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)

2017-02-15 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: (was: Design_ColResolution_JIRA19602.docx)

> Unable to query using the fully qualified column name of the form ( 
> ..)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, 
> db2.t1
> Most of the other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example 
> only db2.t1 has an i1 column. The query would still succeed instead of 
> throwing an error.
> One way to avoid confusion would be to explicitly specify using the fully 
> qualified name db1.t1.i1 
> For e.g:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 
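
A hedged sketch of the alias workaround mentioned above, using the tables from 
the example (assumes a Spark 2.x session):

{code}
// Aliasing the tables removes the ambiguity without needing db.table.column support.
spark.sql("SELECT a.i1 FROM db1.t1 a, db2.t1 b").show()
{code}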



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)

2017-02-15 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: Design_ColResolution_JIRA19602.docx

> Unable to query using the fully qualified column name of the form ( 
> ..)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx, 
> Design_ColResolution_JIRA19602.docx
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, 
> db2.t1
> Most of the other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example 
> only db2.t1 has an i1 column. The query would still succeed instead of 
> throwing an error.
> One way to avoid confusion would be to explicitly specify using the fully 
> qualified name db1.t1.i1 
> For e.g:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-15 Thread Nick Dimiduk (JIRA)
Nick Dimiduk created SPARK-19615:


 Summary: Provide Dataset union convenience for divergent schema
 Key: SPARK-19615
 URL: https://issues.apache.org/jira/browse/SPARK-19615
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Nick Dimiduk
Priority: Minor


Creating a union DataFrame over two sources that have different schema 
definitions is surprisingly complex. Provide a version of the union method that 
will infer a target schema by merging the schemas of the sources, and 
automatically extend either side with {{null}} columns for any missing columns 
that are nullable.
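
A minimal sketch of what such a convenience could look like (the helper name 
unionByMergedSchema is assumed, not an existing Spark API; it ignores 
type widening and nested structs):

{code}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructField

def unionByMergedSchema(left: DataFrame, right: DataFrame): DataFrame = {
  val leftFields  = left.schema.map(f => f.name -> f).toMap
  val rightFields = right.schema.map(f => f.name -> f).toMap
  // Target schema: all column names from both sides, left-side order first.
  val allNames = (left.schema.map(_.name) ++ right.schema.map(_.name)).distinct

  def align(df: DataFrame, present: Map[String, StructField]): DataFrame = {
    val cols: Seq[Column] = allNames.map { name =>
      present.get(name) match {
        case Some(_) => col(name)
        case None =>
          // Missing on this side: add a typed null using the other side's type.
          val dataType = leftFields.getOrElse(name, rightFields(name)).dataType
          lit(null).cast(dataType).as(name)
      }
    }
    df.select(cols: _*)
  }

  align(left, leftFields).union(align(right, rightFields))
}
{code}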



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19614) add type-preserving null function

2017-02-15 Thread Nick Dimiduk (JIRA)
Nick Dimiduk created SPARK-19614:


 Summary: add type-preserving null function
 Key: SPARK-19614
 URL: https://issues.apache.org/jira/browse/SPARK-19614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Nick Dimiduk
Priority: Trivial


There's currently no easy way to extend the columns of a DataFrame with null 
columns that also preserves the type. {{lit(null)}} evaluates to 
{{Literal(null, NullType)}}, despite any subsequent hinting, for instance with 
{{Column.as(String, Metadata)}}. This comes up when programmatically munging 
data from disparate sources. A function such as {{null(DataType)}} would be 
nice.
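
A minimal sketch of what such a helper could look like (the name typedNull is 
assumed, not an existing API; it leans on cast to carry the type):

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, DecimalType}

// Produces a null literal that carries the requested type instead of NullType.
def typedNull(dt: DataType): Column = lit(null).cast(dt)

// Hypothetical usage: extend a DataFrame with a correctly-typed null column.
// df.withColumn("price", typedNull(DecimalType(10, 2)))
{code}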



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19497) dropDuplicates with watermark

2017-02-15 Thread sam elamin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868736#comment-15868736
 ] 

sam elamin commented on SPARK-19497:


I would love to be able to help on this, [~zsxwing]; please do get in touch if 
there is anything I can do.


> dropDuplicates with watermark
> -
>
> Key: SPARK-19497
> URL: https://issues.apache.org/jira/browse/SPARK-19497
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Shixiong Zhu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19610) multi line support for CSV

2017-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868699#comment-15868699
 ] 

Hyukjin Kwon commented on SPARK-19610:
--

Sure, let me try. Thanks for cc'ing me.

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19610) multi line support for CSV

2017-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868700#comment-15868700
 ] 

Hyukjin Kwon commented on SPARK-19610:
--

Sure, let me try. Thanks for cc'ing me.

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19610) multi line support for CSV

2017-02-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19610:
-
Comment: was deleted

(was: Sure, let me try. Thanks for cc'ing me.)

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868677#comment-15868677
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/16942

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868666#comment-15868666
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/16944

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability

2017-02-15 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19613:
--

 Summary: Flaky test: StateStoreRDDSuite.versioning and immutability
 Key: SPARK-19613
 URL: https://issues.apache.org/jira/browse/SPARK-19613
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Priority: Minor


This test: 
org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning 
and immutability failed on a recent PR: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18937) Timezone support in CSV/JSON parsing

2017-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18937:
---

Assignee: Takuya Ueshin

> Timezone support in CSV/JSON parsing
> 
>
> Key: SPARK-18937
> URL: https://issues.apache.org/jira/browse/SPARK-18937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19599:
-
Description: 
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. Anyway, I still did some clean up to make 
HDFSMetadataLog simply.

  was:
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. 


> Clean up HDFSMetadataLog
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
> from removing the workaround codes. Anyway, I still did some clean up to make 
> HDFSMetadataLog simply.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19599:
-
Description: 
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. Anyway, I still did some clean up and also 
updated the comments to point to HADOOP-14084.

  was:
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. Anyway, I still did some clean up to make 
HDFSMetadataLog simple.


> Clean up HDFSMetadataLog
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
> from removing the workaround codes. Anyway, I still did some clean up and also 
> updated the comments to point to HADOOP-14084.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18937) Timezone support in CSV/JSON parsing

2017-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18937.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16750
[https://github.com/apache/spark/pull/16750]
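For context, a hedged illustration of what this adds from the user's side: a 
{{timeZone}} option on the CSV/JSON readers and writers that controls how 
timestamps are parsed and formatted. The session setup, format pattern, and 
path below are assumptions for illustration:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-timezone").getOrCreate()

// Interpret timestamp strings in the files as GMT+09:00 rather than the JVM default zone.
val df = spark.read
  .option("header", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .option("timeZone", "GMT+09:00")
  .csv("/path/to/events.csv")
{code}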

> Timezone support in CSV/JSON parsing
> 
>
> Key: SPARK-18937
> URL: https://issues.apache.org/jira/browse/SPARK-18937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19599:
-
Description: 
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. Anyway, I still did some clean up to make 
HDFSMetadataLog simple.

  was:
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. Anyway, I still did some clean up to make 
HDFSMetadataLog simply.


> Clean up HDFSMetadataLog
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
> from removing the workaround codes. Anyway, I still did some clean up to make 
> HDFSMetadataLog simple.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19599:
-
Summary: Clean up HDFSMetadataLog  (was: Clean up HDFSMetadataLog for 
Hadoop 2.6+)

> Clean up HDFSMetadataLog
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
> from removing the workaround codes. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19599:
-
Description: 
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog

Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
from removing the workaround codes. 

  was:SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog


> Clean up HDFSMetadataLog for Hadoop 2.6+
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog
> Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us 
> from removing the workaround codes. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19329) after altering a datasource table's location to a nonexistent location, inserting data throws an exception

2017-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19329.
-
   Resolution: Fixed
 Assignee: Song Jun
Fix Version/s: 2.2.0

> after altering a datasource table's location to a nonexistent location, 
> inserting data throws an exception
> --
>
> Key: SPARK-19329
> URL: https://issues.apache.org/jira/browse/SPARK-19329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> spark.sql("create table t(a string, b int) using parquet")
> spark.sql(s"alter table t set location '$notexistedlocation'")
> spark.sql("insert into table t select 'c', 1")
> this will throw an exception:
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> $notexistedlocation;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19612) Tests failing with timeout

2017-02-15 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868565#comment-15868565
 ] 

Kay Ousterhout commented on SPARK-19612:


Does that mean we could potentially fix this by limiting the concurrency on 
Jenkins? 

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.

2017-02-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868552#comment-15868552
 ] 

Sean Owen commented on SPARK-17689:
---

This is created by, for example, HDFS copy jobs to hold files before they are 
fully written. It exists transiently and can stick around if something fails.
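As an aside, a hedged sketch of how a job that lists input files itself could 
skip these transient directories, following the usual Hadoop convention of 
ignoring underscore- and dot-prefixed paths. The path is taken from the stack 
trace below; the filter is illustration only, not the fix for this issue:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, PathFilter}

val input = new Path("hdfs://localhost:9000/input/t5")
val fs = input.getFileSystem(new Configuration())

// Skip _temporary (and other "_"/"." prefixed) entries so half-written output is not picked up.
val visible = new PathFilter {
  override def accept(p: Path): Boolean =
    !p.getName.startsWith("_") && !p.getName.startsWith(".")
}

val files = fs.listStatus(input, visible)
{code}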

> _temporary files breaks the Spark SQL streaming job.
> 
>
> Key: SPARK-17689
> URL: https://issues.apache.org/jira/browse/SPARK-17689
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Prashant Sharma
>
> Steps to reproduce:
> 1) Start a streaming job which reads from HDFS location hdfs://xyz/*
> 2) Write content to hdfs://xyz/a
> .
> .
> repeat a few times.
> And then job breaks as follows.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in 
> stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 
> 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not 
> exist: hdfs://localhost:9000/input/t5/_temporary
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19612) Tests failing with timeout

2017-02-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868549#comment-15868549
 ] 

Sean Owen commented on SPARK-19612:
---

I think this happens when Jenkins is quite busy; it probably isn't even a 
flaky-test situation. That has been my experience.
It's still a problem, but it may not be due to a test per se.

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19607) Finding QueryExecution that matches provided executionId

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868532#comment-15868532
 ] 

Apache Spark commented on SPARK-19607:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16943

> Finding QueryExecution that matches provided executionId
> 
>
> Key: SPARK-19607
> URL: https://issues.apache.org/jira/browse/SPARK-19607
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Assignee: Ala Luszczak
> Fix For: 2.2.0
>
>
> Create a method for finding the QueryExecution that matches a provided 
> executionId, for future use.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19584) Update Structured Streaming documentation to include Batch query description

2017-02-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-19584:
-

Assignee: Tyson Condie

> Update Structured Streaming documentation to include Batch query description
> 
>
> Key: SPARK-19584
> URL: https://issues.apache.org/jira/browse/SPARK-19584
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19492) Dataset, filter and pattern matching on elements

2017-02-15 Thread Niek Bartholomeus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868503#comment-15868503
 ] 

Niek Bartholomeus commented on SPARK-19492:
---

I've been having this issue since I started using Spark a year ago. I thought 
it was a minor issue that would be solved in the next update, but it's still 
there in 2.1.0. The workaround is indeed to create a val function as described 
above, or, even simpler, to wrap the element in a match clause:

{code}
departments.filter { x => x match { case Department(_, name) => name == "hr" } }
{code}
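For reference, a minimal sketch of the val-function workaround mentioned above, 
assuming the {{Department}} case class and {{departments}} Dataset from the 
issue description below; giving the closure an explicit function type lets the 
compiler expand the pattern-matching anonymous function:

{code}
// Explicitly typed function value built from a pattern-matching anonymous function.
val isHr: Department => Boolean = { case Department(_, name) => name == "hr" }

departments.filter(isHr)
{code}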

> Dataset, filter and pattern matching on elements
> 
>
> Key: SPARK-19492
> URL: https://issues.apache.org/jira/browse/SPARK-19492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Loic Descotte
>Priority: Minor
>
> It seems it is impossible to use pattern matching to define the input 
> parameter of the filter function on Datasets.
> Example :
> This one is working :
> {code}
> val departments = Seq(
> Department(1, "hr"),
> Department(2, "it")
> ).toDS
> departments.filter{ d=> 
>   d.name == "hr"
> }
> {code}
> but not this one :
> {code}
>  departments.filter{ case Department(_, name)=>
>   name == "hr"
> }
> {code}
> Error :
> {code}
> error: missing parameter type for expanded function
> The argument types of an anonymous function must be fully known. (SLS 8.5)
> Expected type was: ?
> departments.filter{ case Department(_, name)=>
> {code}
> This kind of pattern matching should work (as the departments Dataset's type 
> is known), just like the filter function on Scala collections or on RDDs, for 
> example.
> Please note that it works with the map function:
> {code}
>  departments.map{ case Department(_, name)=>
>   name
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19612) Tests failing with timeout

2017-02-15 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19612:
--

 Summary: Tests failing with timeout
 Key: SPARK-19612
 URL: https://issues.apache.org/jira/browse/SPARK-19612
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.1
Reporter: Kay Ousterhout
Priority: Minor


I've seen at least one recent test failure due to hitting the 250m timeout: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/

Filing this JIRA to track this; if it happens repeatedly we should up the 
timeout.

cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists

2017-02-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868469#comment-15868469
 ] 

Shixiong Zhu commented on SPARK-19594:
--

I suggest overriding "def postToAll(event: E)" and removing the query id after 
all listeners have processed the event.
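A self-contained sketch of that idea with heavily simplified types (these are 
not Spark's actual classes): the bus notifies every listener first and only 
then forgets the run id, instead of dropping it inside the first listener's 
callback.

{code}
import scala.collection.mutable

trait QueryListener { def onTerminated(runId: Long): Unit }

class ListenerBusSketch(listeners: Seq[QueryListener]) {
  private val activeRunIds = mutable.Set[Long]()

  def queryStarted(runId: Long): Unit =
    activeRunIds.synchronized { activeRunIds += runId }

  // Deliver the terminate event to *all* listeners, then clean up the run id once.
  def postTerminated(runId: Long): Unit = {
    val isActive = activeRunIds.synchronized { activeRunIds.contains(runId) }
    if (isActive) {
      listeners.foreach(_.onTerminated(runId))
      activeRunIds.synchronized { activeRunIds -= runId }
    }
  }
}
{code}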

> StreamingQueryListener fails to handle QueryTerminatedEvent if more than one 
> listener exists
> -
>
> Key: SPARK-19594
> URL: https://issues.apache.org/jira/browse/SPARK-19594
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Eyal Zituny
>Priority: Minor
>
> Reproduce:
> * create a spark session
> * add multiple streaming query listeners
> * create a simple query
> * stop the query
> Result -> only the first listener handles the QueryTerminatedEvent
> This might happen because the query run id is removed from 
> activeQueryRunIds once onQueryTerminated is called 
> (StreamingQueryListenerBus:115)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.

2017-02-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868466#comment-15868466
 ] 

Shixiong Zhu commented on SPARK-17689:
--

Just curious: who created "_temporary"?

> _temporary files breaks the Spark SQL streaming job.
> 
>
> Key: SPARK-17689
> URL: https://issues.apache.org/jira/browse/SPARK-17689
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Prashant Sharma
>
> Steps to reproduce:
> 1) Start a streaming job which reads from HDFS location hdfs://xyz/*
> 2) Write content to hdfs://xyz/a
> .
> .
> repeat a few times.
> And then job breaks as follows.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in 
> stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage 
> 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not 
> exist: hdfs://localhost:9000/input/t5/_temporary
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19455) Add option for case-insensitive Parquet field resolution

2017-02-15 Thread Adam Budde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868457#comment-15868457
 ] 

Adam Budde commented on SPARK-19455:


Closing this in favor of https://issues.apache.org/jira/browse/SPARK-19611

> Add option for case-insensitive Parquet field resolution
> 
>
> Key: SPARK-19455
> URL: https://issues.apache.org/jira/browse/SPARK-19455
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> This change initially included a [patch to 
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback 
> mapping when resolving field names during the schema clipping step. 
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed 
> this patch after 
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for embedding a case-sensitive schema as a Hive Metastore table property. 
> AFAIK the assumption here was that the data schema obtained from the 
> Metastore table property will be case sensitive and should match the Parquet 
> schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this 
> schema has not been embedded as a table attribute and for which the 
> underlying files contain case-sensitive field names. This will happen for any 
> Hive table that was not created by Spark or created by a version prior to 
> 2.1.0. We've seen Spark SQL return no results for any query containing a 
> case-sensitive field name for such tables.
> The change we're proposing is to introduce a configuration parameter that 
> will re-enable case-insensitive field name resolution in ParquetReadSupport. 
> This option will also disable filter push-down for Parquet, as the filter 
> predicate constructed by Spark SQL contains the case-insensitive field names, 
> for which Parquet will return 0 records when filtering against a 
> case-sensitive column name. I was hoping to find a way to construct the 
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the 
> Configuration object passed to this class to the underlying 
> InternalParquetRecordReader class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19455) Add option for case-insensitive Parquet field resolution

2017-02-15 Thread Adam Budde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Budde closed SPARK-19455.
--
Resolution: Duplicate

Closing in favor of https://issues.apache.org/jira/browse/SPARK-19611

> Add option for case-insensitive Parquet field resolution
> 
>
> Key: SPARK-19455
> URL: https://issues.apache.org/jira/browse/SPARK-19455
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> This change initially included a [patch to 
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback 
> mapping when resolving field names during the schema clipping step. 
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed 
> this patch after 
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for embedding a case-sensitive schema as a Hive Metastore table property. 
> AFAIK the assumption here was that the data schema obtained from the 
> Metastore table property will be case sensitive and should match the Parquet 
> schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this 
> schema has not been embedded as a table attribute and for which the 
> underlying files contain case-sensitive field names. This will happen for any 
> Hive table that was not created by Spark or created by a version prior to 
> 2.1.0. We've seen Spark SQL return no results for any query containing a 
> case-sensitive field name for such tables.
> The change we're proposing is to introduce a configuration parameter that 
> will re-enable case-insensitive field name resolution in ParquetReadSupport. 
> This option will also disable filter push-down for Parquet, as the filter 
> predicate constructed by Spark SQL contains the case-insensitive field names, 
> for which Parquet will return 0 records when filtering against a 
> case-sensitive column name. I was hoping to find a way to construct the 
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the 
> Configuration object passed to this class to the underlying 
> InternalParquetRecordReader class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19611:


Assignee: Apache Spark

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Apache Spark
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19455) Add option for case-insensitive Parquet field resolution

2017-02-15 Thread Adam Budde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Budde updated SPARK-19455:
---
Comment: was deleted

(was: Closing this in favor of 
https://issues.apache.org/jira/browse/SPARK-19611)

> Add option for case-insensitive Parquet field resolution
> 
>
> Key: SPARK-19455
> URL: https://issues.apache.org/jira/browse/SPARK-19455
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> This change initially included a [patch to 
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback 
> mapping when resolving field names during the schema clipping step. 
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed 
> this patch after 
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for embedding a case-sensitive schema as a Hive Metastore table property. 
> AFAIK the assumption here was that the data schema obtained from the 
> Metastore table property will be case sensitive and should match the Parquet 
> schema exactly.
> The problem arises when dealing with Parquet-backed tables for which this 
> schema has not been embedded as a table attribute and for which the 
> underlying files contain case-sensitive field names. This will happen for any 
> Hive table that was not created by Spark or created by a version prior to 
> 2.1.0. We've seen Spark SQL return no results for any query containing a 
> case-sensitive field name for such tables.
> The change we're proposing is to introduce a configuration parameter that 
> will re-enable case-insensitive field name resolution in ParquetReadSupport. 
> This option will also disable filter push-down for Parquet, as the filter 
> predicate constructed by Spark SQL contains the case-insensitive field names, 
> for which Parquet will return 0 records when filtering against a 
> case-sensitive column name. I was hoping to find a way to construct the 
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the 
> Configuration object passed to this class to the underlying 
> InternalParquetRecordReader class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868458#comment-15868458
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/16942

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19611:


Assignee: (was: Apache Spark)

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization because the underlying file statuses no longer need to be 
> resolved until after the partition pruning step, significantly reducing the 
> number of files to be touched in some cases. The downside is that the data 
> schema used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19610) multi line support for CSV

2017-02-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868424#comment-15868424
 ] 

Wenchen Fan commented on SPARK-19610:
-

[~hyukjin.kwon] do you have time to work on it?

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19610) multi line support for CSV

2017-02-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19610:
---

 Summary: multi line support for CSV
 Key: SPARK-19610
 URL: https://issues.apache.org/jira/browse/SPARK-19610
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-02-15 Thread Adam Budde (JIRA)
Adam Budde created SPARK-19611:
--

 Summary: Spark 2.1.0 breaks some Hive tables backed by 
case-sensitive data files
 Key: SPARK-19611
 URL: https://issues.apache.org/jira/browse/SPARK-19611
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Adam Budde


This issue replaces 
[SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
#16797|https://github.com/apache/spark/pull/16797]

[SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
schema inference from the HiveMetastoreCatalog class when converting a 
MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
favor of simply using the schema returned by the metastore. This is an 
optimization because the underlying file statuses no longer need to be resolved 
until after the partition pruning step, significantly reducing the number of 
files to be touched in some cases. The downside is that the data schema used may 
no longer match the underlying file schema for case-sensitive formats such as 
Parquet.

Unfortunately, this silently breaks queries over tables where the underlying 
data fields are case-sensitive but a case-sensitive schema wasn't written to 
the table properties by Spark. This situation will occur for any Hive table 
that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
user attempts to run a query over such a table containing a case-sensitive 
field name in the query projection or in the query filter, the query will 
return 0 results in every case.

The change we are proposing is to bring back the schema inference that was used 
prior to Spark 2.1.0 when a case-sensitive schema can't be read from the table 
properties, controlled by one of the following modes:
- INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
schema can be read from the table properties. Attempt to save the inferred 
schema in the table properties to avoid future inference.
- INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
don't attempt to save it.
- NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
Hive Metastore. Useful if the user knows that none of the underlying data is 
case-sensitive.

See [PR #16797|https://github.com/apache/spark/pull/16797] for more discussion 
of this issue and the proposed solution.
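
As a rough illustration of how the proposed modes might be exercised from user 
code: the sketch below is an assumption-laden example, and the config key name 
is illustrative only (the actual name is whatever the PR defines), not an 
established API at the time of this report.

{code}
// Hypothetical sketch: pick one of the proposed inference modes before
// querying a Hive table whose Parquet files contain mixed-case column names.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CaseSensitiveSchemaExample")
  .enableHiveSupport()
  // Assumed config key; INFER_AND_SAVE would infer the case-sensitive schema
  // from the data files and store it in the table properties for reuse.
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()

// With NEVER_INFER, the lower-cased metastore schema is used as-is, so a
// filter on a mixed-case Parquet column could silently match zero rows.
spark.sql("SELECT * FROM my_hive_table WHERE someCaseSensitiveCol > 10").show()
{code}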






[jira] [Commented] (SPARK-19568) Must include class/method documentation for CRAN check

2017-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868376#comment-15868376
 ] 

Felix Cheung commented on SPARK-19568:
--

That would be great. It looks like the nightly build is a Jenkins config; I 
don't find anything in the git repo on how that is set up.

> Must include class/method documentation for CRAN check
> --
>
> Key: SPARK-19568
> URL: https://issues.apache.org/jira/browse/SPARK-19568
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> While tests are running, R CMD check --as-cran is still complaining
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’ ‘arrange’
>   ‘array_contains’ ‘as.DataFrame’ ‘as.data.frame’ ‘asc’ ‘ascii’ ‘avg’
>   ‘base64’ ‘between’ ‘bin’ ‘bitwiseNOT’ ‘bround’ ‘cache’ ‘cacheTable’
>   ‘cancelJobGroup’ ‘cast’ ‘cbrt’ ‘ceil’ ‘clearCache’ ‘clearJobGroup’
>   ‘collect’ ‘colnames’ ‘colnames<-’ ‘coltypes’ ‘coltypes<-’ ‘column’
>   ‘columns’ ‘concat’ ‘concat_ws’ ‘contains’ ‘conv’ ‘corr’ ‘count’
>   ‘countDistinct’ ‘cov’ ‘covar_pop’ ‘covar_samp’ ‘crc32’
>   ‘createDataFrame’ ‘createExternalTable’ ‘createOrReplaceTempView’
>   ‘crossJoin’ ‘crosstab’ ‘cume_dist’ ‘dapply’ ‘dapplyCollect’
>   ‘date_add’ ‘date_format’ ‘date_sub’ ‘datediff’ ‘dayofmonth’
>   ‘dayofyear’ ‘decode’ ‘dense_rank’ ‘desc’ ‘describe’ ‘distinct’ ‘drop’
> ...
> {code}
> This is because of the lack of .Rd files in a clean environment when running 
> against the content of the R source package.
> I think we need to generate the .Rd files under man/ when building the 
> release and then include them in the source package.






[jira] [Commented] (SPARK-12957) Derive and propagate data constrains in logical plan

2017-02-15 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868366#comment-15868366
 ] 

Nick Dimiduk commented on SPARK-12957:
--

Filed SPARK-19609. IMHO, it would be another subtask on this ticket.

> Derive and propagate data constrains in logical plan 
> -
>
> Key: SPARK-12957
> URL: https://issues.apache.org/jira/browse/SPARK-12957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
> Attachments: ConstraintPropagationinSparkSQL.pdf
>
>
> Based on the semantics of a query plan, we can derive data constraints (e.g. 
> if a filter defines {{a > 10}}, we know that the output data of this filter 
> satisfies the constraints {{a > 10}} and {{a is not null}}). We should build 
> a framework to derive and propagate constraints in the logical plan, which 
> can help us build more advanced optimizations.
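
As a minimal illustration of the idea (a sketch assuming a Spark 2.x 
spark-shell; the exact optimized-plan text varies by version), a filter on 
{{a > 10}} lets the optimizer also derive that {{a}} is not null:

{code}
// Minimal sketch: a filter on a > 10 allows an additional constraint,
// isnotnull(a), to be derived and used by the optimizer.
import spark.implicits._

val df = Seq((1, "x"), (20, "y")).toDF("a", "b")
val filtered = df.filter($"a" > 10)

// The optimized logical plan typically shows a predicate like
// Filter (isnotnull(a) && (a > 10)), i.e. the derived not-null constraint.
filtered.explain(true)
{code}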





