[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19484:


 Summary: continue work to create a table with an empty schema
 Key: SPARK-19484
 URL: https://issues.apache.org/jira/browse/SPARK-19484
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


After SPARK-19279, we can no longer create a Hive table with an empty schema,
so we should tighten up the corresponding condition when creating a Hive table in

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835

That is, if a CatalogTable t has an empty schema and there is no
`spark.sql.schema.numParts` property (or its value is 0), we should not add a default
`col` schema. If we did, a table with an empty schema would be created, which is
not what we expect.
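
A minimal standalone sketch of the intended check (the helper name and the way the
schema/properties are passed in are illustrative assumptions, not the actual
HiveClientImpl code):
{code}
// Only pad an empty CatalogTable schema with the dummy `col` column when the real
// schema is stored in the table properties (spark.sql.schema.numParts > 0).
def mayAddDefaultColSchema(schemaIsEmpty: Boolean, props: Map[String, String]): Boolean = {
  val schemaStoredInProps = props.get("spark.sql.schema.numParts").exists(_.toInt > 0)
  schemaIsEmpty && schemaStoredInProps
}
{code}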






[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855677#comment-15855677
 ] 

Song Jun commented on SPARK-19477:
--

thanks, I got it~

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}






[jira] [Created] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog

2017-02-07 Thread Song Jun (JIRA)
Song Jun created SPARK-19491:


 Summary: add a config for tableRelation cache size in 
SessionCatalog
 Key: SPARK-19491
 URL: https://issues.apache.org/jira/browse/SPARK-19491
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently the table relation cache size is hard-coded to 1000; it would be better to
add a config to set its size.
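
A minimal sketch of what a configurable cache could look like (the config key, cache
key type, and wiring are illustrative assumptions, not the final implementation):
{code}
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Build the relation cache with a size read from configuration instead of a
// hard-coded 1000, e.g. from a SQL conf such as "spark.sql.tableRelationCacheSize".
def buildRelationCache(maxSize: Long): Cache[String, LogicalPlan] =
  CacheBuilder.newBuilder().maximumSize(maxSize).build[String, LogicalPlan]()
{code}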






[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856205#comment-15856205
 ] 

Song Jun commented on SPARK-19496:
--

I am working on this~

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Created] (SPARK-19463) refresh the table cache after InsertIntoHadoopFsRelation

2017-02-05 Thread Song Jun (JIRA)
Song Jun created SPARK-19463:


 Summary: refresh the table cache after InsertIntoHadoopFsRelation
 Key: SPARK-19463
 URL: https://issues.apache.org/jira/browse/SPARK-19463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we first cache a DataSource table and then insert some data into it, we should
refresh the cached data after the insert command.
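
A minimal repro sketch (the table name and values are illustrative):
{code}
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("CACHE TABLE t")
spark.sql("INSERT INTO t VALUES (1)")
// the cached data should be refreshed so that this query sees the newly inserted row
spark.table("t").show()
{code}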






[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:11 AM:
--

MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.

That is, MySQL returns null both when the date is invalid and when the format is
invalid.

Hive, on the other hand, normalizes the invalid date into a valid one, e.g.
2014-31-12: 31 / 12 = 2, so 2014 + 2 = 2016; 31 - 12*2 = 7, giving 2016-07-12.


was (Author: windpiger):
MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.

MySQL returns null both when the date is invalid and when the format is invalid.

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857570#comment-15857570
 ] 

Song Jun commented on SPARK-19496:
--

[~hyukjin.kwon]  

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:18 AM:
--

MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.

That is, MySQL returns null both when the date is invalid and when the format is
invalid.

Hive, on the other hand, normalizes the invalid date into a valid one, e.g.
2014-31-12: 31 / 12 = 2, so 2014 + 2 = 2016; 31 - 12*2 = 7, giving 2016-07-12.

Currently Spark can handle a wrong format / wrong date when to_date has the format
parameter (like Hive's normalization). What about making to_date without the format
parameter follow the same behavior, i.e. return a normalized date instead of null?
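
The two cases from the issue description, side by side (run in spark-shell; the exact
normalized value is whatever the lenient parsing produces):
{code}
spark.sql("SELECT to_date('2015-07-22', 'yyyy-dd-MM')").show()  // month 22 is out of range, yet a normalized date comes back
spark.sql("SELECT to_date('2014-31-12')").show()                // default format: month 31 is out of range, returns null
{code}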


was (Author: windpiger):
MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.

That is, MySQL returns null both when the date is invalid and when the format is
invalid.

Hive, on the other hand, normalizes the invalid date into a valid one, e.g.
2014-31-12: 31 / 12 = 2, so 2014 + 2 = 2016; 31 - 12*2 = 7, giving 2016-07-12.

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun commented on SPARK-19496:
--

MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.


> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:09 AM:
--

MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.

MySQL returns null both when the date is invalid and when the format is invalid.


was (Author: windpiger):
MySQL: select str_to_date('2014-12-31','%Y-%d-%m') also returns null.


> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will return `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird, and we should check other systems like Hive to see if
> this is expected.






[jira] [Created] (SPARK-19458) loading hive jars from the local repo which has already downloaded

2017-02-04 Thread Song Jun (JIRA)
Song Jun created SPARK-19458:


 Summary: loading hive jars from the local repo which has already 
downloaded
 Key: SPARK-19458
 URL: https://issues.apache.org/jira/browse/SPARK-19458
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently, when we create a new HiveClient for a specific metastore version and
`spark.sql.hive.metastore.jars` is set to `maven`, Spark downloads the Hive jars
from a remote repo (http://www.datanucleus.org/downloads/maven2).

We should allow the user to load the Hive jars from the local repo into which they
have already been downloaded.
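
For reference, a minimal sketch of the setting involved (the metastore version shown
is just an example):
{code}
import org.apache.spark.sql.SparkSession

// With metastore.jars = "maven", the jars for the requested metastore version are
// resolved from remote repositories today; the proposal is to reuse jars that have
// already been downloaded locally.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
{code}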






[jira] [Commented] (SPARK-19430) Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1

2017-02-05 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853466#comment-15853466
 ] 

Song Jun commented on SPARK-19430:
--

I think this is not a bug. If you want to access the Hive table, you can directly use
`spark.table("orc_varchar_test").show`.

> Cannot read external tables with VARCHAR columns if they're backed by ORC 
> files written by Hive 1.2.1
> -
>
> Key: SPARK-19430
> URL: https://issues.apache.org/jira/browse/SPARK-19430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Sameer Agarwal
>
> Spark throws an exception when trying to read external tables with VARCHAR 
> columns if they're backed by ORC files that were written by Hive 1.2.1 (and 
> possibly other versions of hive).
> Steps to reproduce (credits to [~lian cheng]):
> # Write an ORC table using Hive 1.2.1 with
>{noformat}
> CREATE TABLE orc_varchar_test STORED AS ORC
> AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat}
> # Get the raw path of the written ORC file
> # Create an external table pointing to this file and read the table using 
> Spark
>   {noformat}
> val path = "/tmp/orc_varchar_test"
> sql(s"create external table if not exists test (c0 varchar(10)) stored as orc 
> location '$path'")
> spark.table("test").show(){noformat}
> The problem here is that the metadata in the ORC file written by Hive is 
> different from those written by Spark. We can inspect the ORC file written 
> above:
> {noformat}
> $ hive --orcfiledump 
> file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0
> Structure for 
> file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0
> File Version: 0.12 with HIVE_8732
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<_col0:varchar(10)>   <
> ...
> {noformat}
> On the other hand, if you create an ORC table using the same DDL and inspect 
> the written ORC file, you'll see:
> {noformat}
> ...
> Type: struct<c0:string>
> ...
> {noformat}
> Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set 
> to {{false}}, which is the default case.
> I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of 
> the following error:
> {code}
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
> at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-19447) Fix input metrics for range operator

2017-02-05 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853462#comment-15853462
 ] 

Song Jun commented on SPARK-19447:
--

spark.range(1,100).show

There is some information in the SQL UI like:
{noformat}
Range
number of output rows: 99
{noformat}

I didn't see any information like `0 rows`.

Maybe I'm not looking in the right place. Could you help describe it more clearly?

Thanks!

> Fix input metrics for range operator
> 
>
> Key: SPARK-19447
> URL: https://issues.apache.org/jira/browse/SPARK-19447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Reynold Xin
>
> Range operator currently does not output any input metrics, and as a result 
> in the SQL UI the number of rows shown is always 0.






[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation

2017-02-03 Thread Song Jun (JIRA)
Song Jun created SPARK-19448:


 Summary: unify some duplication function in MetaStoreRelation
 Key: SPARK-19448
 URL: https://issues.apache.org/jira/browse/SPARK-19448
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's
toHiveTable
2. MetaStoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's
toHiveColumn
3. Address another TODO:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234






[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/21/17 11:09 AM:


I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).
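
A tiny illustration of the whitespace problem (the path is just an example):
{code}
import java.net.URI

// An unencoded space makes the string an illegal URI:
// new URI("/path/2014-01-01 00%3A00%3A00")          // throws java.net.URISyntaxException
val ok = new URI("/path/2014-01-01%2000%3A00%3A00")  // valid once the space is percent-encoded
{code}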


was (Author: windpiger):
I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun commented on SPARK-19257:
--

I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Created] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case

2017-01-24 Thread Song Jun (JIRA)
Song Jun created SPARK-19359:


 Summary: partition path created by Hive should be deleted after 
rename a partition with upper-case
 Key: SPARK-19359
 URL: https://issues.apache.org/jira/browse/SPARK-19359
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun
Priority: Minor


The Hive metastore is not case preserving and keeps partition columns with
lower-case names.

If SparkSQL creates a table with upper-case partition names using
HiveExternalCatalog, then when we rename a partition it first calls the HiveClient's
renamePartition, which creates a new lower-case partition path, and then SparkSQL
renames the lower-case path to the upper-case one.

However, if the renamed partition is more than one level deep, e.g. A=1/B=2, Hive's
renamePartition produces a=1/b=2 and SparkSQL then renames it to A=1/B=2, but the
a=1 directory still exists in the filesystem; we should delete it as well.
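
An illustrative repro sketch (table name and partition values are hypothetical):
{code}
spark.sql("CREATE TABLE t (id INT) PARTITIONED BY (A STRING, B STRING) STORED AS PARQUET")
spark.sql("ALTER TABLE t ADD PARTITION (A='1', B='2')")
spark.sql("ALTER TABLE t PARTITION (A='1', B='2') RENAME TO PARTITION (A='3', B='4')")
// Hive first creates the lower-case path a=3/b=4, Spark then renames it to A=3/B=4,
// but the intermediate a=3 directory is left behind and should be removed as well.
{code}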






[jira] [Updated] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case

2017-01-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19359:
-
Issue Type: Improvement  (was: Bug)

> partition path created by Hive should be deleted after rename a partition 
> with upper-case
> -
>
> Key: SPARK-19359
> URL: https://issues.apache.org/jira/browse/SPARK-19359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Song Jun
>Priority: Minor
>
> The Hive metastore is not case preserving and keeps partition columns with
> lower-case names.
> If SparkSQL creates a table with upper-case partition names using
> HiveExternalCatalog, then when we rename a partition it first calls the HiveClient's
> renamePartition, which creates a new lower-case partition path, and then SparkSQL
> renames the lower-case path to the upper-case one.
> However, if the renamed partition is more than one level deep, e.g. A=1/B=2, Hive's
> renamePartition produces a=1/b=2 and SparkSQL then renames it to A=1/B=2, but the
> a=1 directory still exists in the filesystem; we should delete it as well.






[jira] [Commented] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters

2017-01-25 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837407#comment-15837407
 ] 

Song Jun commented on SPARK-19340:
--

The reason is that Spark SQL treats test{00-1}.txt as a glob path.
We cannot put a file with a name like test{00-1}.txt on HDFS; it will throw an
exception.

I think this is not a bug.

> Opening a file in CSV format will result in an exception if the filename 
> contains special characters
> 
>
> Key: SPARK-19340
> URL: https://issues.apache.org/jira/browse/SPARK-19340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
>Reporter: Reza Safi
>Priority: Minor
>
> If you want to open a file whose name is like  {noformat} "*{*}*.*" 
> {noformat} or {noformat} "*[*]*.*" {noformat} using the CSV format, you will get 
> "org.apache.spark.sql.AnalysisException: Path does not exist" whether the 
> file is a local file or on HDFS.
> This bug can be reproduced on master and all other Spark 2 branches.
> To reproduce:
> # Create a file like "test{00-1}.txt" on a local directory (like in 
> /Users/reza/test/test{00-1}.txt)
> # Run spark-shell
> # Execute this command:
> {noformat}
> val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt")
> {noformat}
> You will see the following stack trace:
> {noformat}
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/Users/reza/test/test\{00-01\}.txt;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
>   ... 48 elided
> {noformat}
> If you put the file on hadoop (like on /user/root) when you try to run the 
> following:
> {noformat}
> val df=spark.read.option("header", false).csv("/user/root/*.txt")
> {noformat}
>  
> You will get the following exception:
> {noformat}
> org.apache.hadoop.mapred.InvalidInputException: Input Pattern 
> hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 

[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:41 AM:
---

I found that `CatalogPartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat between CatalogPartition and
CatalogTable?

b) Or should we just check whether the `locationUri: String` is legal when running a
DDL command, such as `alter table set location xx` or `create table t options
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found that `CatalogPartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

Should we separate CatalogTableStorageFormat between CatalogPartition and
CatalogTable?

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:00 AM:
---

I found that `CatalogPartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

Should we separate CatalogTableStorageFormat between CatalogPartition and
CatalogTable?

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:42 AM:
---

I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat between CatalogTablePartition and
CatalogTable?

b) Or should we just check whether the `locationUri: String` is legal when running a
DDL command, such as `alter table set location xx` or `create table t options
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found that `CatalogPartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat between CatalogPartition and
CatalogTable?

b) Or should we just check whether the `locationUri: String` is legal when running a
DDL command, such as `alter table set location xx` or `create table t options
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Updated] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19332:
-
Description: 
SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to
`URI`, but it has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class
`CatalogStorageFormat`
2. the type URI is fine for `CatalogTable`, but it is not proper for
`CatalogTablePartition`
3. the location of a table partition can contain unencoded whitespace, and if a
partition location contains such unencoded whitespace, constructing a URI from it
throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition
location which contains whitespace

So if we change the type to URI, it is bad for `CatalogTablePartition`.

I also found that Hive has the same issue, HIVE-6185:
before Hive 0.13 the location was a URI; after that PR it was changed to a Path,
with some checks done at DDL time.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check for the table's location, and it is not proper
to change the type to URI.


  was:
~SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to
`URI`, but it has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class
`CatalogStorageFormat`
2. the type URI is fine for `CatalogTable`, but it is not proper for
`CatalogTablePartition`
3. the location of a table partition can contain unencoded whitespace, and if a
partition location contains such unencoded whitespace, constructing a URI from it
throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition
location which contains whitespace

So if we change the type to URI, it is bad for `CatalogTablePartition`.

I also found that Hive has the same issue, ~HIVE-6185:
before Hive 0.13 the location was a URI; after that PR it was changed to a Path,
with some checks done at DDL time.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check for the table's location, and it is not proper
to change the type to URI.



> table's location should check if a URI is legal
> ---
>
> Key: SPARK-19332
> URL: https://issues.apache.org/jira/browse/SPARK-19332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to
> `URI`, but it has some problems:
> 1. `CatalogTable` and `CatalogTablePartition` use the same class
> `CatalogStorageFormat`
> 2. the type URI is fine for `CatalogTable`, but it is not proper for
> `CatalogTablePartition`
> 3. the location of a table partition can contain unencoded whitespace, and if a
> partition location contains such unencoded whitespace, constructing a URI from it
> throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition
> location which contains whitespace
> So if we change the type to URI, it is bad for `CatalogTablePartition`.
> I also found that Hive has the same issue, HIVE-6185:
> before Hive 0.13 the location was a URI; after that PR it was changed to a Path,
> with some checks done at DDL time.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732
> So I think we can do the URI check for the table's location, and it is not proper
> to change the type to URI.






[jira] [Created] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19332:


 Summary: table's location should check if a URI is legal
 Key: SPARK-19332
 URL: https://issues.apache.org/jira/browse/SPARK-19332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


~SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to
`URI`, but it has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class
`CatalogStorageFormat`
2. the type URI is fine for `CatalogTable`, but it is not proper for
`CatalogTablePartition`
3. the location of a table partition can contain unencoded whitespace, and if a
partition location contains such unencoded whitespace, constructing a URI from it
throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition
location which contains whitespace

So if we change the type to URI, it is bad for `CatalogTablePartition`.

I also found that Hive has the same issue, ~HIVE-6185:
before Hive 0.13 the location was a URI; after that PR it was changed to a Path,
with some checks done at DDL time.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check for the table's location, and it is not proper
to change the type to URI.







[jira] [Created] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19329:


 Summary: after alter a datasource table's location to a not exist 
location and then insert data throw Exception
 Key: SPARK-19329
 URL: https://issues.apache.org/jira/browse/SPARK-19329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


spark.sql("create table t(a string, b int) using parquet")
spark.sql(s"alter table t set location '$notexistedlocation'")
spark.sql("insert into table t select 'c', 1")

this will throw an exception:

com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
$notexistedlocation;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at 
org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)






[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-22 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 11:22 AM:


I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat between CatalogTablePartition and
CatalogTable?

b) Or should we just check whether the `locationUri: String` is legal when running a
DDL command, such as `alter table set location xx` or `create table t options
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

In Hive 0.12 the parameter type of the table's location method was URI; after that
version it was changed to Path:
https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L501

This is the same issue in Hive:
https://issues.apache.org/jira/browse/HIVE-6185

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found that `CatalogTablePartition` and `CatalogTable` use the same class
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this
will fail for `CatalogTablePartition`, because a table partition location can
contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI
(the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat between CatalogTablePartition and
CatalogTable?

b) Or should we just check whether the `locationUri: String` is legal when running a
DDL command, such as `alter table set location xx` or `create table t options
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52






[jira] [Created] (SPARK-19667) Create table with HiveEnabled in default database use warehouse path instead of the location of default database

2017-02-20 Thread Song Jun (JIRA)
Song Jun created SPARK-19667:


 Summary: Create table with HiveEnabled in default database use 
warehouse path instead of the location of default database
 Key: SPARK-19667
 URL: https://issues.apache.org/jira/browse/SPARK-19667
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Song Jun


Currently, when we create a managed table in the default database with Hive enabled,
Spark uses the location of the default database as the table's location. This is
fine with a non-shared metastore.

But if we use a shared metastore between different clusters, for example:
1) there is a Hive metastore in Cluster-A, the metastore uses a remote MySQL
instance as its DB, and a default database is created in the metastore, so the
location of the default database is a path in Cluster-A;

2) we then create another cluster, Cluster-B, which also uses the same remote MySQL
instance for its metastore, so the default database config in Cluster-B is
downloaded from MySQL, and its location is the path in Cluster-A;

3) when we then create a table in the default database from Cluster-B, it throws an
UnknownHost exception for Cluster-A.

In Hive 2.0.0, creating a table in a default database that is shared between
clusters is allowed; this behavior applies only to the default database, not to
other databases.

As Spark users, we want the same behavior as Hive, so that we can create tables in
the default database.






[jira] [Created] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Song Jun (JIRA)
Song Jun created SPARK-19723:


 Summary: create table for data source tables should work with an 
non-existent location
 Key: SPARK-19723
 URL: https://issues.apache.org/jira/browse/SPARK-19723
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


This JIRA is follow-up work after SPARK-19583.

As we discussed in that [PR|https://github.com/apache/spark/pull/16938],
the following DDL for a data source table with a non-existent location should work:
{code}
CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
{code}
Currently it throws an exception that the path does not exist.
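
An illustrative example of the DDL that should succeed (table name and path are
hypothetical):
{code}
spark.sql(
  """CREATE TABLE t (a INT, b INT)
    |USING parquet
    |PARTITIONED BY (b)
    |LOCATION '/tmp/path/that/does/not/exist'""".stripMargin)
{code}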






[jira] [Created] (SPARK-19724) create table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)
Song Jun created SPARK-19724:


 Summary: create table for hive tables with an existed default 
location should throw an exception
 Key: SPARK-19724
 URL: https://issues.apache.org/jira/browse/SPARK-19724
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


This JIRA is follow-up work after SPARK-19583.

As we discussed in that [PR|https://github.com/apache/spark/pull/16938],
the following DDL for a Hive table with an existing default location should throw
an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
{code}
Currently it succeeds in this situation.






[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Summary: create managed table for hive tables with an existed default 
location should throw an exception  (was: create table for hive tables with an 
existed default location should throw an exception)

> create managed table for hive tables with an existed default location should 
> throw an exception
> ---
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after SPARK-19583
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 
> The following DDL for a Hive table with an existing default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> {code}
> Currently it succeeds in this situation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Description: 
This JIRA is a follow up work after 
[SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)

As we discussed in that [PR](https://github.com/apache/spark/pull/16938)

The following DDL for a managed table with an existing default location should 
throw an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
CREATE TABLE ... (PARTITIONED BY ...)
{code}
Currently there are some situations which are not consistent with the above logic:

1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
location
situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)

2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
situation: a hive table succeeds with an existing default location


  was:
This JIRA is a follow up work after SPARK-19583

As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 

The following DDL for hive table with an existed default location should throw 
an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
{code}
Currently it will success for this situation


> create managed table for hive tables with an existed default location should 
> throw an exception
> ---
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existing default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consistent with the above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
> location
> situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: a hive table succeeds with an existing default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create a managed table with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Summary: create a managed table with an existed default location should 
throw an exception  (was: create managed table for hive tables with an existed 
default location should throw an exception)

> create a managed table with an existed default location should throw an 
> exception
> -
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existing default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consistent with the above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
> location
> situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: a hive table succeeds with an existing default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-19 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19664:
-
Description: 
In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, 
when the value of 'spark.sql.warehouse.dir' is used to overwrite 
'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it 
should be put in 'sparkContext.hadoopConfiguration' so that it overwrites the 
original value in the Hadoop configuration.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64

  was:
In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in the 
logic, when use the value of  'spark.sql.warehouse.dir' to overwrite 
'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it 
should put in 'sparkContext.hadoopConfiguration'

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64


> put 'hive.metastore.warehouse.dir' in hadoopConf place
> --
>
> Key: SPARK-19664
> URL: https://issues.apache.org/jira/browse/SPARK-19664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Priority: Minor
>
> In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, 
> when the value of 'spark.sql.warehouse.dir' is used to overwrite 
> 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it 
> should be put in 'sparkContext.hadoopConfiguration' so that it overwrites the 
> original value in the Hadoop configuration.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-19 Thread Song Jun (JIRA)
Song Jun created SPARK-19664:


 Summary: put 'hive.metastore.warehouse.dir' in hadoopConf place
 Key: SPARK-19664
 URL: https://issues.apache.org/jira/browse/SPARK-19664
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Song Jun
Priority: Minor


In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. In that logic, 
when the value of 'spark.sql.warehouse.dir' is used to overwrite 
'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'; I think it 
should be put in 'sparkContext.hadoopConfiguration'.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
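A minimal sketch of the proposed placement (illustrative only, not the actual 
SharedState code):

{code}
// resolve the warehouse path from spark.sql.warehouse.dir ...
val warehousePath = sparkContext.conf.get("spark.sql.warehouse.dir")
// ... and propagate it into the Hadoop configuration as well, so that any
// hive.metastore.warehouse.dir picked up from hive-site.xml is overwritten
sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", warehousePath)
{code}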



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19484) continue work to create a table with an empty schema

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19484.

Resolution: Won't Fix

this is already covered by https://github.com/apache/spark/pull/16787

> continue work to create a table with an empty schema
> 
>
> Key: SPARK-19484
> URL: https://issues.apache.org/jira/browse/SPARK-19484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> after SPARK-19279, we could not create a Hive table with an empty schema,
> we should tighten up the condition when create a hive table in 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835
> That is if a CatalogTable t has an empty schema, and (there is no 
> `spark.sql.schema.numParts` or its value is 0), we should not add a default 
> `col` schema, if we did, a table with an empty schema will be created, that 
> is not we expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19491.

Resolution: Duplicate

duplicate of https://github.com/apache/spark/pull/16736

> add a config for tableRelation cache size in SessionCatalog
> ---
>
> Key: SPARK-19491
> URL: https://issues.apache.org/jira/browse/SPARK-19491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> currently the table relation cache size is hardcoded to 1000; it would be better 
> to add a config to set its size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-02-12 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862768#comment-15862768
 ] 

Song Jun commented on SPARK-19558:
--

Isn't sparkSession.listenerManager.register enough?
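For reference, a minimal sketch of the existing programmatic route the comment 
refers to (the listener body is illustrative):

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

sparkSession.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName finished in ${durationNs / 1e6} ms")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})
{code}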

> Provide a config option to attach QueryExecutionListener to SparkSession
> 
>
> Key: SPARK-19558
> URL: https://issues.apache.org/jira/browse/SPARK-19558
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Salil Surendran
>
> Provide a configuration property (just like spark.extraListeners) to attach a 
> QueryExecutionListener to a SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19583) CTAS for data source tables with a created location does not work

2017-02-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865111#comment-15865111
 ] 

Song Jun commented on SPARK-19583:
--

ok, I'd like to take this one, thanks a lot!

> CTAS for data source tables with a created location does not work
> --
>
> Key: SPARK-19583
> URL: https://issues.apache.org/jira/browse/SPARK-19583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> spark.sql(
>   s"""
>  |CREATE TABLE t
>  |USING parquet
>  |PARTITIONED BY(a, b)
>  |LOCATION '$dir'
>  |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
>""".stripMargin)
> {noformat}
> Failed with the error message:
> {noformat}
> path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
> org.apache.spark.sql.AnalysisException: path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19166.

Resolution: Not A Bug

minor issue

> change method name from 
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
> InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
> 
>
> Key: SPARK-19166
> URL: https://issues.apache.org/jira/browse/SPARK-19166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Song Jun
>Priority: Minor
>
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files 
> that match a static prefix, such as a partition file path (/table/foo=1) or a 
> non-partition file path (/xxx/a.json),
> while the method name deleteMatchingPartitions indicates that only partition 
> files will be deleted. This name is confusing.
> It is better to rename the method.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-15 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548
 ] 

Song Jun edited comment on SPARK-19598 at 2/15/17 9:46 AM:
---

[~rxin] While working on this JIRA, I found that it is not proper to remove 
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, 
that is:

{quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote}

to replace 

{quote}UnresolvedRelation(tableIdentifier, alias){quote}

then, because there are many *match case* blocks for *UnresolvedRelation* whose 
matched logic uses the alias parameter of *UnresolvedRelation*, a table 
with/without an alias can currently be processed in a single *match case 
UnresolvedRelation* block; after this change, we would have to process tables 
with and without an alias separately in two *match case* blocks:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?


was (Author: windpiger):
[~rxin] When I do this jira, I found it that it is not proper to remove 
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, 
that is:

{quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote}

to replace 

{quote}UnresolvedRelation(tableIdentifier, aliase){quote}

While there are lots of  *match case* codes for *UnresolvedRelation*,  and in 
matched logic it will use the alias parameter of *UnresolvedRelation*, 
currently table with/without alias can processed in one *match case 
UnresolvedRelation* logic, after this change, we should process table with 
alias and without alias seperately in two *match case*:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-14 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867398#comment-15867398
 ] 

Song Jun commented on SPARK-19598:
--

OK~ I'd like to do this. Thank you very much!

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-15 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548
 ] 

Song Jun commented on SPARK-19598:
--

[~rxin] While working on this JIRA, I found that it is not proper to remove 
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, 
that is:

{quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote}

to replace 

{quote}UnresolvedRelation(tableIdentifier, alias){quote}

then, because there are many *match case* blocks for *UnresolvedRelation* whose 
matched logic uses the alias parameter of *UnresolvedRelation*, a table 
with/without an alias can currently be processed in a single *match case 
UnresolvedRelation* block; after this change, we would have to process tables 
with and without an alias separately in two *match case* blocks:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed

2017-02-13 Thread Song Jun (JIRA)
Song Jun created SPARK-19577:


 Summary: insert into a partition datasource table with 
InMemoryCatalog after the partition location alter by alter command failed
 Key: SPARK-19577
 URL: https://issues.apache.org/jira/browse/SPARK-19577
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we use InMemoryCatalog and insert into a partitioned datasource table whose 
partition location has been changed by `alter table t partition(a="xx") set 
location $newpath`, the insert operation succeeds and the data is written to 
$newpath, but if we then select that partition from the table, it does not 
return the values we inserted.

The reason is that InMemoryFileIndex infers partitions from the table's 
rootPath; it does not track the user-specified $newpath provided by the alter 
command.
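A rough repro sketch (table name and paths are placeholders), assuming the 
in-memory catalog is in use:

{code}
spark.sql("CREATE TABLE t (i INT, a STRING) USING parquet PARTITIONED BY (a)")
spark.sql("ALTER TABLE t ADD PARTITION (a = 'xx')")
spark.sql("ALTER TABLE t PARTITION (a = 'xx') SET LOCATION '/tmp/new-partition-path'")
spark.sql("INSERT INTO t PARTITION (a = 'xx') VALUES (1)") // data lands in the new path
spark.sql("SELECT i FROM t WHERE a = 'xx'").show()         // per this issue: returns no rows
{code}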



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869623#comment-15869623
 ] 

Song Jun commented on SPARK-19598:
--

Thanks~ let me investigate more~

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19575) Reading from or writing to a hive serde table with a non pre-existing location should succeed

2017-02-13 Thread Song Jun (JIRA)
Song Jun created SPARK-19575:


 Summary: Reading from or writing to a hive serde table with a non 
pre-existing location should succeed
 Key: SPARK-19575
 URL: https://issues.apache.org/jira/browse/SPARK-19575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, selecting from a Hive serde table whose location does not exist 
throws an exception:

```
Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:258)
```

This is a follow-up to SPARK-19329, which unified the behavior when reading 
from or writing to a datasource table with a non-pre-existing location; here we 
should unify Hive serde tables in the same way.
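A short sketch of the behaviour this issue asks for (the table name is 
hypothetical; its location directory is assumed to have been removed out of 
band):

{code}
// Reading a Hive serde table whose location no longer exists should return an
// empty result, matching the datasource-table behaviour unified in SPARK-19329,
// instead of failing with org.apache.hadoop.mapred.InvalidInputException.
spark.sql("SELECT * FROM hive_serde_table").show()
{code}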



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed

2017-02-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863511#comment-15863511
 ] 

Song Jun commented on SPARK-19577:
--

I am working on this~

> insert into a partition datasource table with InMemoryCatalog after the 
> partition location alter by alter command failed
> 
>
> Key: SPARK-19577
> URL: https://issues.apache.org/jira/browse/SPARK-19577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> If we use InMemoryCatalog and insert into a partitioned datasource table whose 
> partition location has been changed by `alter table t partition(a="xx") set 
> location $newpath`, the insert operation succeeds and the data is written 
> to $newpath, but if we then select that partition from the table, it does not 
> return the values we inserted.
> The reason is that InMemoryFileIndex infers partitions from the table's 
> rootPath; it does not track the user-specified $newpath provided by the 
> alter command.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19284) append to an existing partitioned datasource table should have no CustomPartitionLocations

2017-01-19 Thread Song Jun (JIRA)
Song Jun created SPARK-19284:


 Summary: append to an existing partitioned datasource table should 
have no CustomPartitionLocations
 Key: SPARK-19284
 URL: https://issues.apache.org/jira/browse/SPARK-19284
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun
Priority: Minor


When we append data to an existing partitioned datasource table, 
InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations currently
returns the same location as the Hive default; it should return None.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19241) remove hive generated table properties if they are not useful in Spark

2017-01-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823654#comment-15823654
 ] 

Song Jun commented on SPARK-19241:
--

I am working on this~

> remove hive generated table properties if they are not useful in Spark
> --
>
> Key: SPARK-19241
> URL: https://issues.apache.org/jira/browse/SPARK-19241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we save a table into hive metastore, hive will generate some table 
> properties automatically. e.g. transient_lastDdlTime, last_modified_by, 
> rawDataSize, etc. Some of them are useless in Spark SQL, we should remove 
> them.
> It will be good if we can get the list of Hive-generated table properties via 
> Hive API, so that we don't need to hardcode them.
> We can take a look at Hive code to see how it excludes these auto-generated 
> table properties when describe table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823657#comment-15823657
 ] 

Song Jun commented on SPARK-19153:
--

I am sorry, it is my fault, I forgot to comment.

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema

2017-01-16 Thread Song Jun (JIRA)
Song Jun created SPARK-19246:


 Summary: CataLogTable's partitionSchema should check if each 
column name in partitionColumnNames must match one and only one field in schema
 Key: SPARK-19246
 URL: https://issues.apache.org/jira/browse/SPARK-19246
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


Getting CatalogTable's partitionSchema should check that each column name in 
partitionColumnNames matches one and only one field in the schema; if not, we 
should throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Description: 
Getting CatalogTable's partitionSchema should check that each column name in 
partitionColumnNames matches one and only one field in the schema; if not, we 
should throw an exception.

CatalogTable's partitionSchema should also keep the same order as partitionColumnNames.

  was:get CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, if not we 
should throw an exception


> CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, and keep 
> order with partitionColumnNames 
> --
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> Getting CatalogTable's partitionSchema should check that each column name in 
> partitionColumnNames matches one and only one field in the schema; if not, we 
> should throw an exception.
> CatalogTable's partitionSchema should also keep the same order as 
> partitionColumnNames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Summary: CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, and keep 
order with partitionColumnNames   (was: CataLogTable's partitionSchema should 
check if each column name in partitionColumnNames must match one and only one 
field in schema)

> CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, and keep 
> order with partitionColumnNames 
> --
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> Getting CatalogTable's partitionSchema should check that each column name in 
> partitionColumnNames matches one and only one field in the schema; if not, we 
> should throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check order

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Summary: CataLogTable's partitionSchema should check order  (was: 
CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, and keep 
order with partitionColumnNames )

> CataLogTable's partitionSchema should check order
> ---
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> Getting CatalogTable's partitionSchema should check that each column name in 
> partitionColumnNames matches one and only one field in the schema; if not, we 
> should throw an exception.
> CatalogTable's partitionSchema should also keep the same order as 
> partitionColumnNames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-17 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825645#comment-15825645
 ] 

Song Jun commented on SPARK-19257:
--

I am working on this~ thanks

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19761) create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero

2017-02-27 Thread Song Jun (JIRA)
Song Jun created SPARK-19761:


 Summary: create InMemoryFileIndex with empty rootPaths when set 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero
 Key: SPARK-19761
 URL: https://issues.apache.org/jira/browse/SPARK-19761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we create an InMemoryFileIndex with empty rootPaths when 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it throws an exception:

{code}
Positive number of slices required
java.lang.IllegalArgumentException: Positive number of slices required
at 
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119)
at 
org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(InMemoryFileIndex.scala:50)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186)
at 
org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19763) qualified external datasource table location stored in catalog

2017-02-27 Thread Song Jun (JIRA)
Song Jun created SPARK-19763:


 Summary: qualified external datasource table location stored in 
catalog
 Key: SPARK-19763
 URL: https://issues.apache.org/jira/browse/SPARK-19763
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we create an external datasource table with a non-qualified location, we 
should qualify it before storing it in the catalog.

{code}
CREATE TABLE t(a string)
USING parquet
LOCATION '/path/xx'


CREATE TABLE t1(a string, b string)
USING parquet
PARTITIONED BY(b)
LOCATION '/path/xx'
{code}

When we get the table from the catalog, the location should be qualified, 
e.g. 'file:/path/xxx'.
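A minimal sketch of qualifying a raw location before it is stored (a 
hypothetical helper, not the actual catalog code):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def qualify(location: String, hadoopConf: Configuration): String = {
  val path = new Path(location)
  val fs = path.getFileSystem(hadoopConf)
  // e.g. '/path/xx' becomes 'file:/path/xx' on a local file system
  fs.makeQualified(path).toString
}
{code}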



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19748) refresh for InMemoryFileIndex with FileStatusCache does not work correctly

2017-02-26 Thread Song Jun (JIRA)
Song Jun created SPARK-19748:


 Summary: refresh for InMemoryFileIndex with FileStatusCache does 
not work correctly
 Key: SPARK-19748
 URL: https://issues.apache.org/jira/browse/SPARK-19748
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we refresh an InMemoryFileIndex that has a FileStatusCache, it first uses the 
FileStatusCache to generate cachedLeafFiles etc., and only then calls 
FileStatusCache.invalidateAll. The order of these two actions is wrong, and it 
means the refresh does not take effect.

{code}
  override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
  }

  private def refresh0(): Unit = {
val files = listLeafFiles(rootPaths)
cachedLeafFiles =
  new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => 
f.getPath -> f)
cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
cachedPartitionSpec = null
  }
{code}
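A minimal sketch of the corrected ordering, assuming the same fields as in the 
snippet above: invalidate the cache first, then re-list.

{code}
  override def refresh(): Unit = {
    fileStatusCache.invalidateAll()  // drop stale cached statuses first
    refresh0()                       // then re-list leaf files from the root paths
  }
{code}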



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19742) When using SparkSession to write a dataset to Hive the schema is ignored

2017-02-27 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885399#comment-15885399
 ] 

Song Jun commented on SPARK-19742:
--

this is expected, see the comment.

{code}
/**
   * Inserts the content of the `DataFrame` to the specified table. It requires 
that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just 
uses position-based
   * resolution. For example:
   *
   * {{{
   *scala> Seq((1, 2)).toDF("i", 
"j").write.mode("overwrite").saveAsTable("t1")
   *scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
   *scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
   *scala> sql("select * from t1").show
   *+---+---+
   *|  i|  j|
   *+---+---+
   *|  5|  6|
   *|  3|  4|
   *|  1|  2|
   *+---+---+
   * }}}
   *
   * Because it inserts data to an existing table, format or options will be 
ignored.
   *
   * @since 1.4.0
   */
  def insertInto(tableName: String): Unit = {

insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
  }
{code}

> When using SparkSession to write a dataset to Hive the schema is ignored
> 
>
> Key: SPARK-19742
> URL: https://issues.apache.org/jira/browse/SPARK-19742
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.1
> Environment: Running on Ubuntu with HDP 2.4.
>Reporter: Navin Goel
>
> I am saving a Dataset that is created form reading a json and some selects 
> and filters into a hive table. The dataset.write().insertInto function does 
> not look at schema when writing to the table but instead writes in order to 
> the hive table.
> The schemas for both the tables are same.
> schema printed from spark of the dataset being written:
> StructType(StructField(countrycode,StringType,true), 
> StructField(systemflag,StringType,true), 
> StructField(classcode,StringType,true), 
> StructField(classname,StringType,true), 
> StructField(rangestart,StringType,true), 
> StructField(rangeend,StringType,true), 
> StructField(tablename,StringType,true), 
> StructField(last_updated_date,TimestampType,true))
> Schema of the dataset after loading the same table from Hive:
> StructType(StructField(systemflag,StringType,true), 
> StructField(RangeEnd,StringType,true), 
> StructField(classcode,StringType,true), 
> StructField(classname,StringType,true), 
> StructField(last_updated_date,TimestampType,true), 
> StructField(countrycode,StringType,true), 
> StructField(rangestart,StringType,true), 
> StructField(tablename,StringType,true))



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19784) refresh datasource table after alter the location

2017-03-01 Thread Song Jun (JIRA)
Song Jun created SPARK-19784:


 Summary: refresh datasource table after alter the location
 Key: SPARK-19784
 URL: https://issues.apache.org/jira/browse/SPARK-19784
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, if we alter the location of a datasource table and then select from 
it, it still returns the data from the old location.
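A rough repro sketch (table name and paths are placeholders):

{code}
spark.sql("CREATE TABLE t (a INT) USING parquet LOCATION '/tmp/old-path'")
spark.sql("ALTER TABLE t SET LOCATION '/tmp/new-path'")
spark.sql("SELECT * FROM t").show()  // per this issue: still returns data from /tmp/old-path
// expected: the cached relation should be refreshed after ALTER TABLE ... SET LOCATION
{code}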



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18137) RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure

2016-10-27 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18137:
-
Description: 
When running a SQL query with distinct (on the Spark GitHub master branch), it 
throws an UnresolvedException.

For example, run a test case on Spark (branch master) with the SQL:
SELECT percentile_approx(key, 0.9), count(distinct key), sum(distinct key) 
FROM src LIMIT 1

and it throws the exception:

```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
DOUBLE), CAST(0.9BD AS DOUBLE), 1)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 

[jira] [Created] (SPARK-18137) RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure

2016-10-27 Thread Song Jun (JIRA)
Song Jun created SPARK-18137:


 Summary: RewriteDistinctAggregates UnresolvedException because of 
UDAF TypeCheckFailure
 Key: SPARK-18137
 URL: https://issues.apache.org/jira/browse/SPARK-18137
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Song Jun


When running a SQL query with distinct (on the master branch), it throws an 
UnresolvedException.

For example, run a test case on Spark (branch master) with the SQL:
SELECT percentile_approx(key, 0.9), count(distinct key), sum(distinct key) 
FROM src LIMIT 1

and it throws the exception:

```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
DOUBLE), CAST(0.9BD AS DOUBLE), 1)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
 

[jira] [Updated] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck

2016-10-27 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18137:
-
Summary: RewriteDistinctAggregates UnresolvedException when a UDAF has a 
foldable TypeCheck  (was: RewriteDistinctAggregates UnresolvedException because 
of UDAF TypeCheckFailure)

> RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable 
> TypeCheck
> --
>
> Key: SPARK-18137
> URL: https://issues.apache.org/jira/browse/SPARK-18137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>
> When running a SQL query with distinct aggregates (on the Spark GitHub master branch), it throws an 
> UnresolvedException.
> For example:
> run the following SQL on Spark (branch master):
> SELECT percentile_approx(key, 0.9), count(distinct key), sum(distinct key) 
> FROM src LIMIT 1
> and it throw exception:
> ```
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS 
> DOUBLE), CAST(0.9BD AS DOUBLE), 1)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> 

[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646333#comment-15646333
 ] 

Song Jun commented on SPARK-18298:
--

[~WangTao] I tested it, and the UI should show the user's local time as 
1.6.x did, so I think this is a bug, and I have posted a pull request. A minimal conversion sketch follows.
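
A minimal sketch of the expected conversion, not Spark's actual Web UI code; the timestamp string is taken from the REST output quoted below:

{code}
import java.text.SimpleDateFormat
import java.util.TimeZone

// Sketch only: parse the GMT timestamp exposed by the REST API and re-render it
// in the viewer's local time zone, which is what the 1.6.x UI effectively showed.
object LocalTimeSketch {
  def main(args: Array[String]): Unit = {
    val gmtFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSzzz")
    val started = gmtFormat.parse("2016-11-04T02:06:06.020GMT")

    val localFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    localFormat.setTimeZone(TimeZone.getDefault) // e.g. CST gives 2016-11-04 10:06:06
    println(localFormat.format(started))
  }
}
{code}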

> HistoryServer use GMT time all time
> ---
>
> Key: SPARK-18298
> URL: https://issues.apache.org/jira/browse/SPARK-18298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.0.1
> Environment: suse 11.3 with CST time
>Reporter: Tao Wang
>
> When I started the HistoryServer to read event logs, the timestamps read 
> were parsed using the local timezone, e.g. "CST" (confirmed via debugging).
> But the time-related columns like "Started"/"Completed"/"Last Updated" in the 
> History Server UI use "GMT" time, which is 8 hours earlier than "CST".
> {quote}
> App IDApp NameStarted Completed   DurationSpark 
> User  Last UpdatedEvent Log
> local-1478225166651   Spark shell 2016-11-04 02:06:06 2016-11-07 
> 01:33:30 71.5 h  root2016-11-07 01:33:30
> {quote}
> I've checked the REST api and found the result like:
> {color:red}
> [ {
>   "id" : "local-1478225166651",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:06:06.020GMT",  
> "endTime" : "2016-11-07T01:33:30.265GMT",  
> "lastUpdated" : "2016-11-07T01:33:30.000GMT",
> "duration" : 257244245,
> "sparkUser" : "root",
> "completed" : true,
> "lastUpdatedEpoch" : 147848241,
> "endTimeEpoch" : 1478482410265,
> "startTimeEpoch" : 1478225166020
>   } ]
> }, {
>   "id" : "local-1478224925869",
>   "name" : "Spark Pi",
>   "attempts" : [ {
> "startTime" : "2016-11-04T02:02:02.133GMT",
> "endTime" : "2016-11-04T02:02:07.468GMT",
> "lastUpdated" : "2016-11-04T02:02:07.000GMT",
> "duration" : 5335,
> "sparkUser" : "root",
> "completed" : true,
> ...
> {color}
> So maybe the change happens while transferring between server and browser? I 
> have no idea where to go from this point.
> Hope you can offer some help, or just fix it if it's easy. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun edited comment on SPARK-18055 at 11/8/16 5:04 AM:
---

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11); my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

There is no exception. Has this been fixed, or is my code wrong?

I also tested with test-jar_2.11-1.0.jar in spark-shell:
>spark-shell --jars test-jar_2.11-1.0.jar

and there was no exception.



was (Author: windpiger):
[~davies] I can't reproduce it on the master branch or on databricks(Spark 
2.0.1-db1 (Scala 2.11)),my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

there is no exception, it has fixed? or My code is wrong?
or I must test it with customed jar?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun edited comment on SPARK-18055 at 11/8/16 4:51 AM:
---

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11); my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

There is no exception. Has this been fixed, or is my code wrong?
Or must I test it with the custom jar?


was (Author: windpiger):
[~davies] I can't reproduce it on the master branch or on databricks(Spark 
2.0.1-db1 (Scala 2.11)),my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

there is no exception, it has fixed? or My code is wrong?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2016-11-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477
 ] 

Song Jun commented on SPARK-18055:
--

[~davies] I can't reproduce it on the master branch or on Databricks (Spark 
2.0.1-db1, Scala 2.11); my code:

import scala.collection.mutable.ArrayBuffer
case class MyData(id: String,arr: Seq\[String])
val myarr = ArrayBuffer\[MyData]()
for(i <- 20 to 30){
val arr = ArrayBuffer\[String]()
for(j <- 1 to 10) {
arr += (i+j).toString
}

val mydata = new MyData(i.toString,arr)
myarr += mydata
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS

ds.rdd.flatMap(_.arr)
ds.flatMap(_.arr)

There is no exception. Has this been fixed, or is my code wrong?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on Dataset column which of of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesnt work on dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange

2016-11-17 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673918#comment-15673918
 ] 

Song Jun commented on SPARK-18490:
--

It is not a bug; this is just a simplification of the code.

> duplicate nodename extrainfo of ShuffleExchange
> ---
>
> Key: SPARK-18490
> URL: https://issues.apache.org/jira/browse/SPARK-18490
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Song Jun
>Priority: Trivial
>
> {noformat}
>   override def nodeName: String = {
> val extraInfo = coordinator match {
>   case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated =>
> s"(coordinator id: ${System.identityHashCode(coordinator)})"
>   case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated =>
> s"(coordinator id: ${System.identityHashCode(coordinator)})"
>   case None => ""
> }
> {noformat}
>  Whether exchangeCoordinator.isEstimated is true or false, the result is the same.
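
A hedged, self-contained sketch of the simplification suggested above; the names mirror the issue, not the actual ShuffleExchange class, and the node-name prefix is assumed:

{code}
// Sketch only: both `Some` branches in the quoted code build the same string,
// so the isEstimated guard can be dropped without changing behavior.
object NodeNameSketch {
  final case class Coordinator(isEstimated: Boolean) // stand-in for ExchangeCoordinator

  def nodeName(coordinator: Option[Coordinator]): String = {
    val extraInfo = coordinator match {
      case Some(_) => s"(coordinator id: ${System.identityHashCode(coordinator)})"
      case None    => ""
    }
    s"Exchange $extraInfo" // assumed suffix usage; the real node name may differ
  }

  def main(args: Array[String]): Unit = {
    println(nodeName(Some(Coordinator(isEstimated = true))))
    println(nodeName(None))
  }
}
{code}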



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange

2016-11-17 Thread Song Jun (JIRA)
Song Jun created SPARK-18490:


 Summary: duplicate nodename extrainfo of ShuffleExchange
 Key: SPARK-18490
 URL: https://issues.apache.org/jira/browse/SPARK-18490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Song Jun
Priority: Minor


  override def nodeName: String = {
val extraInfo = coordinator match {
  case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated =>
s"(coordinator id: ${System.identityHashCode(coordinator)})"
  case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated =>
s"(coordinator id: ${System.identityHashCode(coordinator)})"
  case None => ""
}

 Whether exchangeCoordinator.isEstimated is true or false, the result is the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange

2016-11-17 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18490:
-
Component/s: (was: Spark Core)
 SQL

> duplicate nodename extrainfo of ShuffleExchange
> ---
>
> Key: SPARK-18490
> URL: https://issues.apache.org/jira/browse/SPARK-18490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Song Jun
>Priority: Minor
>
>   override def nodeName: String = {
> val extraInfo = coordinator match {
>   case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated =>
> s"(coordinator id: ${System.identityHashCode(coordinator)})"
>   case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated =>
> s"(coordinator id: ${System.identityHashCode(coordinator)})"
>   case None => ""
> }
>  Whether exchangeCoordinator.isEstimated is true or false, the result is the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly

2016-10-27 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614192#comment-15614192
 ] 

Song Jun commented on SPARK-14171:
--

This issue does not reproduce on the current Spark master branch, because 
Spark has implemented its own percentile_approx (ApproximatePercentile.scala) instead of 
using Hive's, so there is no constant type-check exception.

But there is another problem with the SQL:
select percentile_approx(key, 0.9), count(distinct key), sum(distinct key) 
from src limit 1

Details are in issue [SPARK-18137].

> UDAF aggregates argument object inspector not parsed correctly
> --
>
> Key: SPARK-14171
> URL: https://issues.apache.org/jira/browse/SPARK-14171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>Priority: Critical
>
> For example, when using percentile_approx and count distinct together, it 
> raises an error complaining the argument is not constant. We have a test case 
> to reproduce. Could you help look into a fix of this? This was working in 
> previous version (Spark 1.4 + Hive 0.13). Thanks!
> {code}--- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with 
> TestHiveSingleton with SQLTestUtils {
>  checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM 
> src LIMIT 1"),
>sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq)
> +
> +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct 
> key) FROM src LIMIT 1"),
> +  sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq)
> }
>test("UDFIntegerToString") {
> {code}
> When running the test suite, we can see this error:
> {code}
> - Generic UDAF aggregates *** FAILED ***
>   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: 
> hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   ...
>   Cause: java.lang.reflect.InvocationTargetException:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>   ...
>   Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second 
> argument must be a constant, but double was passed instead.
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598)
>   at 
> 

[jira] [Created] (SPARK-18271) hash udf in HiveSessionCatalog.hiveFunctions seq is redundant

2016-11-04 Thread Song Jun (JIRA)
Song Jun created SPARK-18271:


 Summary: hash udf in HiveSessionCatalog.hiveFunctions seq is 
redundant
 Key: SPARK-18271
 URL: https://issues.apache.org/jira/browse/SPARK-18271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Song Jun
Priority: Minor


When looking up a function in HiveSessionCatalog, Spark first looks in its own built-in 
functionRegistry; only if the function is not found there does it look in Hive's built-in 
functions, which are listed in Seq(hiveFunctions).

But the [hash] function is already in Spark's built-in functionRegistry, so lookup will 
never reach Hive's hiveFunctions list for it; the [hash] entry in hiveFunctions is 
therefore redundant and can be removed. A lookup-order sketch follows.
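
A hedged sketch of the lookup order described above; the collection names and their contents are illustrative, not Spark's actual fields:

{code}
// Illustrative only: model "look in Spark's built-in registry first, fall back to
// the Hive list second", to show why a "hash" entry in the Hive list is dead code.
object FunctionLookupSketch {
  val sparkBuiltins: Set[String] = Set("hash", "count", "sum")      // "hash" is already here
  val hiveFallback: Seq[String]  = Seq("histogram_numeric", "hash") // so this "hash" never wins

  def resolve(name: String): String =
    if (sparkBuiltins.contains(name)) s"$name -> Spark built-in"
    else if (hiveFallback.contains(name)) s"$name -> Hive fallback"
    else s"$name -> not found"

  def main(args: Array[String]): Unit =
    println(resolve("hash")) // resolves to the Spark built-in, never reaching the Hive list
}
{code}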



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-12-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745309#comment-15745309
 ] 

Song Jun commented on SPARK-18609:
--

open another jira SPARK-18841

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> 

[jira] [Created] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Song Jun (JIRA)
Song Jun created SPARK-18841:


 Summary: PushProjectionThroughUnion exception when there are same 
column
 Key: SPARK-18841
 URL: https://issues.apache.org/jira/browse/SPARK-18841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Song Jun


DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;
DROP TABLE IF EXISTS p3 ;
CREATE TABLE p1 (col STRING) ;
CREATE TABLE p2 (col STRING) ;
CREATE TABLE p3 (col STRING) ;

set spark.sql.crossJoin.enabled = true;

SELECT
  1 as cste,
  col
FROM (
  SELECT
col as col
  FROM (
SELECT
  p1.col as col
FROM p1
LEFT JOIN p2 
UNION ALL
SELECT
  col
FROM p3
  ) T1
) T2
;

It throws an exception:


key not found: col#16
java.util.NoSuchElementException: key not found: col#16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-12-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728042#comment-15728042
 ] 

Song Jun commented on SPARK-18609:
--

I'm working on this~

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> 

[jira] [Created] (SPARK-18742) readd spark.broadcast.factory for user-defined BroadcastFactory

2016-12-06 Thread Song Jun (JIRA)
Song Jun created SPARK-18742:


 Summary: readd spark.broadcast.factory for user-defined 
BroadcastFactory
 Key: SPARK-18742
 URL: https://issues.apache.org/jira/browse/SPARK-18742
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Song Jun
Priority: Minor


After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. No code
in Spark 2 uses BroadcastFactory other than TorrentBroadcastFactory, yet
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

which is not correct, since there is no way to plug in a custom
user-specified BroadcastFactory.

It would be better to re-add spark.broadcast.factory so that a user-defined BroadcastFactory can be plugged in.
 
[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19154) support read and overwrite a same table

2017-01-10 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816942#comment-15816942
 ] 

Song Jun commented on SPARK-19154:
--

I am working on this~
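
A hedged sketch of the usage the issue below wants to allow; this is not a regression test from any PR, and the table name and filter are made up:

{code}
// Illustrative only: read a table and overwrite that same table in one job,
// which the check introduced in SPARK-5746 used to forbid.
import org.apache.spark.sql.SparkSession

object ReadOverwriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-overwrite").enableHiveSupport().getOrCreate()
    val filtered = spark.table("src").filter("key > 10")
    // previously rejected (reading from and overwriting the same table); the proposal allows it
    filtered.write.mode("overwrite").saveAsTable("src")
    spark.stop()
  }
}
{code}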

> support read and overwrite a same table
> ---
>
> Key: SPARK-19154
> URL: https://issues.apache.org/jira/browse/SPARK-19154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> In SPARK-5746, we forbade users from reading and overwriting the same table. It seems 
> like we don't need this limitation now; we can remove the check and add 
> regression tests. We may need to take care of partitioned tables, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

2017-01-10 Thread Song Jun (JIRA)
Song Jun created SPARK-19166:


 Summary: change method name from 
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
 Key: SPARK-19166
 URL: https://issues.apache.org/jira/browse/SPARK-19166
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Song Jun
Priority: Minor


InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files 
that match a static prefix, such as a partition path (/table/foo=1) or a 
non-partition file path (/xxx/a.json).

The method name deleteMatchingPartitions, however, suggests that only 
partition files will be deleted, which is confusing.

It would be better to rename the method. A sketch of the prefix deletion it performs follows.
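
A hedged illustration of "delete all files under a static prefix" using Hadoop's FileSystem API; this is not the body of the Spark method, just what the proposed name describes, and the paths are hypothetical:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: remove everything under a given prefix, whether it is a
// partition-style directory or a plain file path.
object DeleteMatchingPrefixSketch {
  def deleteMatchingPrefix(prefix: String): Unit = {
    val path = new Path(prefix)
    val fs = path.getFileSystem(new Configuration())
    if (fs.exists(path)) fs.delete(path, true) // recursive delete under the prefix
  }

  def main(args: Array[String]): Unit = {
    deleteMatchingPrefix("/tmp/table/foo=1") // partition-style prefix
    deleteMatchingPrefix("/tmp/xxx/a.json")  // plain file prefix, hence the proposed rename
  }
}
{code}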



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-28 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678
 ] 

Song Jun edited comment on SPARK-18930 at 12/29/16 6:44 AM:


From the Hive documentation, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

Note that the dynamic partition values are selected by ordering, not by name, and 
are taken as the last columns of the select clause.

Testing this on Hive shows the same behavior as your description (a sketch of the corrected insert is below).

I think we can close this JIRA? [~srowen]
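
A hedged sketch of the corrected insert implied by that ordering rule, reusing the table and column names from the issue; this is illustrative, not a tested patch:

{code}
// Illustrative only: with dynamic partitioning, the partition column must come
// last in the SELECT, so `day` is moved to the end and the count feeds `num`.
import org.apache.spark.sql.SparkSession

object DynamicPartitionInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dyn-partition-insert").enableHiveSupport().getOrCreate()
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
        |SELECT count(*) AS num, day
        |FROM hss.session
        |WHERE year = 2016 AND month = 4
        |GROUP BY day""".stripMargin)
    spark.stop()
  }
}
{code}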


was (Author: windpiger):
from hive document, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

Note that the dynamic partition values are selected by ordering, not name, and 
taken as the last columns from the select clause.

and test it on hive also have the same logic as your description .

I think we can close this jira?

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulting directories on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine, these numbers are record counts. But when I do select * 
> from temp.test_partitioning_4, the data is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-28 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678
 ] 

Song Jun commented on SPARK-18930:
--

From the Hive documentation, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

Note that the dynamic partition values are selected by ordering, not by name, and 
are taken as the last columns of the select clause.

Testing this on Hive shows the same behavior as your description.

I think we can close this JIRA?

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulting directories on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine, these numbers are record counts. But when I do select * 
> from temp.test_partitioning_4, the data is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported

2016-12-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18742:
-
Description: 
After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of 
BroadcastFactory is TorrentBroadcastFactory, and the spark.broadcast.factory 
conf has been removed. 

However, the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should update the comment to state that SparkContext does not use a 
user-specified BroadcastFactory implementation.

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30

  was:
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should modify the comment that SparkContext will not use a  
user-specified BroadcastFactory implementation

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30


> Clarify that user-defined BroadcastFactory is not supported
> ---
>
> Key: SPARK-18742
> URL: https://issues.apache.org/jira/browse/SPARK-18742
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Song Jun
>Priority: Trivial
>
> After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation 
> of BroadcastFactory is TorrentBroadcastFactory and the 
> spark.broadcast.factory conf has removed. 
> however the scaladoc says [2]:
> /**
>  * An interface for all the broadcast implementations in Spark (to allow
>  * multiple broadcast implementations). SparkContext uses a user-specified
>  * BroadcastFactory implementation to instantiate a particular broadcast for 
> the
>  * entire Spark job.
>  */
> so we should modify the comment that SparkContext will not use a  
> user-specified BroadcastFactory implementation
> [1] https://issues.apache.org/jira/browse/SPARK-12588
> [2] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported

2016-12-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18742:
-
Description: 
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should modify the comment that SparkContext will not use a  
user-specified BroadcastFactory implementation

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30

  was:
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. No code
in Spark 2 uses BroadcastFactory (but TorrentBroadcastFactory) however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

which is not correct since there is no way to plug in a custom
user-specified BroadcastFactory.

It is better to readd spark.broadcast.factory for user-defined BroadcastFactory
 
[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30


> Clarify that user-defined BroadcastFactory is not supported
> ---
>
> Key: SPARK-18742
> URL: https://issues.apache.org/jira/browse/SPARK-18742
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Song Jun
>Priority: Trivial
>
> After SPARK-12588 Remove HTTPBroadcast [1], the one and only
> implementation of BroadcastFactory is TorrentBroadcastFactory. however
> the scaladoc says [2]:
> /**
>  * An interface for all the broadcast implementations in Spark (to allow
>  * multiple broadcast implementations). SparkContext uses a user-specified
>  * BroadcastFactory implementation to instantiate a particular broadcast for 
> the
>  * entire Spark job.
>  */
> so we should modify the comment that SparkContext will not use a  
> user-specified BroadcastFactory implementation
> [1] https://issues.apache.org/jira/browse/SPARK-12588
> [2] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog

2017-03-18 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20013:
-
Description: 
Currently, when we create or rename a managed table, we compute the 
defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated 
in its two subclasses, HiveExternalCatalog and InMemoryCatalog; additionally 
there is also a defaultTablePath in SessionCatalog, so we currently have the 
defaultTablePath logic in three classes. 
We had better unify them in SessionCatalog.

To unify them, we should move some logic from ExternalCatalog to 
SessionCatalog; renameTable is one such case.

However, renameTable currently takes only these simple parameters:
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
so even if we move the defaultTablePath logic to SessionCatalog, we cannot pass 
the new path to renameTable.

So we can add a newTablePath parameter to renameTable in ExternalCatalog (sketched below).
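
A hedged sketch of the proposed signature change; the trait is illustrative rather than the real ExternalCatalog, the parameter name follows this description, and the final patch may differ:

{code}
import java.net.URI

// Sketch only: the point is the extra newTablePath parameter, computed once in
// SessionCatalog and passed down to the catalog implementations.
trait ExternalCatalogSketch {
  // current shape, as quoted above
  def renameTable(db: String, oldName: String, newName: String): Unit

  // proposed shape: the caller supplies the default path of the renamed managed table,
  // so HiveExternalCatalog and InMemoryCatalog no longer each recompute it
  def renameTable(db: String, oldName: String, newName: String, newTablePath: URI): Unit
}
{code}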

  was:
merge renameTable to alterTable in ExternalCatalog has some reasons:
1. In Hive, we rename a Table by alterTable
2. Currently when we create / rename a managed table, we should get the 
defaultTablePath for them in ExternalCatalog, then we have two defaultTablePath 
logic in its two subclass HiveExternalCatalog and InMemoryCatalog, additionally 
there is also a defaultTablePath in SessionCatalog, so till now we have three 
defaultTablePath in three classes. 
we'd better to unify them up to SessionCatalog

To unify them, we should move some logic from ExternalCatalog to 
SessionCatalog, renameTable is one of this.

while limit to the simple parameters in renameTable 
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
even if we move the defaultTablePath logic to SessionCatalog, we can not pass 
it to renameTable.

So we can merge the renameTable  to alterTable, and rename it in alterTable.


> merge renameTable to alterTable in ExternalCatalog
> --
>
> Key: SPARK-20013
> URL: https://issues.apache.org/jira/browse/SPARK-20013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently when we create / rename a managed table, we should get the 
> defaultTablePath for them in ExternalCatalog, then we have two 
> defaultTablePath logic in its two subclass HiveExternalCatalog and 
> InMemoryCatalog, additionally there is also a defaultTablePath in 
> SessionCatalog, so till now we have three defaultTablePath in three classes. 
> we'd better to unify them up to SessionCatalog
> To unify them, we should move some logic from ExternalCatalog to 
> SessionCatalog, renameTable is one of this.
> while limit to the simple parameters in renameTable 
> {code}
>   def renameTable(db: String, oldName: String, newName: String): Unit
> {code}
> even if we move the defaultTablePath logic to SessionCatalog, we can not pass 
> it to renameTable.
> So we can add a newTablePath parameter  for renameTable in ExternalCatalog 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19961) unify an exception error msg for dropdatabase

2017-03-15 Thread Song Jun (JIRA)
Song Jun created SPARK-19961:


 Summary: unify an exception error msg for dropdatabase
 Key: SPARK-19961
 URL: https://issues.apache.org/jira/browse/SPARK-19961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Unify the exception error message thrown by dropDatabase when the database still has some 
tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog

2017-03-18 Thread Song Jun (JIRA)
Song Jun created SPARK-20013:


 Summary: merge renameTable to alterTable in ExternalCatalog
 Key: SPARK-20013
 URL: https://issues.apache.org/jira/browse/SPARK-20013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


There are several reasons to merge renameTable into alterTable in ExternalCatalog:
1. In Hive, a table is renamed through alterTable.
2. Currently, when we create or rename a managed table, we have to compute the 
defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated 
in its two subclasses, HiveExternalCatalog and InMemoryCatalog. There is also a 
defaultTablePath in SessionCatalog, so the same logic currently lives in three 
classes. It would be better to unify it in SessionCatalog.

To unify it, we need to move some logic from ExternalCatalog up to 
SessionCatalog, and renameTable is one such case.

However, renameTable only takes simple parameters:
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
so even if we move the defaultTablePath logic to SessionCatalog, we cannot pass 
the new path to renameTable.

So we can merge renameTable into alterTable and perform the rename through 
alterTable.
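A sketch of the idea, shown as signatures only (assuming alterTable keeps its 
existing CatalogTable-based signature, which can carry both the new name and 
the new default location):

{code}
// current: the dedicated rename method cannot carry the new default table path
def renameTable(db: String, oldName: String, newName: String): Unit

// proposed (sketch only): drop renameTable and express the rename through
// alterTable, whose CatalogTable argument can hold the new identifier/location
def alterTable(tableDefinition: CatalogTable): Unit
{code}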



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM:
---

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in 
datasource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite in SPARK-19235 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is 
how we end up with the illegal 'jar:file:/xxx' path above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~
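For reference, a minimal reproduction sketch of the failing construction (the 
jar path below is illustrative):

{code}
import org.apache.hadoop.fs.Path

// new Path() parses "jar" as the scheme, and the remainder
// "file:/...!/test-data/cars.csv" is then rejected as a relative path
// inside an absolute URI
new Path("jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv")
// java.lang.IllegalArgumentException: java.net.URISyntaxException:
//   Relative path in absolute URI: ...
{code}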



was (Author: windpiger):
The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in 
datasource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is 
how we end up with the illegal 'jar:file:/xxx' path above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> 

[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM:
---

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in 
datasource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is 
how we end up with the illegal 'jar:file:/xxx' path above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~



was (Author: windpiger):
The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in 
datasource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is 
how we end up with the illegal 'jar:file:/xxx' path above.


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> 

[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun commented on SPARK-19990:
--

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in 
datasource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is 
how we end up with the illegal 'jar:file:/xxx' path above.


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> 

[jira] [Updated] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog

2017-03-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19945:
-
Summary: Add test suite for SessionCatalog with HiveExternalCatalog  (was: 
Add test case for SessionCatalog with HiveExternalCatalog)

> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite 
> for HiveExternalCatalog.
> In addition, some DDL functions are not suitable to test in 
> ExternalCatalogSuite, because their logic is not fully implemented in 
> ExternalCatalog; these DDL functions are fully implemented in SessionCatalog, 
> so it is better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog with HiveExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19945) Add test case for SessionCatalog with HiveExternalCatalog

2017-03-14 Thread Song Jun (JIRA)
Song Jun created SPARK-19945:


 Summary: Add test case for SessionCatalog with HiveExternalCatalog
 Key: SPARK-19945
 URL: https://issues.apache.org/jira/browse/SPARK-19945
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite 
for HiveExternalCatalog.
In addition, some DDL functions are not suitable to test in ExternalCatalogSuite, 
because their logic is not fully implemented in ExternalCatalog; these DDL 
functions are fully implemented in SessionCatalog, so it is better to test them 
in SessionCatalogSuite.

So we should add a test suite for SessionCatalog with HiveExternalCatalog.
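One possible shape for such a suite, sketched under the assumption that the 
existing tests can be parameterized by the ExternalCatalog implementation (class 
and method names below are illustrative, not the actual test code):

{code}
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.catalog.{ExternalCatalog, InMemoryCatalog, SessionCatalog}

// sketch only: share the SessionCatalog tests between catalog implementations
abstract class SessionCatalogSuiteBase extends SparkFunSuite {
  protected def newExternalCatalog(): ExternalCatalog

  test("create and drop database") {
    val catalog = new SessionCatalog(newExternalCatalog())
    // exercise the DDL logic that is only fully implemented in SessionCatalog
  }
}

class InMemorySessionCatalogSuite extends SessionCatalogSuiteBase {
  override protected def newExternalCatalog(): ExternalCatalog = new InMemoryCatalog
}

// a Hive-backed variant would override newExternalCatalog() with a
// HiveExternalCatalog created by the hive test utilities
{code}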



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Song Jun (JIRA)
Song Jun created SPARK-19917:


 Summary: qualified partition location stored in catalog
 Key: SPARK-19917
 URL: https://issues.apache.org/jira/browse/SPARK-19917
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


The partition path should be qualified before it is stored in the catalog. 
There are several scenarios:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
   qualified: file:/path/x
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
   qualified: file:/tablelocation/x
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
   qualified: file:/path/x
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
   qualified: file:/tablelocation/x

Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde tables 
produces the expected qualified path; the other scenarios should be made 
consistent with it, as in the sketch below.
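A minimal sketch of the expected qualification, assuming a local filesystem and 
a table located at /tablelocation (names here are illustrative):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
val tableLocation = new Path("/tablelocation")

// resolve a user-supplied location against the table location, then qualify it
def qualify(location: String): Path = {
  val raw = new Path(location)
  val base = if (raw.isAbsolute) raw else new Path(tableLocation, location)
  base.getFileSystem(hadoopConf).makeQualified(base)
}

qualify("/path/x")  // file:/path/x
qualify("x")        // file:/tablelocation/x
{code}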





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code

2017-03-08 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19864:
-
Summary: add makeQualifiedPath in SQLTestUtils to optimize some code  (was: 
add makeQualifiedPath in CatalogUtils to optimize some code)

> add makeQualifiedPath in SQLTestUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> Currently there are many places that make a path qualified; it would be 
> better to provide a single function for this, so the code becomes simpler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19869) move table related ddl from ddl.scala to tables.scala

2017-03-08 Thread Song Jun (JIRA)
Song Jun created SPARK-19869:


 Summary: move table related ddl from ddl.scala to tables.scala
 Key: SPARK-19869
 URL: https://issues.apache.org/jira/browse/SPARK-19869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


move table related ddl from ddl.scala to tables.scala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19869) move table related ddl from ddl.scala to tables.scala

2017-03-08 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19869:
-
Issue Type: Improvement  (was: Bug)

> move table related ddl from ddl.scala to tables.scala
> -
>
> Key: SPARK-19869
> URL: https://issues.apache.org/jira/browse/SPARK-19869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> move table related ddl from ddl.scala to tables.scala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19845:


 Summary: failed to uncache datasource table after the table 
location altered
 Key: SPARK-19845
 URL: https://issues.apache.org/jira/browse/SPARK-19845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, if we first cache a datasource table, then alter the table location, 
and then drop the table, uncaching the table fails in DropTableCommand: because 
the location has changed, sameResult for two InMemoryFileIndex instances with 
different locations returns false, so the table's key cannot be found in the 
cache.
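A hypothetical reproduction (assuming spark is a SparkSession and the new 
location exists):

{code}
spark.sql("CREATE TABLE t(a INT) USING parquet")
spark.sql("CACHE TABLE t")
spark.sql("ALTER TABLE t SET LOCATION '/tmp/t_new_location'")
// DropTableCommand tries to uncache t, but the relation now resolves to an
// InMemoryFileIndex over the new location; sameResult against the cached plan
// is false, so the cached entry is never found
spark.sql("DROP TABLE t")
{code}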



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun commented on SPARK-19845:
--

Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. 
Because the location has changed, it becomes more complex to uncache the table 
and to recache the other tables that reference it.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table 
> location, and then drop the table, uncaching the table fails in 
> DropTableCommand: because the location has changed, sameResult for two 
> InMemoryFileIndex instances with different locations returns false, so the 
> table's key cannot be found in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM:
--

Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. 
Because the location has changed, it becomes more complex to uncache the table 
and to recache the other tables that reference it.

I will dig into it more.


was (Author: windpiger):
Yes, that is what https://issues.apache.org/jira/browse/SPARK-19784 is doing. 
Because the location has changed, it becomes more complex to uncache the table 
and to recache the other tables that reference it.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table 
> location, and then drop the table, uncaching the table fails in 
> DropTableCommand: because the location has changed, sameResult for two 
> InMemoryFileIndex instances with different locations returns false, so the 
> table's key cannot be found in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test

2017-03-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899292#comment-15899292
 ] 

Song Jun commented on SPARK-19836:
--

I have done something similar in https://github.com/apache/spark/pull/16803

> Customizable remote repository url for hive versions unit test
> --
>
> Key: SPARK-19836
> URL: https://issues.apache.org/jira/browse/SPARK-19836
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Elek, Marton
>  Labels: ivy, unittest
>
> When the VersionSuite test runs from sql/hive, it downloads different hive 
> versions.
> Unfortunately the IsolatedClientClassloader (which is used by the 
> VersionSuite) uses hardcoded, fixed repositories:
> {code}
> val classpath = quietly {
>   SparkSubmitUtils.resolveMavenCoordinates(
> hiveArtifacts.mkString(","),
> SparkSubmitUtils.buildIvySettings(
>   Some("http://www.datanucleus.org/downloads/maven2"),
>   ivyPath),
> exclusions = version.exclusions)
> }
> {code}
> The problem is with the hard-coded repositories:
>  1. it's hard to run unit tests in an environment where only one internal 
> maven repository is available (and central/datanucleus is not)
>  2. it's impossible to run unit tests against custom built hive/hadoop 
> artifacts (which are not available from the central repository)
> VersionSuite has already a specific SPARK_VERSIONS_SUITE_IVY_PATH environment 
> variable to define a custom local repository as ivy cache.
> I suggest to add an additional environment variable 
> (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to the HiveClientBuilder.scala), to 
> make it possible adding new remote repositories for testing the different 
> hive versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Song Jun (JIRA)
Song Jun created SPARK-19832:


 Summary: DynamicPartitionWriteTask should escape the partition 
name 
 Key: SPARK-19832
 URL: https://issues.apache.org/jira/browse/SPARK-19832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, when DynamicPartitionWriteTask builds the partitionPath of a 
partition, it only escapes the partition value, not the partition name.

This causes problems for special partition names, for example:
1) If the partition name contains '%' etc., two partition paths are created in 
the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like 
'/path/a%b=1'. The inserted data is stored under the unescaped path, while SHOW 
PARTITIONS returns 'a%25b=1', i.e. the escaped name, so the two are not 
consistent. The data should be stored under the escaped path in the filesystem; 
Hive 2.0.0 behaves the same way. (See the sketch after the stack trace below.)

2) If the partition name contains ':', new Path("/path", "a:b") throws an 
exception, because a colon in a relative path is illegal:

{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.hadoop.fs.Path.<init>(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
{code}
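For the first problem above, a sketch of the expected directory name once both 
the partition column name and the value are escaped (the escaping helper below 
is illustrative, not the actual Spark function):

{code}
// illustrative percent-escaping of characters that are unsafe in a path segment
def escape(s: String): String =
  s.flatMap { c =>
    if ("%:=/\\".contains(c)) f"%%${c.toInt}%02X" else c.toString
  }

// escaping both the name and the value gives a single, consistent directory
escape("a%b") + "=" + escape("1")   // "a%25b=1"
{code}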






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19833) remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exist

2017-03-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19833:


 Summary: remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always 
return empty when the path does not exist
 Key: SPARK-19833
 URL: https://issues.apache.org/jira/browse/SPARK-19833
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


SPARK-5068 introduced the SQLConf spark.sql.hive.verifyPartitionPath; when it is 
set to true, it prevents tasks from failing when a partition location does not 
exist in the filesystem.

This situation should always return an empty result rather than fail the task, 
so we remove this conf.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Song Jun (JIRA)
Song Jun created SPARK-19864:


 Summary: add makeQualifiedPath in CatalogUtils to optimize some 
code
 Key: SPARK-19864
 URL: https://issues.apache.org/jira/browse/SPARK-19864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently there are many places that make a path qualified; it would be better 
to provide a single function for this, so the code becomes simpler.
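A minimal sketch of such a helper, assuming the Hadoop FileSystem API (the name 
matches the JIRA title, but the body here is illustrative):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def makeQualifiedPath(path: String, hadoopConf: Configuration): Path = {
  // resolve the FileSystem for the path and qualify it, e.g.
  // "/tmp/tbl" -> "file:/tmp/tbl" on a local filesystem
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(hadoopConf)
  fs.makeQualified(hadoopPath)
}
{code}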



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


