[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema
Song Jun created SPARK-19484: Summary: continue work to create a table with an empty schema Key: SPARK-19484 URL: https://issues.apache.org/jira/browse/SPARK-19484 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor After SPARK-19279, we can no longer create a Hive table with an empty schema, so we should tighten up the condition when creating a Hive table in https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 That is, if a CatalogTable t has an empty schema and there is no `spark.sql.schema.numParts` property (or its value is 0), we should not add a default `col` schema; if we do, a table with an empty schema will be created, which is not what we expect. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
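A minimal sketch of the kind of tightened check this asks for; the helper name and the use of `spark.sql.schema.numParts` as the property key simply follow the description above, so treat it as an illustration rather than the actual patch:
{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical helper: refuse to fall back to a default `col` schema when the
// CatalogTable has no columns and no serialized schema in its properties.
def verifyNonEmptySchema(table: CatalogTable): Unit = {
  val numSchemaParts =
    table.properties.get("spark.sql.schema.numParts").map(_.toInt).getOrElse(0)
  if (table.schema.isEmpty && numSchemaParts == 0) {
    throw new IllegalArgumentException(
      s"Cannot create table ${table.identifier}: its schema is empty")
  }
}
{code}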
[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855677#comment-15855677 ] Song Jun commented on SPARK-19477: -- thanks, I got it~ > [SQL] Datasets created from a Dataframe with extra columns retain the extra > columns > --- > > Key: SPARK-19477 > URL: https://issues.apache.org/jira/browse/SPARK-19477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Drake > > In 1.6, when you created a Dataset from a Dataframe that had extra columns, > the columns not in the case class were dropped from the Dataset. > For example in 1.6, the column c4 is gone: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import sqlContext.implicits._ > import sqlContext.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: > string] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] > scala> ds.show > +---+---+---+ > | f1| f2| f3| > +---+---+---+ > | a| b| c| > | d| e| f| > | h| i| j| > {code} > This seems to have changed in Spark 2.0 and also 2.1: > Spark 2.1.0: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import spark.implicits._ > import spark.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more > fields] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more > fields] > scala> ds.show > +---+---+---+---+ > | f1| f2| f3| c4| > +---+---+---+---+ > | a| b| c| x| > | d| e| f| y| > | h| i| j| z| > +---+---+---+---+ > scala> import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Encoders > scala> val fEncoder = Encoders.product[F] > fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: > string, f3[0]: string] > scala> fEncoder.schema == ds.schema > res2: Boolean = false > scala> ds.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true), StructField(c4,StringType,true)) > scala> fEncoder.schema > res4: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
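As a side note, a hedged workaround (assuming a spark-shell session where `spark` and `spark.implicits._` are available) is to select only the encoder's columns before converting, which restores the 1.6-style behavior:
{code}
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.col
import spark.implicits._

case class F(f1: String, f2: String, f3: String)

val df = Seq(("a", "b", "c", "x"), ("d", "e", "f", "y")).toDF("f1", "f2", "f3", "c4")

// Keep only the columns the encoder knows about, then convert;
// the resulting Dataset's schema now matches Encoders.product[F].schema.
val fEncoder = Encoders.product[F]
val ds = df.select(fEncoder.schema.fieldNames.map(col): _*).as[F]
ds.show()   // f1, f2, f3 only; c4 is dropped
{code}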
[jira] [Created] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog
Song Jun created SPARK-19491: Summary: add a config for tableRelation cache size in SessionCatalog Key: SPARK-19491 URL: https://issues.apache.org/jira/browse/SPARK-19491 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently the table relation cache size is hardcoded to 1000; it would be better to add a config to set its size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
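A rough sketch of what such a config entry could look like inside SQLConf; the key name, default, and builder usage are assumptions for illustration, not the merged change:
{code}
// Hypothetical SQLConf entry; 1000 mirrors the currently hard-coded value.
val TABLE_RELATION_CACHE_SIZE =
  buildConf("spark.sql.tableRelationCacheSize")
    .doc("Maximum number of entries kept in the SessionCatalog table relation cache.")
    .intConf
    .checkValue(_ >= 0, "The relation cache size must not be negative.")
    .createWithDefault(1000)
{code}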
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856205#comment-15856205 ] Song Jun commented on SPARK-19496: -- I am working on this~ > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19463) refresh the table cache after InsertIntoHadoopFsRelation
Song Jun created SPARK-19463: Summary: refresh the table cache after InsertIntoHadoopFsRelation Key: SPARK-19463 URL: https://issues.apache.org/jira/browse/SPARK-19463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we first cache a DataSource table and then insert some data into it, the data in the cache should be refreshed after the insert command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
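A minimal reproduction sketch of the intended behavior (table name and values are arbitrary; assumes a spark-shell-like session):
{code}
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.sql("CACHE TABLE t")                  // the (empty) table is now cached
spark.sql("INSERT INTO t VALUES (1)")       // goes through InsertIntoHadoopFsRelation
// Expectation: the insert should refresh/invalidate the cache so the new row is visible
// here without having to call spark.catalog.refreshTable("t") manually.
spark.sql("SELECT * FROM t").show()
{code}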
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:11 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null that is mysql both return null when the date is invalidate or the formate is invalidate. and hive will transform the invalidate date to valid, e.g 2014-31-12 -> 31/12 = 2 -> 2014+2=2016 , 31 - 12*2=7 -> 2016-07-12 was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null mysql both return null when the date is invalidate or the formate is invalidate > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
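The roll-over described above matches Java's lenient calendar arithmetic; a small illustration (not Hive's actual code path) reproduces the same 2014-31-12 -> 2016-07-12 result:
{code}
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-MM-dd")   // lenient by default
// month 31 overflows: January 2014 + 30 months = 2 years + 6 months = July 2016, day 12
println(fmt.format(fmt.parse("2014-31-12")))   // prints 2016-07-12
{code}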
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857570#comment-15857570 ] Song Jun commented on SPARK-19496: -- [~hyukjin.kwon] > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:18 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also returns null; that is, MySQL returns null both when the date is invalid and when the format is invalid. Hive, in contrast, will transform the invalid date into a valid one, e.g. 2014-31-12 -> 31/12 = 2 -> 2014+2=2016 , 31 - 12*2=7 -> 2016-07-12. Currently Spark can handle a wrong format / wrong date when to_date has the format parameter (like Hive's transform); what about making to_date without the format parameter follow the same behavior, that is, return a transformed date instead of null? was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null that is mysql both return null when the date is invalidate or the formate is invalidate. and hive will transform the invalidate date to valid, e.g 2014-31-12 -> 31/12 = 2 -> 2014+2=2016 , 31 - 12*2=7 -> 2016-07-12 > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', 'yyyy-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541 ] Song Jun commented on SPARK-19496: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541 ] Song Jun edited comment on SPARK-19496 at 2/8/17 7:09 AM: -- mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null mysql both return null when the date is invalidate or the formate is invalidate was (Author: windpiger): mysql: select str_to_date('2014-12-31','%Y-%d-%m') also return null > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, if we run > {code} > SELECT to_date('2015-07-22', '-dd-MM') > {code} > will result to `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > will return null. > this behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19458) loading hive jars from the local repo which has already downloaded
Song Jun created SPARK-19458: Summary: loading hive jars from the local repo which has already downloaded Key: SPARK-19458 URL: https://issues.apache.org/jira/browse/SPARK-19458 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Currently, when we create a new HiveClient for a specific metastore version and `spark.sql.hive.metastore.jars` is set to `maven`, Spark will download the hive jars from the remote repo (http://www.datanucleus.org/downloads/maven2). We should allow the user to load hive jars from a local repo to which they have already been downloaded. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
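For context, a hedged example of the configuration that triggers the remote download today, and the kind of local alternative being asked for (the local path is purely illustrative):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven")   // today: resolves jars from the remote repo
  .getOrCreate()

// Existing alternative / desired direction: point at jars already present on disk,
// e.g. a previously downloaded local repo (the path below is only an illustration):
//   .config("spark.sql.hive.metastore.jars", "/path/to/local/hive-metastore-jars/*")
{code}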
[jira] [Commented] (SPARK-19430) Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
[ https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853466#comment-15853466 ] Song Jun commented on SPARK-19430: -- I think this is not a bug. If you want to access the hive table, you can directly use ` spark.table("orc_varchar_test").show ` > Cannot read external tables with VARCHAR columns if they're backed by ORC > files written by Hive 1.2.1 > - > > Key: SPARK-19430 > URL: https://issues.apache.org/jira/browse/SPARK-19430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Sameer Agarwal > > Spark throws an exception when trying to read external tables with VARCHAR > columns if they're backed by ORC files that were written by Hive 1.2.1 (and > possibly other versions of hive). > Steps to reproduce (credits to [~lian cheng]): > # Write an ORC table using Hive 1.2.1 with >{noformat} > CREATE TABLE orc_varchar_test STORED AS ORC > AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat} > # Get the raw path of the written ORC file > # Create an external table pointing to this file and read the table using > Spark > {noformat} > val path = "/tmp/orc_varchar_test" > sql(s"create external table if not exists test (c0 varchar(10)) stored as orc > location '$path'") > spark.table("test").show(){noformat} > The problem here is that the metadata in the ORC file written by Hive is > different from that written by Spark. We can inspect the ORC file written > above: > {noformat} > $ hive --orcfiledump > file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0 > Structure for > file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0 > File Version: 0.12 with HIVE_8732 > Rows: 1 > Compression: ZLIB > Compression size: 262144 > Type: struct<_col0:varchar(10)> > ... > {noformat} > On the other hand, if you create an ORC table using the same DDL and inspect > the written ORC file, you'll see: > {noformat} > ... > Type: struct > ... > {noformat} > Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set > to {{false}}, which is the default case.
> I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of > the following error: > {code} > java.lang.ClassCastException: > org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to > org.apache.hadoop.io.Text > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41) > at > org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
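Hedged workarounds along the lines of the comment above (behavior can vary across Spark versions, so treat these as things to try rather than a verified fix):
{code}
// 1. Query the Hive-created table directly instead of an external table over the raw file:
spark.table("orc_varchar_test").show()

// 2. The report notes all tests used spark.sql.hive.convertMetastoreOrc=false (the default);
//    enabling it routes reads through Spark's native ORC reader, which may avoid the
//    HiveVarcharWritable cast error (set it before the table is first resolved).
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.table("test").show()
{code}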
[jira] [Commented] (SPARK-19447) Fix input metrics for range operator
[ https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853462#comment-15853462 ] Song Jun commented on SPARK-19447: -- For spark.range(1,100).show there is some information in the SQL UI like: ` Range number of output rows: 99 ` I didn't see anything like `0 rows`; maybe I'm not looking in the right place. Could you help describe it more clearly? Thanks! > Fix input metrics for range operator > > > Key: SPARK-19447 > URL: https://issues.apache.org/jira/browse/SPARK-19447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Reynold Xin > > Range operator currently does not output any input metrics, and as a result > in the SQL UI the number of rows shown is always 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation
Song Jun created SPARK-19448: Summary: unify some duplication function in MetaStoreRelation Key: SPARK-19448 URL: https://issues.apache.org/jira/browse/SPARK-19448 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor 1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's toHiveTable 2. MetaStoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's toHiveColumn 3. address another TODO: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/21/17 11:09 AM: I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) was (Author: windpiger): I found it than `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun commented on SPARK-19257: -- I found that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for `CatalogTablePartition`, because a table partition location can contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a legal URI (unencoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
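A small illustration of the mismatch described above: the same partition location string is a perfectly valid Hadoop Path but not a legal java.net.URI because of the unencoded space:
{code}
import java.net.URI
import org.apache.hadoop.fs.Path

val loc = "/path/2014-01-01 00%3A00%3A00"

val asPath = new Path(loc)   // fine: Path accepts the raw string
val asUri  = new URI(loc)    // throws java.net.URISyntaxException: Illegal character in path
{code}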
[jira] [Created] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case
Song Jun created SPARK-19359: Summary: partition path created by Hive should be deleted after rename a partition with upper-case Key: SPARK-19359 URL: https://issues.apache.org/jira/browse/SPARK-19359 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun Priority: Minor Hive metastore is not case preserving and keeps partition columns with lower-case names. If SparkSQL creates a table with upper-case partition names using HiveExternalCatalog, then when we rename a partition it first calls the HiveClient's renamePartition, which creates a new lower-case partition path, and then SparkSQL renames the lower-case path to the upper-case one. If the renamed partition has more than one level of partitioning, e.g. A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and SparkSQL then renames it to A=1/B=2, but the a=1 directory still exists in the filesystem; we should also delete it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
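A hedged reproduction sketch of the leftover lower-case directory (table and partition values are illustrative; assumes Hive support is enabled):
{code}
spark.sql("CREATE TABLE t (c INT) PARTITIONED BY (A INT, B INT) STORED AS PARQUET")
spark.sql("ALTER TABLE t ADD PARTITION (A=1, B=2)")
spark.sql("ALTER TABLE t PARTITION (A=1, B=2) RENAME TO PARTITION (A=3, B=4)")
// Expected layout under the table location: only A=3/B=4.
// Observed: Hive's renamePartition first creates a=3/b=4; Spark then renames it to A=3/B=4,
// but the intermediate lower-case a=3 directory is left behind and should be deleted too.
{code}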
[jira] [Updated] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case
[ https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19359: - Issue Type: Improvement (was: Bug) > partition path created by Hive should be deleted after rename a partition > with upper-case > - > > Key: SPARK-19359 > URL: https://issues.apache.org/jira/browse/SPARK-19359 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Song Jun >Priority: Minor > > Hive metastore is not case preserving and keep partition columns with lower > case names. > If SparkSQL create a table with upper-case partion name use > HiveExternalCatalog, when we rename partition, it first call the HiveClient > to renamePartition, which will create a new lower case partition path, then > SparkSql rename the lower case path to the upper-case. > while if the renamed partition contains more than one depth partition ,e.g. > A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to > A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters
[ https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837407#comment-15837407 ] Song Jun commented on SPARK-19340: -- the reason is that spark sql treat the test{00-1}.txt as a globpath. we can not put a file name like text{00-1}.txt to hdfs, it will throw an exception. I think this is not a bug > Opening a file in CSV format will result in an exception if the filename > contains special characters > > > Key: SPARK-19340 > URL: https://issues.apache.org/jira/browse/SPARK-19340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0 >Reporter: Reza Safi >Priority: Minor > > If you want to open a file that its name is like {noformat} "*{*}*.*" > {noformat} or {noformat} "*[*]*.*" {noformat} using CSV format, you will get > the "org.apache.spark.sql.AnalysisException: Path does not exist" whether the > file is a local file or on hdfs. > This bug can be reproduced on master and all other Spark 2 branches. > To reproduce: > # Create a file like "test{00-1}.txt" on a local directory (like in > /Users/reza/test/test{00-1}.txt) > # Run spark-shell > # Execute this command: > {noformat} > val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt") > {noformat} > You will see the following stack trace: > {noformat} > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/Users/reza/test/test\{00-01\}.txt; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360) > ... 
48 elided > {noformat} > If you put the file on hadoop (like on /user/root) when you try to run the > following: > {noformat} > val df=spark.read.option("header", false).csv("/user/root/*.txt") > {noformat} > > You will get the following exception: > {noformat} > org.apache.hadoop.mapred.InvalidInputException: Input Pattern > hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files > at > org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:41 AM: --- I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? [~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:00 AM: --- I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 6:42 AM: --- I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogPartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogPartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19332) table's location should check if a URI is legal
[ https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19332: - Description: SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's locationUri to `URI`, while it has some problem: 1.`CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is ok for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contains a not encode whitespace, so if a partition location contains this not encode whitespace, and it will throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a partition location which has whitespace so if we change the type to URI, it is bad for `CatalogTablePartition` and I found Hive has the same issue HIVE-6185 before hive 0.13 the location is URI, while after above PR, it change it to Path, and do some check when DDL. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 so I think ,we can do the URI check for the table's location , and it is not proper to change the type to URI. was: ~SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's locationUri to `URI`, while it has some problem: 1.`CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is ok for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contains a not encode whitespace, so if a partition location contains this not encode whitespace, and it will throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a partition location which has whitespace so if we change the type to URI, it is bad for `CatalogTablePartition` and I found Hive has the same issue ~HIVE-6185 before hive 0.13 the location is URI, while after above PR, it change it to Path, and do some check when DDL. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 so I think ,we can do the URI check for the table's location , and it is not proper to change the type to URI. > table's location should check if a URI is legal > --- > > Key: SPARK-19332 > URL: https://issues.apache.org/jira/browse/SPARK-19332 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's > locationUri to `URI`, while it has some problem: > 1.`CatalogTable` and `CatalogTablePartition` use the same class > `CatalogStorageFormat` > 2. the type URI is ok for `CatalogTable`, but it is not proper for > `CatalogTablePartition` > 3. the location of a table partition can contains a not encode whitespace, so > if a partition location contains this not encode whitespace, and it will > throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a > partition location which has whitespace > so if we change the type to URI, it is bad for `CatalogTablePartition` > and I found Hive has the same issue HIVE-6185 > before hive 0.13 the location is URI, while after above PR, it change it to > Path, and do some check when DDL. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 > so I think ,we can do the URI check for the table's location , and it is not > proper to change the type to URI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19332) table's location should check if a URI is legal
Song Jun created SPARK-19332: Summary: table's location should check if a URI is legal Key: SPARK-19332 URL: https://issues.apache.org/jira/browse/SPARK-19332 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun ~SPARK-19257 ‘s work is to change the type of `CatalogStorageFormat` 's locationUri to `URI`, but it has some problems: 1. `CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat` 2. the type URI is ok for `CatalogTable`, but it is not proper for `CatalogTablePartition` 3. the location of a table partition can contain an unencoded whitespace, so if a partition location contains such a whitespace, constructing a URI will throw an exception. For example `/path/2014-01-01 00%3A00%3A00` is a partition location which has whitespace. So if we change the type to URI, it is bad for `CatalogTablePartition`, and I found that Hive has the same issue ~HIVE-6185 : before Hive 0.13 the location was a URI; after the above PR, it was changed to Path, with some checks done at DDL time. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 So I think we can do the URI check for the table's location, and it is not proper to change the type to URI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
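A rough sketch of the proposed DDL-time check, i.e. validating the location string instead of changing the field type; the helper name and exact rule are assumptions, loosely modeled on Hive's check:
{code}
import java.net.{URI, URISyntaxException}

// Hypothetical validation helper for `ALTER TABLE ... SET LOCATION` /
// `CREATE TABLE ... OPTIONS (path ...)`:
def validateLocation(location: String): Unit = {
  try {
    new URI(location)   // throws if the string is not a legal URI
  } catch {
    case e: URISyntaxException =>
      throw new IllegalArgumentException(s"Illegal table location: $location", e)
  }
}
{code}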
[jira] [Created] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception
Song Jun created SPARK-19329: Summary: after alter a datasource table's location to a not exist location and then insert data throw Exception Key: SPARK-19329 URL: https://issues.apache.org/jira/browse/SPARK-19329 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun spark.sql("create table t(a string, b int) using parquet") spark.sql(s"alter table t set location '$notexistedlocation'") spark.sql("insert into table t select 'c', 1") this will throw an exception: com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: $notexistedlocation; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935 ] Song Jun edited comment on SPARK-19257 at 1/22/17 11:22 AM: I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 HIVE-0.12 the table's location method's paramenter Type is URI, after that version change to Path https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L501 this is the same issue in hive: https://issues.apache.org/jira/browse/HIVE-6185 [~cloud_fan] [~smilegator] was (Author: windpiger): I found it that `CatalogTablePartition` and `CatalogTable` use the same class `CatalogTableStorageFormat`. if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, this will failed in `CatalogTablePartition` because table partition location can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a legal URI (no encoded whitespace) a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition and CatalogTable? b)or we just check if the ` locationUri: String` is illegal when make a DDL command, such as ` alter table set location xx` or `create table t options (path xx) ` Hive's implement is like b) : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732 [~cloud_fan] [~smilegator] > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19667) Create table with HiveEnabled in default database use warehouse path instead of the location of default database
Song Jun created SPARK-19667: Summary: Create table with HiveEnabled in default database use warehouse path instead of the location of default database Key: SPARK-19667 URL: https://issues.apache.org/jira/browse/SPARK-19667 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Song Jun Currently, when we create a managed table with Hive enabled in the default database, Spark will use the location of the default database as the table's location; this is ok with a non-shared metastore. But consider a shared metastore between different clusters, for example: 1) there is a Hive metastore in Cluster-A, and the metastore uses a remote MySQL as its db; a default database is created in the metastore, so the location of the default database is a path in Cluster-A 2) then we create another Cluster-B, and Cluster-B also uses the same remote MySQL as its metastore's db, so the default database conf in Cluster-B is downloaded from MySQL, and its location is the path in Cluster-A 3) then we create a table in Cluster-B in the default database, and it throws an exception: UnknownHost Cluster-A In Hive 2.0.0, it is allowed to create a table in a default database which is shared between clusters; this is not allowed for other databases, just for default. As a Spark user, we would like the same behavior as Hive, so that we can create tables in the default database. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
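A sketch of the scenario on Cluster-B; the host names and exception wording are illustrative of the report above, not verified output:
{code}
// On Cluster-B, which shares Cluster-A's MySQL-backed metastore:
spark.sql("CREATE TABLE t (a INT) USING parquet")
// Fails with something like an UnknownHostException for Cluster-A's filesystem, because the
// default database's location stored in the shared metastore still points at Cluster-A.
// Proposal: for the default database, derive the table location from the current cluster's
// warehouse path (spark.sql.warehouse.dir) instead of the database location.
{code}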
[jira] [Created] (SPARK-19723) create table for data source tables should work with an non-existent location
Song Jun created SPARK-19723: Summary: create table for data source tables should work with an non-existent location Key: SPARK-19723 URL: https://issues.apache.org/jira/browse/SPARK-19723 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun This JIRA is a follow-up work after SPARK-19583. As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a data source table with a non-existent location should work: {code} CREATE TABLE ... (PARTITIONED BY ...) LOCATION path {code} Currently it throws an exception that the path does not exist. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19724) create table for hive tables with an existed default location should throw an exception
Song Jun created SPARK-19724: Summary: create table for hive tables with an existed default location should throw an exception Key: SPARK-19724 URL: https://issues.apache.org/jira/browse/SPARK-19724 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun This JIRA is a follow-up work after SPARK-19583. As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a Hive table whose default location already exists should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... {code} Currently it succeeds in this situation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Summary: create managed table for hive tables with an existed default location should throw an exception (was: create table for hive tables with an existed default location should throw an exception) > create managed table for hive tables with an existed default location should > throw an exception > --- > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is a follow up work after SPARK-19583 > As we discussed in that [PR|https://github.com/apache/spark/pull/16938] > The following DDL for hive table with an existed default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > {code} > Currently it will success for this situation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Description: This JIRA is a follow-up work after [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583). As we discussed in that [PR](https://github.com/apache/spark/pull/16938), the following DDL for a managed table whose default location already exists should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... CREATE TABLE ... (PARTITIONED BY ...) {code} Currently there are some situations which are not consistent with the above logic: 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default location situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog) 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... situation: a hive table succeeds with an existing default location was: This JIRA is a follow up work after SPARK-19583 As we discussed in that [PR|https://github.com/apache/spark/pull/16938] The following DDL for hive table with an existed default location should throw an exception: {code} CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... {code} Currently it will success for this situation > create managed table for hive tables with an existed default location should > throw an exception > --- > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is a follow-up work after > [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583). > As we discussed in that [PR](https://github.com/apache/spark/pull/16938), > the following DDL for a managed table whose default location already exists should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > CREATE TABLE ... (PARTITIONED BY ...) > {code} > Currently there are some situations which are not consistent with the above logic: > 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default > location > situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog) > 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > situation: a hive table succeeds with an existing default location -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19724) create a managed table with an existed default location should throw an exception
[ https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19724: - Summary: create a managed table with an existed default location should throw an exception (was: create managed table for hive tables with an existed default location should throw an exception) > create a managed table with an existed default location should throw an > exception > - > > Key: SPARK-19724 > URL: https://issues.apache.org/jira/browse/SPARK-19724 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > This JIRA is a follow up work after > [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583) > As we discussed in that [PR](https://github.com/apache/spark/pull/16938) > The following DDL for a managed table with an existed default location should > throw an exception: > {code} > CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > CREATE TABLE ... (PARTITIONED BY ...) > {code} > Currently there are some situations which are not consist with above logic: > 1. CREATE TABLE ... (PARTITIONED BY ...) succeed with an existed default > location > situation: for both hive/datasource(with HiveExternalCatalog/InMemoryCatalog) > 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... > situation: hive table succeed with an existed default location -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
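For context, a minimal reproduction sketch of the behaviour described above, written against a Hive-enabled SparkSession named `spark`; the table name and the use of the default warehouse directory are illustrative assumptions, not part of the original report:

{code}
// Sketch only: pre-create the managed table's default location, then create the table.
import org.apache.hadoop.fs.Path

val warehouse = spark.conf.get("spark.sql.warehouse.dir")
val defaultLoc = new Path(warehouse, "t_existing_loc")
val fs = defaultLoc.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.mkdirs(defaultLoc)  // the default location now already exists

// Both forms currently succeed; per this JIRA they should throw an exception
// because the managed table's default location already exists.
spark.sql("CREATE TABLE t_existing_loc(a INT) USING parquet")
// spark.sql("CREATE TABLE t_existing_loc USING parquet AS SELECT 1 AS a")
{code}

Under the proposed change, both statements would fail with an exception pointing at the pre-existing default location instead of silently reusing it.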
[jira] [Updated] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place
[ https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19664: - Description: In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in the logic, when use the value of 'spark.sql.warehouse.dir' to overwrite 'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it should put in 'sparkContext.hadoopConfiguration' and overwrite the original value of hadoopConf https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 was: In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in the logic, when use the value of 'spark.sql.warehouse.dir' to overwrite 'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it should put in 'sparkContext.hadoopConfiguration' https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 > put 'hive.metastore.warehouse.dir' in hadoopConf place > -- > > Key: SPARK-19664 > URL: https://issues.apache.org/jira/browse/SPARK-19664 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Song Jun >Priority: Minor > > In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in > the logic, when use the value of 'spark.sql.warehouse.dir' to overwrite > 'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it > should put in 'sparkContext.hadoopConfiguration' and overwrite the original > value of hadoopConf > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place
Song Jun created SPARK-19664: Summary: put 'hive.metastore.warehouse.dir' in hadoopConf place Key: SPARK-19664 URL: https://issues.apache.org/jira/browse/SPARK-19664 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Song Jun Priority: Minor In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in the logic, when use the value of 'spark.sql.warehouse.dir' to overwrite 'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it should put in 'sparkContext.hadoopConfiguration' https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
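A sketch of the direction this JIRA suggests (illustrative only, not the actual SharedState code): the resolved warehouse path would be written into the SparkContext's Hadoop configuration so that Hive components read the same value:

{code}
// Illustrative only; `spark` is an existing SparkSession.
val warehousePath = spark.conf.get("spark.sql.warehouse.dir")
// Overwrite the Hadoop-side value instead of only keeping it in sparkContext.conf.
spark.sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", warehousePath)
{code}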
[jira] [Closed] (SPARK-19484) continue work to create a table with an empty schema
[ https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19484. Resolution: Won't Fix this has been contained in https://github.com/apache/spark/pull/16787 > continue work to create a table with an empty schema > > > Key: SPARK-19484 > URL: https://issues.apache.org/jira/browse/SPARK-19484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > after SPARK-19279, we could not create a Hive table with an empty schema, > we should tighten up the condition when create a hive table in > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835 > That is if a CatalogTable t has an empty schema, and (there is no > `spark.sql.schema.numParts` or its value is 0), we should not add a default > `col` schema, if we did, a table with an empty schema will be created, that > is not we expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-19491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19491. Resolution: Duplicate duplicate of https://github.com/apache/spark/pull/16736 > add a config for tableRelation cache size in SessionCatalog > --- > > Key: SPARK-19491 > URL: https://issues.apache.org/jira/browse/SPARK-19491 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > currently the table relation cache size is hardcoded to 1000; it is better to > add a config to set its size. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
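A hedged sketch of what the proposal amounts to; the config key below is hypothetical, and this is not the actual SessionCatalog code, which builds its relation cache with Guava in a similar way:

{code}
import com.google.common.cache.{Cache, CacheBuilder}

// Hypothetical config key; today the maximum size is the literal 1000.
val cacheSize = spark.conf.getOption("spark.sql.tableRelationCacheSize").map(_.toLong).getOrElse(1000L)

// A Guava cache whose maximum size comes from the config instead of a hardcoded constant.
val tableRelationCache: Cache[String, AnyRef] =
  CacheBuilder.newBuilder()
    .maximumSize(cacheSize)
    .build[String, AnyRef]()
{code}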
[jira] [Commented] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862768#comment-15862768 ] Song Jun commented on SPARK-19558: -- sparkSession.listenerManager.register is not enough? > Provide a config option to attach QueryExecutionListener to SparkSession > > > Key: SPARK-19558 > URL: https://issues.apache.org/jira/browse/SPARK-19558 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Salil Surendran > > Provide a configuration property(just like spark.extraListeners) to attach a > QueryExecutionListener to a SparkSession -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
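For reference, this is roughly what the existing programmatic registration looks like; the listener body is a trivial example, and the JIRA asks for an equivalent configuration-driven mechanism:

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val listener = new QueryExecutionListener {
  // Called after a successful action; durationNs is the execution time in nanoseconds.
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName succeeded in ${durationNs / 1e6} ms")
  // Called when the action fails.
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
}
spark.listenerManager.register(listener)
{code}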
[jira] [Commented] (SPARK-19583) CTAS for data source tables with an created location does not work
[ https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865111#comment-15865111 ] Song Jun commented on SPARK-19583: -- ok, I'd like to take this one, thanks a lot! > CTAS for data source tables with an created location does not work > -- > > Key: SPARK-19583 > URL: https://issues.apache.org/jira/browse/SPARK-19583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > spark.sql( > s""" > |CREATE TABLE t > |USING parquet > |PARTITIONED BY(a, b) > |LOCATION '$dir' > |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d >""".stripMargin) > {noformat} > Failed with the error message: > {noformat} > path > file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 > already exists.; > org.apache.spark.sql.AnalysisException: path > file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772 > already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
[ https://issues.apache.org/jira/browse/SPARK-19166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun closed SPARK-19166. Resolution: Not A Bug minor issue > change method name from > InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to > InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix > > > Key: SPARK-19166 > URL: https://issues.apache.org/jira/browse/SPARK-19166 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Song Jun >Priority: Minor > > InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files > that match a static prefix, such as a partition file path(/table/foo=1), or a > non-partition file path(/xxx/a.json), > while the method name deleteMatchingPartitions suggests that only > partition files will be deleted. This name is confusing. > It is better to rename the method. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548 ] Song Jun edited comment on SPARK-19598 at 2/15/17 9:46 AM: --- [~rxin] When I do this jira, I found it that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, alias){quote} While there are lots of *match case* codes for *UnresolvedRelation*, and in matched logic it will use the alias parameter of *UnresolvedRelation*, currently table with/without alias can processed in one *match case UnresolvedRelation* logic, after this change, we should process table with alias and without alias seperately in two *match case*: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? was (Author: windpiger): [~rxin] When I do this jira, I found it that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, aliase){quote} While there are lots of *match case* codes for *UnresolvedRelation*, and in matched logic it will use the alias parameter of *UnresolvedRelation*, currently table with/without alias can processed in one *match case UnresolvedRelation* logic, after this change, we should process table with alias and without alias seperately in two *match case*: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
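A toy model of the pattern-match concern raised in the comment, using stand-in case classes rather than the real catalyst nodes; it only illustrates why one match case becomes two once the alias moves out of the relation into a wrapper node:

{code}
sealed trait Plan
case class Relation(table: String) extends Plan            // stand-in for UnresolvedRelation without the alias field
case class Alias(name: String, child: Plan) extends Plan   // stand-in for SubqueryAlias

def tableWithAlias(p: Plan): Option[(String, Option[String])] = p match {
  case Relation(t)           => Some((t, None))
  case Alias(a, Relation(t)) => Some((t, Some(a)))          // the extra case every such rule would now need
  case _                     => None
}

// tableWithAlias(Alias("x", Relation("src"))) == Some(("src", Some("x")))
{code}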
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867398#comment-15867398 ] Song Jun commented on SPARK-19598: -- OK~ I'd like to do this. Thank you very much! > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548 ] Song Jun commented on SPARK-19598: -- [~rxin] When I do this jira, I found it that it is not proper to remove UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is: {quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote} to replace {quote}UnresolvedRelation(tableIdentifier, aliase){quote} While there are lots of *match case* codes for *UnresolvedRelation*, and in matched logic it will use the alias parameter of *UnresolvedRelation*, currently table with/without alias can processed in one *match case UnresolvedRelation* logic, after this change, we should process table with alias and without alias seperately in two *match case*: {quote} case u:UnresolvedRelation => func(u,None) case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias) {quote} such as: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626 Is this right? Or am I missing something? > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed
Song Jun created SPARK-19577: Summary: insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed Key: SPARK-19577 URL: https://issues.apache.org/jira/browse/SPARK-19577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we use InMemoryCatalog and insert into a partitioned datasource table whose partition location has been changed by `alter table t partition(a="xx") set location $newpath`, the insert operation succeeds and the data is written to $newpath, but if we then select that partition from the table, it does not return the values we inserted. The reason is that InMemoryFileIndex infers partitions from the table's rootPath and does not track the user-specified $newPath provided by the alter command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
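A reproduction sketch of the report, assuming a SparkSession `spark` backed by the InMemoryCatalog; the table name and the new partition path are hypothetical:

{code}
spark.sql("CREATE TABLE t(b INT, a STRING) USING parquet PARTITIONED BY (a)")
spark.sql("INSERT INTO t PARTITION (a = 'xx') SELECT 1")
spark.sql("ALTER TABLE t PARTITION (a = 'xx') SET LOCATION '/tmp/new_partition_path'")
spark.sql("INSERT INTO t PARTITION (a = 'xx') SELECT 5")
// The second insert writes under /tmp/new_partition_path, but that row is not visible below,
// because InMemoryFileIndex infers partitions from the table's root path only.
spark.sql("SELECT * FROM t WHERE a = 'xx'").show()
{code}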
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869623#comment-15869623 ] Song Jun commented on SPARK-19598: -- Thanks~ let me investigate more~ > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19575) Reading from or writing to a hive serde table with a non pre-existing location should succeed
Song Jun created SPARK-19575: Summary: Reading from or writing to a hive serde table with a non pre-existing location should succeed Key: SPARK-19575 URL: https://issues.apache.org/jira/browse/SPARK-19575 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun currently when we select from a hive serde table which has a non pre-existing location will throw an exception: ``` Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274 at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080) at org.apache.spark.rdd.RDD.count(RDD.scala:1157) at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:258) ``` this is a folllowup work from SPARK-19329 which has unify the action when we reading from or writing to a datasource table with a non pre-existing locaiton, so here we should also unify the hive serde tables -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
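A short reproduction sketch, assuming a Hive-enabled SparkSession; the table name and location are hypothetical, and the location does not exist on disk:

{code}
// Without a USING clause this creates a Hive serde table in a Hive-enabled session.
spark.sql("CREATE TABLE t_serde(a STRING) LOCATION 'file:/tmp/some_non_existing_dir'")
// Currently fails with InvalidInputException ("Input path does not exist"); after unifying
// with the datasource-table behaviour from SPARK-19329 it should return an empty result.
spark.sql("SELECT * FROM t_serde").show()
{code}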
[jira] [Commented] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed
[ https://issues.apache.org/jira/browse/SPARK-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863511#comment-15863511 ] Song Jun commented on SPARK-19577: -- I am working on this~ > insert into a partition datasource table with InMemoryCatalog after the > partition location alter by alter command failed > > > Key: SPARK-19577 > URL: https://issues.apache.org/jira/browse/SPARK-19577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > If we use InMemoryCatalog, then we insert into a partition datasource table, > which partition location has changed by `alter table t partition(a="xx") set > location $newpath`, the insert operation is ok, and the data can be insert > into $newpath, while if we then select partition from the table, it will not > return the value we inserted. > The reason is that the InMemoryFileIndex to inferPartition by the table's > rootPath, it does not track the user specific $newPath which provided by > alter command. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19284) append to an existing partitioned datasource table should have no CustomPartitionLocations
Song Jun created SPARK-19284: Summary: append to an existing partitioned datasource table should have no CustomPartitionLocations Key: SPARK-19284 URL: https://issues.apache.org/jira/browse/SPARK-19284 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun Priority: Minor when we append data to an existing partitioned datasource table, InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations currently returns the same location as the Hive default; it should return None. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19241) remove hive generated table properties if they are not useful in Spark
[ https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823654#comment-15823654 ] Song Jun commented on SPARK-19241: -- I am working on this~ > remove hive generated table properties if they are not useful in Spark > -- > > Key: SPARK-19241 > URL: https://issues.apache.org/jira/browse/SPARK-19241 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > When we save a table into hive metastore, hive will generate some table > properties automatically. e.g. transient_lastDdlTime, last_modified_by, > rawDataSize, etc. Some of them are useless in Spark SQL, we should remove > them. > It will be good if we can get the list of Hive-generated table properties via > Hive API, so that we don't need to hardcode them. > We can take a look at Hive code to see how it excludes these auto-generated > table properties when describe table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table
[ https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823657#comment-15823657 ] Song Jun commented on SPARK-19153: -- I am sorry, it is my fault, I forgot to comment > DataFrameWriter.saveAsTable should work with hive format to create > partitioned table > > > Key: SPARK-19153 > URL: https://issues.apache.org/jira/browse/SPARK-19153 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema
Song Jun created SPARK-19246: Summary: CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema Key: SPARK-19246 URL: https://issues.apache.org/jira/browse/SPARK-19246 Project: Spark Issue Type: Bug Components: SQL Reporter: Song Jun get CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, if not we should throw an exception -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Description: get CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, if not we should throw an exception and CataLogTable's partitionSchema should keep order with partitionColumnNames was:get CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, if not we should throw an exception > CataLogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, and keep > order with partitionColumnNames > -- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > get CataLogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, if not we > should throw an exception > and CataLogTable's partitionSchema should keep order with > partitionColumnNames -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Summary: CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames (was: CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema) > CataLogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, and keep > order with partitionColumnNames > -- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > get CataLogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, if not we > should throw an exception -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check order
[ https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19246: - Summary: CataLogTable's partitionSchema should check order (was: CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnNames ) > CataLogTable's partitionSchema should check order > --- > > Key: SPARK-19246 > URL: https://issues.apache.org/jira/browse/SPARK-19246 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun > > get CataLogTable's partitionSchema should check if each column name in > partitionColumnNames must match one and only one field in schema, if not we > should throw an exception > and CataLogTable's partitionSchema should keep order with > partitionColumnNames -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
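A sketch of the kind of validation being proposed, written against the public StructType API rather than the actual CatalogTable code; it checks that every partition column matches exactly one schema field and preserves the order of partitionColumnNames:

{code}
import org.apache.spark.sql.types.StructType

def partitionSchema(schema: StructType, partitionColumnNames: Seq[String]): StructType = {
  val partitionFields = partitionColumnNames.map { col =>
    // Each partition column must match exactly one field in the table schema.
    val matched = schema.fields.filter(_.name == col)
    require(matched.length == 1,
      s"Partition column '$col' must match exactly one field in the schema, found ${matched.length}")
    matched.head
  }
  // The result keeps the order of partitionColumnNames, not the order of the schema.
  StructType(partitionFields)
}
{code}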
[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String
[ https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825645#comment-15825645 ] Song Jun commented on SPARK-19257: -- I am working on this~ thanks > The type of CatalogStorageFormat.locationUri should be java.net.URI instead > of String > - > > Key: SPARK-19257 > URL: https://issues.apache.org/jira/browse/SPARK-19257 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > Currently we treat `CatalogStorageFormat.locationUri` as URI string and > always convert it to path by `new Path(new URI(locationUri))` > It will be safer if we can make the type of > `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19761) create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero
Song Jun created SPARK-19761: Summary: create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero Key: SPARK-19761 URL: https://issues.apache.org/jira/browse/SPARK-19761 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun if we create a InMemoryFileIndex with an empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero, it will throw an exception: {code} Positive number of slices required java.lang.IllegalArgumentException: Positive number of slices required at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119) at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(InMemoryFileIndex.scala:50) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186) at org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105) at org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) at org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19763) qualified external datasource table location stored in catalog
Song Jun created SPARK-19763: Summary: qualified external datasource table location stored in catalog Key: SPARK-19763 URL: https://issues.apache.org/jira/browse/SPARK-19763 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we create an external datasource table with a non-qualified location, we should qualify it before storing it in the catalog. {code} CREATE TABLE t(a string) USING parquet LOCATION '/path/xx' CREATE TABLE t1(a string, b string) USING parquet PARTITIONED BY(b) LOCATION '/path/xx' {code} When we get the table from the catalog, the location should be qualified, e.g. 'file:/path/xxx' -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
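A sketch of how a non-qualified location can be qualified before it is stored in the catalog; the path is hypothetical and this is not the actual catalog code:

{code}
import org.apache.hadoop.fs.Path

val raw = new Path("/path/xx")
val fs = raw.getFileSystem(spark.sparkContext.hadoopConfiguration)
val qualified = fs.makeQualified(raw)   // e.g. file:/path/xx on a local filesystem
println(qualified.toUri)
{code}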
[jira] [Created] (SPARK-19748) refresh for InMemoryFileIndex with FileStatusCache does not work correctly
Song Jun created SPARK-19748: Summary: refresh for InMemoryFileIndex with FileStatusCache does not work correctly Key: SPARK-19748 URL: https://issues.apache.org/jira/browse/SPARK-19748 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun If we refresh an InMemoryFileIndex that uses a FileStatusCache, it first uses the FileStatusCache to generate the cachedLeafFiles etc., and only then calls FileStatusCache.invalidateAll. The order of these two actions is wrong; as a result the refresh does not take effect. {code} override def refresh(): Unit = { refresh0() fileStatusCache.invalidateAll() } private def refresh0(): Unit = { val files = listLeafFiles(rootPaths) cachedLeafFiles = new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f) cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent) cachedPartitionSpec = null } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
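One possible fix, mirroring the snippet above with the two calls swapped (a sketch only, not the merged patch), so the FileStatusCache is cleared before the paths are listed again:

{code}
// Invalidate the FileStatusCache first, so refresh0() relists from the filesystem
// instead of repopulating the index from stale cached entries.
override def refresh(): Unit = {
  fileStatusCache.invalidateAll()
  refresh0()
}
{code}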
[jira] [Commented] (SPARK-19742) When using SparkSession to write a dataset to Hive the schema is ignored
[ https://issues.apache.org/jira/browse/SPARK-19742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885399#comment-15885399 ] Song Jun commented on SPARK-19742: -- this is expected, see the comment. {code} /** * Inserts the content of the `DataFrame` to the specified table. It requires that * the schema of the `DataFrame` is the same as the schema of the table. * * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based * resolution. For example: * * {{{ *scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1") *scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1") *scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1") *scala> sql("select * from t1").show *+---+---+ *| i| j| *+---+---+ *| 5| 6| *| 3| 4| *| 1| 2| *+---+---+ * }}} * * Because it inserts data to an existing table, format or options will be ignored. * * @since 1.4.0 */ def insertInto(tableName: String): Unit = { insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)) } {code} > When using SparkSession to write a dataset to Hive the schema is ignored > > > Key: SPARK-19742 > URL: https://issues.apache.org/jira/browse/SPARK-19742 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.1 > Environment: Running on Ubuntu with HDP 2.4. >Reporter: Navin Goel > > I am saving a Dataset that is created form reading a json and some selects > and filters into a hive table. The dataset.write().insertInto function does > not look at schema when writing to the table but instead writes in order to > the hive table. > The schemas for both the tables are same. > schema printed from spark of the dataset being written: > StructType(StructField(countrycode,StringType,true), > StructField(systemflag,StringType,true), > StructField(classcode,StringType,true), > StructField(classname,StringType,true), > StructField(rangestart,StringType,true), > StructField(rangeend,StringType,true), > StructField(tablename,StringType,true), > StructField(last_updated_date,TimestampType,true)) > Schema of the dataset after loading the same table from Hive: > StructType(StructField(systemflag,StringType,true), > StructField(RangeEnd,StringType,true), > StructField(classcode,StringType,true), > StructField(classname,StringType,true), > StructField(last_updated_date,TimestampType,true), > StructField(countrycode,StringType,true), > StructField(rangestart,StringType,true), > StructField(tablename,StringType,true)) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19784) refresh datasource table after alter the location
Song Jun created SPARK-19784: Summary: refresh datasource table after alter the location Key: SPARK-19784 URL: https://issues.apache.org/jira/browse/SPARK-19784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Currently, if we alter the location of a datasource table and then select from it, it still returns the data from the old location. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
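A reproduction sketch with hypothetical table name and paths; `spark.catalog.refreshTable` is the manual workaround, and the JIRA asks for the refresh to happen automatically after the ALTER:

{code}
spark.sql("CREATE TABLE t(a INT) USING parquet LOCATION '/tmp/old_path'")
spark.sql("INSERT INTO t SELECT 1")
spark.sql("ALTER TABLE t SET LOCATION '/tmp/new_path'")
spark.sql("SELECT * FROM t").show()     // still returns the data written to /tmp/old_path
spark.catalog.refreshTable("t")         // manual workaround
spark.sql("SELECT * FROM t").show()     // now reads from /tmp/new_path
{code}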
[jira] [Updated] (SPARK-18137) RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure
[ https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18137: - Description: when run a sql with distinct(on spark github master branch), it throw UnresolvedException. For example: run a test case on spark(branch master) with sql: SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) FROM src LIMIT 1 and it throw exception: ``` org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS DOUBLE), CAST(0.9BD AS DOUBLE), 1) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at
[jira] [Created] (SPARK-18137) RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure
Song Jun created SPARK-18137: Summary: RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure Key: SPARK-18137 URL: https://issues.apache.org/jira/browse/SPARK-18137 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Song Jun when run a sql with distinct(on master branch), it throw UnresolvedException. For example: run a test case on spark(branch master) with sql: SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) FROM src LIMIT 1 and it throw exception: ``` org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS DOUBLE), CAST(0.9BD AS DOUBLE), 1) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) at org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) 
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
[jira] [Updated] (SPARK-18137) RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck
[ https://issues.apache.org/jira/browse/SPARK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18137: - Summary: RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable TypeCheck (was: RewriteDistinctAggregates UnresolvedException because of UDAF TypeCheckFailure) > RewriteDistinctAggregates UnresolvedException when a UDAF has a foldable > TypeCheck > -- > > Key: SPARK-18137 > URL: https://issues.apache.org/jira/browse/SPARK-18137 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Song Jun > > when run a sql with distinct(on spark github master branch), it throw > UnresolvedException. > For example: > run a test case on spark(branch master) with sql: > SELECT percentile_approx(key, 0.9), count(distinct key),sum(distinct key) > FROM src LIMIT 1 > and it throw exception: > ``` > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > dataType on unresolved object, tree: 'percentile_approx(CAST(src.`key` AS > DOUBLE), CAST(0.9BD AS DOUBLE), 1) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:92) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$nullify(RewriteDistinctAggregates.scala:261) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$evalWithinGroup$1(RewriteDistinctAggregates.scala:136) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:187) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at >
[jira] [Commented] (SPARK-18298) HistoryServer use GMT time all time
[ https://issues.apache.org/jira/browse/SPARK-18298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646333#comment-15646333 ] Song Jun commented on SPARK-18298: -- [~WangTao] I test it,and the ui should show the user's local time which the 1.6.x did, so I think this is a bug, and I post a pull request. > HistoryServer use GMT time all time > --- > > Key: SPARK-18298 > URL: https://issues.apache.org/jira/browse/SPARK-18298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.0.1 > Environment: suse 11.3 with CST time >Reporter: Tao Wang > > When I started HistoryServer for reading event logs, the timestamp readed > will be parsed using local timezone like "CST"(confirmed via debug). > But the time related columns like "Started"/"Completed"/"Last Updated" in > History Server UI using "GMT" time, which is 8 hours earlier than "CST". > {quote} > App IDApp NameStarted Completed DurationSpark > User Last UpdatedEvent Log > local-1478225166651 Spark shell 2016-11-04 02:06:06 2016-11-07 > 01:33:30 71.5 h root2016-11-07 01:33:30 > {quote} > I've checked the REST api and found the result like: > {color:red} > [ { > "id" : "local-1478225166651", > "name" : "Spark shell", > "attempts" : [ { > "startTime" : "2016-11-04T02:06:06.020GMT", > "endTime" : "2016-11-07T01:33:30.265GMT", > "lastUpdated" : "2016-11-07T01:33:30.000GMT", > "duration" : 257244245, > "sparkUser" : "root", > "completed" : true, > "lastUpdatedEpoch" : 147848241, > "endTimeEpoch" : 1478482410265, > "startTimeEpoch" : 1478225166020 > } ] > }, { > "id" : "local-1478224925869", > "name" : "Spark Pi", > "attempts" : [ { > "startTime" : "2016-11-04T02:02:02.133GMT", > "endTime" : "2016-11-04T02:02:07.468GMT", > "lastUpdated" : "2016-11-04T02:02:07.000GMT", > "duration" : 5335, > "sparkUser" : "root", > "completed" : true, > ... > {color} > So maybe the change happened in transferring between server and browser? I > have no idea where to go from this point. > Hope guys can offer some help, or just fix it if it's easy? :) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun edited comment on SPARK-18055 at 11/8/16 5:04 AM: --- [~davies] I can't reproduce it on the master branch or on databricks(Spark 2.0.1-db1 (Scala 2.11)),my code: import scala.collection.mutable.ArrayBuffer case class MyData(id: String,arr: Seq\[String]) val myarr = ArrayBuffer\[MyData]() for(i <- 20 to 30){ val arr = ArrayBuffer\[String]() for(j <- 1 to 10) { arr += (i+j).toString } val mydata = new MyData(i.toString,arr) myarr += mydata } val rdd = spark.sparkContext.makeRDD(myarr) val ds = rdd.toDS ds.rdd.flatMap(_.arr) ds.flatMap(_.arr) there is no exception, it has fixed? or My code is wrong? I also test with the test-jar_2.11-1.0.jar in spark-shell: >spark-shell --jars test-jar_2.11-1.0.jar and there is no exception. was (Author: windpiger): [~davies] I can't reproduce it on the master branch or on databricks(Spark 2.0.1-db1 (Scala 2.11)),my code: import scala.collection.mutable.ArrayBuffer case class MyData(id: String,arr: Seq\[String]) val myarr = ArrayBuffer\[MyData]() for(i <- 20 to 30){ val arr = ArrayBuffer\[String]() for(j <- 1 to 10) { arr += (i+j).toString } val mydata = new MyData(i.toString,arr) myarr += mydata } val rdd = spark.sparkContext.makeRDD(myarr) val ds = rdd.toDS ds.rdd.flatMap(_.arr) ds.flatMap(_.arr) there is no exception, it has fixed? or My code is wrong? or I must test it with customed jar? > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun edited comment on SPARK-18055 at 11/8/16 4:51 AM: --- [~davies] I can't reproduce it on the master branch or on databricks(Spark 2.0.1-db1 (Scala 2.11)),my code: import scala.collection.mutable.ArrayBuffer case class MyData(id: String,arr: Seq\[String]) val myarr = ArrayBuffer\[MyData]() for(i <- 20 to 30){ val arr = ArrayBuffer\[String]() for(j <- 1 to 10) { arr += (i+j).toString } val mydata = new MyData(i.toString,arr) myarr += mydata } val rdd = spark.sparkContext.makeRDD(myarr) val ds = rdd.toDS ds.rdd.flatMap(_.arr) ds.flatMap(_.arr) there is no exception, it has fixed? or My code is wrong? or I must test it with customed jar? was (Author: windpiger): [~davies] I can't reproduce it on the master branch or on databricks(Spark 2.0.1-db1 (Scala 2.11)),my code: import scala.collection.mutable.ArrayBuffer case class MyData(id: String,arr: Seq\[String]) val myarr = ArrayBuffer\[MyData]() for(i <- 20 to 30){ val arr = ArrayBuffer\[String]() for(j <- 1 to 10) { arr += (i+j).toString } val mydata = new MyData(i.toString,arr) myarr += mydata } val rdd = spark.sparkContext.makeRDD(myarr) val ds = rdd.toDS ds.rdd.flatMap(_.arr) ds.flatMap(_.arr) there is no exception, it has fixed? or My code is wrong? > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15646477#comment-15646477 ] Song Jun commented on SPARK-18055: -- [~davies] I can't reproduce it on the master branch or on databricks(Spark 2.0.1-db1 (Scala 2.11)),my code: import scala.collection.mutable.ArrayBuffer case class MyData(id: String,arr: Seq\[String]) val myarr = ArrayBuffer\[MyData]() for(i <- 20 to 30){ val arr = ArrayBuffer\[String]() for(j <- 1 to 10) { arr += (i+j).toString } val mydata = new MyData(i.toString,arr) myarr += mydata } val rdd = spark.sparkContext.makeRDD(myarr) val ds = rdd.toDS ds.rdd.flatMap(_.arr) ds.flatMap(_.arr) there is no exception, it has fixed? or My code is wrong? > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
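For readability, here is the same reproduction attempt laid out as a spark-shell session (the escaped brackets above are JIRA markup; aside from formatting and the added collect() calls, nothing is changed):
{code}
import scala.collection.mutable.ArrayBuffer
import spark.implicits._

case class MyData(id: String, arr: Seq[String])

val myarr = ArrayBuffer[MyData]()
for (i <- 20 to 30) {
  val arr = ArrayBuffer[String]()
  for (j <- 1 to 10) { arr += (i + j).toString }
  myarr += new MyData(i.toString, arr)
}

val rdd = spark.sparkContext.makeRDD(myarr)
val ds = rdd.toDS()

ds.rdd.flatMap(_.arr).collect()   // works on the underlying RDD
ds.flatMap(_.arr).collect()       // the call the original report says fails for jar-provided types
{code}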
[jira] [Commented] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673918#comment-15673918 ] Song Jun commented on SPARK-18490: -- it is not a bug, just simplify this code > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Trivial > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
Song Jun created SPARK-18490: Summary: duplicate nodename extrainfo of ShuffleExchange Key: SPARK-18490 URL: https://issues.apache.org/jira/browse/SPARK-18490 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2, 2.0.1, 2.0.0 Reporter: Song Jun Priority: Minor override def nodeName: String = { val extraInfo = coordinator match { case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => s"(coordinator id: ${System.identityHashCode(coordinator)})" case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => s"(coordinator id: ${System.identityHashCode(coordinator)})" case None => "" } Whether exchangeCoordinator.isEstimated is true or false, the result is the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
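Since both guarded cases build exactly the same string, the guard on exchangeCoordinator.isEstimated adds nothing; a sketch of the simplification (the final string interpolation is assumed, since the original snippet is truncated after the match):
{code}
override def nodeName: String = {
  val extraInfo = coordinator match {
    // isEstimated no longer matters: both branches produced the same string
    case Some(exchangeCoordinator) =>
      s"(coordinator id: ${System.identityHashCode(coordinator)})"
    case None => ""
  }
  s"Exchange$extraInfo"  // assumed surrounding usage
}
{code}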
[jira] [Updated] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18490: - Component/s: (was: Spark Core) SQL > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Minor > > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly
[ https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614192#comment-15614192 ] Song Jun commented on SPARK-14171: -- This issue does not reproduce on the recently spark branch master ,because spark has implement its own percentile_approx(ApproximatePercentile.scala) not use the hive's, then there is no constant type check exception. But there is another problem,when sql: select percentile_approxy(key,0.9),count(distinct key),sume(distinc key) from src limit 1 detail in issue [SPARK-18137] > UDAF aggregates argument object inspector not parsed correctly > -- > > Key: SPARK-14171 > URL: https://issues.apache.org/jira/browse/SPARK-14171 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Jianfeng Hu >Priority: Critical > > For example, when using percentile_approx and count distinct together, it > raises an error complaining the argument is not constant. We have a test case > to reproduce. Could you help look into a fix of this? This was working in > previous version (Spark 1.4 + Hive 0.13). Thanks! > {code}--- > a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala > +++ > b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala > @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with > TestHiveSingleton with SQLTestUtils { > checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM > src LIMIT 1"), >sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq) > + > +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct > key) FROM src LIMIT 1"), > + sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq) > } >test("UDFIntegerToString") { > {code} > When running the test suite, we can see this error: > {code} > - Generic UDAF aggregates *** FAILED *** > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238) > at > org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148) > at > org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192) > at > org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > ... 
> Cause: java.lang.reflect.InvocationTargetException: > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) > ... > Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second > argument must be a constant, but double was passed instead. > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147) > at > org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598) > at >
[jira] [Created] (SPARK-18271) hash udf in HiveSessionCatalog.hiveFunctions seq is redundant
Song Jun created SPARK-18271: Summary: hash udf in HiveSessionCatalog.hiveFunctions seq is redundant Key: SPARK-18271 URL: https://issues.apache.org/jira/browse/SPARK-18271 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Song Jun Priority: Minor When looking up a function in HiveSessionCatalog, Spark first checks its own built-in FunctionRegistry; only if the function is not found there does it fall back to the Hive built-in functions listed in Seq(hiveFunctions). But the [hash] function is already in Spark's built-in FunctionRegistry, so the lookup never reaches Hive's hiveFunctions list for it; the [hash] entry in hiveFunctions is therefore redundant and we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
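A quick spark-shell check (a sketch, assuming the usual spark session) showing that hash already resolves through Spark's own built-in registry, so the Hive fallback entry can never be reached for it:
{code}
// hash is a native Spark SQL function (Murmur3-based), resolved before any Hive fallback
spark.sql("SELECT hash('spark', 42)").show()

// the catalog also lists it among the registered (non-temporary) functions
spark.catalog.listFunctions().filter("name = 'hash'").show()
{code}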
[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN
[ https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745309#comment-15745309 ] Song Jun commented on SPARK-18609: -- open another jira SPARK-18841 > [SQL] column mixup with CROSS JOIN > -- > > Key: SPARK-18609 > URL: https://issues.apache.org/jira/browse/SPARK-18609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Furcy Pin > > Reproduced on spark-sql v2.0.2 and on branch master. > {code} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > CREATE TABLE p1 (col TIMESTAMP) ; > CREATE TABLE p2 (col TIMESTAMP) ; > set spark.sql.crossJoin.enabled = true; > -- EXPLAIN > WITH CTE AS ( > SELECT > s2.col as col > FROM p1 > CROSS JOIN ( > SELECT > e.col as col > FROM p2 E > ) s2 > ) > SELECT > T1.col as c1, > T2.col as c2 > FROM CTE T1 > CROSS JOIN CTE T2 > ; > {code} > This returns the following stacktrace : > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: col#21 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at >
[jira] [Created] (SPARK-18841) PushProjectionThroughUnion exception when there are same column
Song Jun created SPARK-18841: Summary: PushProjectionThroughUnion exception when there are same column Key: SPARK-18841 URL: https://issues.apache.org/jira/browse/SPARK-18841 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2, 2.1.0 Reporter: Song Jun DROP TABLE IF EXISTS p1 ; DROP TABLE IF EXISTS p2 ; DROP TABLE IF EXISTS p3 ; CREATE TABLE p1 (col STRING) ; CREATE TABLE p2 (col STRING) ; CREATE TABLE p3 (col STRING) ; set spark.sql.crossJoin.enabled = true; SELECT 1 as cste, col FROM ( SELECT col as col FROM ( SELECT p1.col as col FROM p1 LEFT JOIN p2 UNION ALL SELECT col FROM p3 ) T1 ) T2 ; it will throw exception: key not found: col#16 java.util.NoSuchElementException: key not found: col#16 at scala.collection.MapLike$class.default(MapLike.scala:228) at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31) at scala.collection.MapLike$class.apply(MapLike.scala:141) at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378) at org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN
[ https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728042#comment-15728042 ] Song Jun commented on SPARK-18609: -- I'm working on this~ > [SQL] column mixup with CROSS JOIN > -- > > Key: SPARK-18609 > URL: https://issues.apache.org/jira/browse/SPARK-18609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Furcy Pin > > Reproduced on spark-sql v2.0.2 and on branch master. > {code} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > CREATE TABLE p1 (col TIMESTAMP) ; > CREATE TABLE p2 (col TIMESTAMP) ; > set spark.sql.crossJoin.enabled = true; > -- EXPLAIN > WITH CTE AS ( > SELECT > s2.col as col > FROM p1 > CROSS JOIN ( > SELECT > e.col as col > FROM p2 E > ) s2 > ) > SELECT > T1.col as c1, > T2.col as c2 > FROM CTE T1 > CROSS JOIN CTE T2 > ; > {code} > This returns the following stacktrace : > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: col#21 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at >
[jira] [Created] (SPARK-18742) readd spark.broadcast.factory for user-defined BroadcastFactory
Song Jun created SPARK-18742: Summary: readd spark.broadcast.factory for user-defined BroadcastFactory Key: SPARK-18742 URL: https://issues.apache.org/jira/browse/SPARK-18742 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Song Jun Priority: Minor After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. No code in Spark 2 uses BroadcastFactory other than TorrentBroadcastFactory; however, the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ This is not correct, since there is no way to plug in a custom user-specified BroadcastFactory. It would be better to re-add spark.broadcast.factory to support a user-defined BroadcastFactory. [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19154) support read and overwrite a same table
[ https://issues.apache.org/jira/browse/SPARK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816942#comment-15816942 ] Song Jun commented on SPARK-19154: -- I am working on this~ > support read and overwrite a same table > --- > > Key: SPARK-19154 > URL: https://issues.apache.org/jira/browse/SPARK-19154 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > > In SPARK-5746 , we forbid users to read and overwrite a same table. It seems > like we don't need this limitation now, we can remove the check and add > regression tests. We may need to take care of partitioned table though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
Song Jun created SPARK-19166: Summary: change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix Key: SPARK-19166 URL: https://issues.apache.org/jira/browse/SPARK-19166 Project: Spark Issue Type: Improvement Components: SQL Reporter: Song Jun Priority: Minor InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files that match a static prefix, such as a partition path (/table/foo=1) or a non-partition file path (/xxx/a.json), while the method name deleteMatchingPartitions suggests that only partition files will be deleted. The name is confusing, so it is better to rename the method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678 ] Song Jun edited comment on SPARK-18930 at 12/29/16 6:44 AM: from hive document, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause. and test it on hive also have the same logic as your description . I think we can close this jira? [~srowen] was (Author: windpiger): from hive document, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause. and test it on hive also have the same logic as your description . I think we can close this jira? > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine these numbers are num of records. But! When I do select * > from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.
[ https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678 ] Song Jun commented on SPARK-18930: -- Per the Hive documentation, https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert : "Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause." Testing this on Hive shows the same behavior as your description, so I think we can close this JIRA? > Inserting in partitioned table - partitioned field should be last in select > statement. > --- > > Key: SPARK-18930 > URL: https://issues.apache.org/jira/browse/SPARK-18930 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > CREATE TABLE temp.test_partitioning_4 ( > num string > ) > PARTITIONED BY ( > day string) > stored as parquet > INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day) > select day, count(*) as num from > hss.session where year=2016 and month=4 > group by day > Resulted schema on HDFS: /temp.db/test_partitioning_3/day=62456298, > emp.db/test_partitioning_3/day=69094345 > As you can imagine these numbers are num of records. But! When I do select * > from temp.test_partitioning_4 data is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
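Concretely, this means the partition column has to be the last expression in the SELECT list; a sketch of the statement from the report with the column order fixed (nothing else changed):
{code}
spark.sql("""
  INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
  SELECT count(*) AS num, day
  FROM hss.session
  WHERE year = 2016 AND month = 4
  GROUP BY day
""")
{code}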
[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported
[ https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18742: - Description: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory and the spark.broadcast.factory conf has removed. however the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ so we should modify the comment that SparkContext will not use a user-specified BroadcastFactory implementation [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 was: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. however the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ so we should modify the comment that SparkContext will not use a user-specified BroadcastFactory implementation [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 > Clarify that user-defined BroadcastFactory is not supported > --- > > Key: SPARK-18742 > URL: https://issues.apache.org/jira/browse/SPARK-18742 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Song Jun >Priority: Trivial > > After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation > of BroadcastFactory is TorrentBroadcastFactory and the > spark.broadcast.factory conf has removed. > however the scaladoc says [2]: > /** > * An interface for all the broadcast implementations in Spark (to allow > * multiple broadcast implementations). SparkContext uses a user-specified > * BroadcastFactory implementation to instantiate a particular broadcast for > the > * entire Spark job. > */ > so we should modify the comment that SparkContext will not use a > user-specified BroadcastFactory implementation > [1] https://issues.apache.org/jira/browse/SPARK-12588 > [2] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported
[ https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-18742: - Description: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. however the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ so we should modify the comment that SparkContext will not use a user-specified BroadcastFactory implementation [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 was: After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of BroadcastFactory is TorrentBroadcastFactory. No code in Spark 2 uses BroadcastFactory (but TorrentBroadcastFactory) however the scaladoc says [2]: /** * An interface for all the broadcast implementations in Spark (to allow * multiple broadcast implementations). SparkContext uses a user-specified * BroadcastFactory implementation to instantiate a particular broadcast for the * entire Spark job. */ which is not correct since there is no way to plug in a custom user-specified BroadcastFactory. It is better to readd spark.broadcast.factory for user-defined BroadcastFactory [1] https://issues.apache.org/jira/browse/SPARK-12588 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 > Clarify that user-defined BroadcastFactory is not supported > --- > > Key: SPARK-18742 > URL: https://issues.apache.org/jira/browse/SPARK-18742 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Song Jun >Priority: Trivial > > After SPARK-12588 Remove HTTPBroadcast [1], the one and only > implementation of BroadcastFactory is TorrentBroadcastFactory. however > the scaladoc says [2]: > /** > * An interface for all the broadcast implementations in Spark (to allow > * multiple broadcast implementations). SparkContext uses a user-specified > * BroadcastFactory implementation to instantiate a particular broadcast for > the > * entire Spark job. > */ > so we should modify the comment that SparkContext will not use a > user-specified BroadcastFactory implementation > [1] https://issues.apache.org/jira/browse/SPARK-12588 > [2] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
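A possible wording for the corrected scaladoc, just a sketch of the direction this issue asks for (the trait signature is reproduced from memory and may differ slightly from the actual source):
{code}
/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). Since the removal of HTTPBroadcast
 * (SPARK-12588), TorrentBroadcastFactory is the only implementation, and
 * SparkContext no longer accepts a user-specified BroadcastFactory.
 */
private[spark] trait BroadcastFactory {
  // ... existing methods unchanged ...
}
{code}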
[jira] [Updated] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-20013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-20013: - Description: Currently when we create / rename a managed table, we should get the defaultTablePath for them in ExternalCatalog, then we have two defaultTablePath logic in its two subclass HiveExternalCatalog and InMemoryCatalog, additionally there is also a defaultTablePath in SessionCatalog, so till now we have three defaultTablePath in three classes. we'd better to unify them up to SessionCatalog To unify them, we should move some logic from ExternalCatalog to SessionCatalog, renameTable is one of this. while limit to the simple parameters in renameTable {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we can not pass it to renameTable. So we can add a newTablePath parameter for renameTable in ExternalCatalog was: merge renameTable to alterTable in ExternalCatalog has some reasons: 1. In Hive, we rename a Table by alterTable 2. Currently when we create / rename a managed table, we should get the defaultTablePath for them in ExternalCatalog, then we have two defaultTablePath logic in its two subclass HiveExternalCatalog and InMemoryCatalog, additionally there is also a defaultTablePath in SessionCatalog, so till now we have three defaultTablePath in three classes. we'd better to unify them up to SessionCatalog To unify them, we should move some logic from ExternalCatalog to SessionCatalog, renameTable is one of this. while limit to the simple parameters in renameTable {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we can not pass it to renameTable. So we can merge the renameTable to alterTable, and rename it in alterTable. > merge renameTable to alterTable in ExternalCatalog > -- > > Key: SPARK-20013 > URL: https://issues.apache.org/jira/browse/SPARK-20013 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun > > Currently when we create / rename a managed table, we should get the > defaultTablePath for them in ExternalCatalog, then we have two > defaultTablePath logic in its two subclass HiveExternalCatalog and > InMemoryCatalog, additionally there is also a defaultTablePath in > SessionCatalog, so till now we have three defaultTablePath in three classes. > we'd better to unify them up to SessionCatalog > To unify them, we should move some logic from ExternalCatalog to > SessionCatalog, renameTable is one of this. > while limit to the simple parameters in renameTable > {code} > def renameTable(db: String, oldName: String, newName: String): Unit > {code} > even if we move the defaultTablePath logic to SessionCatalog, we can not pass > it to renameTable. > So we can add a newTablePath parameter for renameTable in ExternalCatalog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
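A sketch of the signature change the updated description proposes; the parameter name newTablePath comes from the description above, everything else is illustrative:
{code}
// ExternalCatalog (sketch only): SessionCatalog computes the default table location once
// and passes it down, so HiveExternalCatalog and InMemoryCatalog no longer need their own
// defaultTablePath logic when renaming a managed table.
def renameTable(db: String, oldName: String, newName: String, newTablePath: String): Unit
{code}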
[jira] [Created] (SPARK-19961) unify an exception error message for dropDatabase
Song Jun created SPARK-19961: Summary: unify an exception error message for dropDatabase Key: SPARK-19961 URL: https://issues.apache.org/jira/browse/SPARK-19961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor Unify the exception error message for dropDatabase when the database still contains some tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
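A minimal sketch of the idea, not the actual Spark patch: both catalog implementations (InMemoryCatalog and HiveExternalCatalog) would surface one shared message when a non-empty database is dropped without CASCADE. Inside Spark's sql package this would presumably be an AnalysisException; a generic exception stands in here so the sketch compiles on its own:
{code}
def assertDatabaseIsEmpty(db: String, tableNames: Seq[String]): Unit = {
  if (tableNames.nonEmpty) {
    // one shared, descriptive message instead of two catalog-specific ones
    throw new IllegalStateException(
      s"Cannot drop database '$db' because it still contains ${tableNames.size} table(s). " +
        "Use DROP DATABASE ... CASCADE to drop the tables as well.")
  }
}
{code}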
[jira] [Created] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog
Song Jun created SPARK-20013: Summary: merge renameTable to alterTable in ExternalCatalog Key: SPARK-20013 URL: https://issues.apache.org/jira/browse/SPARK-20013 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun There are several reasons to merge renameTable into alterTable in ExternalCatalog: 1. In Hive, a table is renamed through alterTable. 2. Currently, when we create or rename a managed table, we need to compute the defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog; there is also a defaultTablePath in SessionCatalog, so we now have three copies of this logic in three classes. It would be better to unify them in SessionCatalog. To unify them, we should move some logic from ExternalCatalog to SessionCatalog, and renameTable is one such case. However, given renameTable's simple parameters {code} def renameTable(db: String, oldName: String, newName: String): Unit {code} even if we move the defaultTablePath logic to SessionCatalog, we cannot pass the path to renameTable. So we can merge renameTable into alterTable and perform the rename there. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM: --- the root cause is [the csvfile path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which will failed when new Path() [new Path in datasource.scala |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344] and the cars.csv are stored in module core's resources. after we merge the HiveDDLSuit and DDLSuit in SPARK-19235 https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if we test module hive, we will run the DDLSuit in the core module, and this will cause that we get the illegal path like 'jar:file:/xxx' above. it is not related with SPARK-19763 I will fix this by providing a new test dir which contain the test files in sql/ , and the test case use this file path. thanks~ was (Author: windpiger): the root cause is [the csvfile path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which will failed when new Path() [new Path in datasource.scala |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344] and the cars.csv are stored in module core's resources. after we merge the HiveDDLSuit and DDLSuit https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if we test module hive, we will run the DDLSuit in the core module, and this will cause that we get the illegal path like 'jar:file:/xxx' above. it is not related with SPARK-19763 I will fix this by providing a new test dir which contain the test files in sql/ , and the test case use this file path. thanks~ > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using > and is possibly caused by SPARK-19763. 
> Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at >
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM: --- the root cause is [the csvfile path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which will failed when new Path() [new Path in datasource.scala |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344] and the cars.csv are stored in module core's resources. after we merge the HiveDDLSuit and DDLSuit https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if we test module hive, we will run the DDLSuit in the core module, and this will cause that we get the illegal path like 'jar:file:/xxx' above. it is not related with SPARK-19763 I will fix this by providing a new test dir which contain the test files in sql/ , and the test case use this file path. thanks~ was (Author: windpiger): the root cause is [the csvfile path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which will failed when new Path() [new Path in datasource.scala |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344] and the cars.csv are stored in module core's resources. after we merge the HiveDDLSuit and DDLSuit https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if we test module hive, we will run the DDLSuit in the core module, and this will cause that we get the illegal path like 'jar:file:/xxx' above. > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using > and is possibly caused by SPARK-19763. 
> Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at >
[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409 ] Song Jun commented on SPARK-19990: -- the root cause is [the csvfile path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which will failed when new Path() [new Path in datasource.scala |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344] and the cars.csv are stored in module core's resources. after we merge the HiveDDLSuit and DDLSuit https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if we test module hive, we will run the DDLSuit in the core module, and this will cause that we get the illegal path like 'jar:file:/xxx' above. > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using > and is possibly caused by SPARK-19763. > Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705) > at > org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186) > at > org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at >
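To make the root cause above concrete, here is a minimal sketch (the jar path below is hypothetical; the real one points into the spark-sql tests jar on Jenkins) showing that a Hadoop Path cannot be built from a "jar:file:...!/..." resource URL, which is exactly what DataSource.resolveRelation attempts at DataSource.scala#L344:

{code}
import org.apache.hadoop.fs.Path

// Hypothetical jar-resource path, standing in for the cars.csv packaged in the test jar.
val csvInTestJar = "jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv"

// Hadoop Path parses "jar" as the scheme and the remainder as a relative path,
// so this throws:
//   java.lang.IllegalArgumentException: java.net.URISyntaxException:
//   Relative path in absolute URI: jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv
val p = new Path(csvInTestJar)
{code}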
[jira] [Updated] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19945: - Summary: Add test suite for SessionCatalog with HiveExternalCatalog (was: Add test case for SessionCatalog with HiveExternalCatalog) > Add test suite for SessionCatalog with HiveExternalCatalog > -- > > Key: SPARK-19945 > URL: https://issues.apache.org/jira/browse/SPARK-19945 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Song Jun > > Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite > for HiveExternalCatalog. > Some DDL functions are also not suitable to test in ExternalCatalogSuite, because their logic > is not fully implemented in ExternalCatalog; these functions are fully implemented in > SessionCatalog, so it is better to test them in SessionCatalogSuite. > So we should add a test suite for SessionCatalog with HiveExternalCatalog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19945) Add test case for SessionCatalog with HiveExternalCatalog
Song Jun created SPARK-19945: Summary: Add test case for SessionCatalog with HiveExternalCatalog Key: SPARK-19945 URL: https://issues.apache.org/jira/browse/SPARK-19945 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun
Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite for HiveExternalCatalog. Some DDL functions are also not suitable to test in ExternalCatalogSuite, because their logic is not fully implemented in ExternalCatalog; these functions are fully implemented in SessionCatalog, so it is better to test them in SessionCatalogSuite. So we should add a test suite for SessionCatalog with HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
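As a rough sketch of the direction this issue suggests (the suite names below are hypothetical, and it assumes SessionCatalog's test-only single-argument constructor), the shared assertions could live in a base suite that only decides how the ExternalCatalog is built, so the same cases run against both InMemoryCatalog and HiveExternalCatalog:

{code}
import org.apache.spark.sql.catalyst.catalog.{ExternalCatalog, InMemoryCatalog, SessionCatalog}
import org.scalatest.FunSuite

// Hypothetical base suite: concrete suites only choose the ExternalCatalog implementation.
abstract class SessionCatalogSuiteBase extends FunSuite {
  protected def newExternalCatalog(): ExternalCatalog

  test("listDatabases goes through the configured ExternalCatalog") {
    val catalog = new SessionCatalog(newExternalCatalog())  // test-only constructor
    assert(catalog.listDatabases().forall(_.nonEmpty))
  }
}

// Runs against the in-memory catalog in sql/catalyst.
class InMemorySessionCatalogSuite extends SessionCatalogSuiteBase {
  override protected def newExternalCatalog(): ExternalCatalog = new InMemoryCatalog
}

// A corresponding suite in sql/hive would override newExternalCatalog() to return a
// HiveExternalCatalog backed by the test metastore.
{code}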
[jira] [Created] (SPARK-19917) qualified partition location stored in catalog
Song Jun created SPARK-19917: Summary: qualified partition location stored in catalog Key: SPARK-19917 URL: https://issues.apache.org/jira/browse/SPARK-19917 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun
The partition path should be qualified before it is stored in the catalog. There are several scenarios:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' qualified: file:/path/x
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' qualified: file:/tablelocation/x
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x' qualified: file:/path/x
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x' qualified: file:/tablelocation/x
Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for a hive serde table has the expected qualified path; we should make the other scenarios consistent with it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
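A hedged illustration of the expected behaviour (the table name and paths are made up, and the internal sessionState.catalog API is used only to show where the stored location can be inspected):

{code}
import org.apache.spark.sql.catalyst.TableIdentifier

spark.sql("CREATE TABLE t (a INT) PARTITIONED BY (b INT) LOCATION '/tmp/tbl'")

// Scenario 4 above: a relative partition location.
spark.sql("ALTER TABLE t ADD PARTITION (b = 1) LOCATION 'x'")

// The location stored in the catalog should already be qualified,
// i.e. file:/tmp/tbl/x rather than the raw relative string 'x'.
val stored = spark.sessionState.catalog
  .getPartition(TableIdentifier("t"), Map("b" -> "1"))
  .storage.locationUri
println(stored)
{code}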
[jira] [Updated] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19864: - Summary: add makeQualifiedPath in SQLTestUtils to optimize some code (was: add makeQualifiedPath in CatalogUtils to optimize some code) > add makeQualifiedPath in SQLTestUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Song Jun > Priority: Minor > > Currently there are lots of places that make a path qualified; it is better > to provide a single function for this, so the code will be simpler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
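A minimal sketch of the kind of helper this issue proposes (the name and signature are assumptions, not the merged change):

{code}
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Qualify a raw path against its FileSystem so the scheme and authority are always
// present, e.g. "/tmp/x" becomes "file:/tmp/x" on a local filesystem.
def makeQualifiedPath(path: String, hadoopConf: Configuration): URI = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(hadoopConf)
  fs.makeQualified(hadoopPath).toUri
}

makeQualifiedPath("/tmp/x", new Configuration())
{code}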
[jira] [Created] (SPARK-19869) move table related ddl from ddl.scala to tables.scala
Song Jun created SPARK-19869: Summary: move table related ddl from ddl.scala to tables.scala Key: SPARK-19869 URL: https://issues.apache.org/jira/browse/SPARK-19869 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor move table related ddl from ddl.scala to tables.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19869) move table related ddl from ddl.scala to tables.scala
[ https://issues.apache.org/jira/browse/SPARK-19869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Song Jun updated SPARK-19869: - Issue Type: Improvement (was: Bug) > move table related ddl from ddl.scala to tables.scala > - > > Key: SPARK-19869 > URL: https://issues.apache.org/jira/browse/SPARK-19869 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > move table related ddl from ddl.scala to tables.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered
Song Jun created SPARK-19845: Summary: failed to uncache datasource table after the table location altered Key: SPARK-19845 URL: https://issues.apache.org/jira/browse/SPARK-19845 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun
Currently, if we first cache a datasource table, then alter the table location, and then drop the table, uncaching the table will fail in DropTableCommand, because the location has changed and sameResult for two InMemoryFileIndex instances with different locations returns false, so we can't find the table key in the cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
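A hedged repro sketch of the sequence described above (the table name and locations are made up):

{code}
spark.sql("CREATE TABLE t (a INT) USING parquet LOCATION '/tmp/loc1'")
spark.sql("CACHE TABLE t")

// After the location changes, the plan built when DROP TABLE tries to uncache the
// table no longer satisfies sameResult with the cached plan (the two InMemoryFileIndex
// instances point at different locations), so the cache entry cannot be found.
spark.sql("ALTER TABLE t SET LOCATION '/tmp/loc2'")
spark.sql("DROP TABLE t")
{code}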
[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626 ] Song Jun commented on SPARK-19845: -- yes, this is what https://issues.apache.org/jira/browse/SPARK-19784 is tracking. Since the location changed, it is more complex to uncache the table and recache other tables that reference it.
> failed to uncache datasource table after the table location altered > --- > > Key: SPARK-19845 > URL: https://issues.apache.org/jira/browse/SPARK-19845 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Song Jun > > Currently if we first cache a datasource table, then we alter the table > location, > then we drop the table, uncache table will failed in the DropTableCommand, > because the location has changed and sameResult for two InMemoryFileIndex > with different location return false, so we can't find the table key in the > cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19845) failed to uncache datasource table after the table location altered
[ https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626 ] Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM: -- yes, this is what https://issues.apache.org/jira/browse/SPARK-19784 is tracking. Since the location changed, it is more complex to uncache the table and recache other tables that reference it. I will dig into it more.
was (Author: windpiger): yes, this jira is doing this https://issues.apache.org/jira/browse/SPARK-19784 the location changed, it make it more complex to uncache the table and recache other tables reference this.
> failed to uncache datasource table after the table location altered > --- > > Key: SPARK-19845 > URL: https://issues.apache.org/jira/browse/SPARK-19845 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Song Jun > > Currently if we first cache a datasource table, then we alter the table > location, > then we drop the table, uncache table will failed in the DropTableCommand, > because the location has changed and sameResult for two InMemoryFileIndex > with different location return false, so we can't find the table key in the > cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899292#comment-15899292 ] Song Jun commented on SPARK-19836: -- I have done something similar in https://github.com/apache/spark/pull/16803
> Customizable remote repository url for hive versions unit test > -- > > Key: SPARK-19836 > URL: https://issues.apache.org/jira/browse/SPARK-19836 > Project: Spark > Issue Type: Bug > Components: Tests > Affects Versions: 2.1.0 > Reporter: Elek, Marton > Labels: ivy, unittest > > When the VersionSuite test runs from sql/hive it downloads different versions > of hive. > Unfortunately the IsolatedClientClassloader (which is used by the > VersionSuite) uses hardcoded, fixed repositories: > {code} > val classpath = quietly { > SparkSubmitUtils.resolveMavenCoordinates( > hiveArtifacts.mkString(","), > SparkSubmitUtils.buildIvySettings( > Some("http://www.datanucleus.org/downloads/maven2"), > ivyPath), > exclusions = version.exclusions) > } > {code} > The problem is with the hard-coded repositories: > 1. it's hard to run unit tests in an environment where only one internal > maven repository is available (and central/datanucleus is not) > 2. it's impossible to run unit tests against custom-built hive/hadoop > artifacts (which are not available from the central repository) > VersionSuite already has a specific SPARK_VERSIONS_SUITE_IVY_PATH environment > variable to define a custom local repository as the ivy cache. > I suggest adding an additional environment variable > (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES in HiveClientBuilder.scala) to > make it possible to add new remote repositories for testing the different > hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
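A small sketch of the suggestion in the quoted description (the environment variable name comes from the description itself; how it is wired into HiveClientBuilder is an assumption):

{code}
// Extra remote repositories, e.g. an internal mirror or a repository holding
// custom-built hive artifacts, taken from the environment instead of being hard-coded.
val extraRepos = sys.env.get("SPARK_VERSIONS_SUITE_IVY_REPOSITORIES")

val remoteRepos = (Seq("http://www.datanucleus.org/downloads/maven2") ++ extraRepos).mkString(",")

// remoteRepos (a comma-separated list) would then be passed to
// SparkSubmitUtils.buildIvySettings(Some(remoteRepos), ivyPath) in place of the
// single hard-coded datanucleus URL shown in the quoted code.
{code}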
[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name
Song Jun created SPARK-19832: Summary: DynamicPartitionWriteTask should escape the partition name Key: SPARK-19832 URL: https://issues.apache.org/jira/browse/SPARK-19832 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun
Currently in DynamicPartitionWriteTask, when we build the partition path for a partition, we only escape the partition value, not the partition name. This causes problems when the partition name contains special characters, for example:
1) if the partition name contains '%' etc., two partition paths end up in the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like '/path/a%b=1'. The inserted data is stored under the unescaped path, while SHOW PARTITIONS returns 'a%25b=1', in which the partition name is escaped, so the two are inconsistent. The data should be stored under the escaped path in the filesystem, which is also what Hive 2.0.0 does.
2) if the partition name contains ':', new Path("/path", "a:b") throws an exception, because a colon in a relative path is illegal.
{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.hadoop.fs.Path.<init>(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
{code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
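A hedged sketch of the fix direction, assuming ExternalCatalogUtils.escapePathName (which the catalog code already applies to partition values) is applied to the partition name as well:

{code}
import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName

// Escape both the partition column name and the value when building the dynamic
// partition directory, so special characters cannot produce an unescaped or
// illegal path segment.
def partitionDir(colName: String, value: String): String =
  escapePathName(colName) + "=" + escapePathName(value)

partitionDir("a%b", "1")  // "a%25b=1", matching what SHOW PARTITIONS reports
partitionDir("a:b", "1")  // "a%3Ab=1", avoiding the new Path("/path", "a:b") failure
{code}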
[jira] [Created] (SPARK-19833) remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exist
Song Jun created SPARK-19833: Summary: remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exist Key: SPARK-19833 URL: https://issues.apache.org/jira/browse/SPARK-19833 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun
In SPARK-5068, we introduced the SQLConf spark.sql.hive.verifyPartitionPath; when it is set to true, it avoids task failures when a partition location does not exist in the filesystem. That situation should always just return an empty result rather than fail the task, so we can remove this conf. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
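A minimal sketch (the helper name is made up) of the behaviour the description asks for: partition locations that no longer exist on the filesystem simply contribute no input files, so the conf is not needed to avoid task failures:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Keep only the partition locations that still exist; a deleted directory then results
// in an empty scan for that partition instead of a failed task.
def existingPartitionPaths(locations: Seq[String], hadoopConf: Configuration): Seq[Path] =
  locations.map(new Path(_)).filter { p =>
    val fs = p.getFileSystem(hadoopConf)
    fs.exists(p)
  }
{code}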
[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code
Song Jun created SPARK-19864: Summary: add makeQualifiedPath in CatalogUtils to optimize some code Key: SPARK-19864 URL: https://issues.apache.org/jira/browse/SPARK-19864 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor
Currently there are lots of places that make a path qualified; it is better to provide a single function for this, so the code will be simpler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org