[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources

2022-02-01 Thread Cheng Lian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485215#comment-17485215
 ] 

Cheng Lian commented on SPARK-37980:


[~prakharjain09], as you've mentioned, it's not super straightforward to 
customize the Parquet code paths in Spark to achieve this goal. At the same 
time, this functionality is in general quite useful: I can imagine it enabling 
other systems in the Parquet ecosystem to build more sophisticated indexing 
solutions. Instead of doing heavy customizations in Spark, would it be better 
to make the changes in upstream {{parquet-mr}} so that other systems can 
benefit from them more easily?

> Extend METADATA column to support row indices for file based data sources
> -
>
> Key: SPARK-37980
> URL: https://issues.apache.org/jira/browse/SPARK-37980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3
>Reporter: Prakhar Jain
>Priority: Major
>
> Spark recently added hidden metadata column support for file-based data 
> sources as part of SPARK-37273.
> We should extend it to support ROW_INDEX/ROW_POSITION as well.
>  
> Meaning of ROW_POSITION:
> ROW_INDEX/ROW_POSITION is the index of a row within a file, e.g. the 5th row 
> in the file will have ROW_INDEX 5.
>  
> Use cases:
> Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple 
> uniquely identifies a row in a table. This information can be used to mark 
> rows, e.g. by an external indexer.
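For illustration only, here is a minimal spark-shell sketch of the proposed
usage, assuming the hidden {{_metadata}} struct from SPARK-37273 were extended
with a hypothetical {{row_index}} field (today it only exposes fields such as
{{file_path}} and {{file_name}}; the input path is a placeholder):
{code}
import spark.implicits._

val marked = spark.read.parquet("/data/some_table")          // placeholder path
  .select($"*",
          $"_metadata.file_name".as("file_name"),
          $"_metadata.row_index".as("row_index"))            // hypothetical field
// (file_name, row_index) would then uniquely identify a row for an external indexer.
{code}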






[jira] [Updated] (SPARK-31935) Hadoop file system config should be effective in data source options

2020-06-30 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-31935:
---
Affects Version/s: (was: 3.0.1)
   (was: 3.1.0)
   2.4.6
   3.0.0

> Hadoop file system config should be effective in data source options 
> -
>
> Key: SPARK-31935
> URL: https://issues.apache.org/jira/browse/SPARK-31935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Data source options should be propagated into the Hadoop configuration used 
> by the method `checkAndGlobPathIfNecessary`.
> From org.apache.hadoop.fs.FileSystem.java:
> {code:java}
>   public static FileSystem get(URI uri, Configuration conf) throws IOException {
>     String scheme = uri.getScheme();
>     String authority = uri.getAuthority();
>     if (scheme == null && authority == null) {     // use default FS
>       return get(conf);
>     }
>     if (scheme != null && authority == null) {     // no authority
>       URI defaultUri = getDefaultUri(conf);
>       if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
>           && defaultUri.getAuthority() != null) {  // & default has authority
>         return get(defaultUri, conf);              // return default
>       }
>     }
> 
>     String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
>     if (conf.getBoolean(disableCacheName, false)) {
>       return createFileSystem(uri, conf);
>     }
>     return CACHE.get(uri, conf);
>   }
> {code}
> With this, we can specify URI scheme- and authority-related configurations 
> for the file systems being scanned.
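A hedged usage sketch of the behavior described above: a Hadoop file-system
setting passed as a data source option should take effect while the input path
is resolved and globbed. The {{fs.s3a.endpoint}} value and the bucket name
below are placeholders:
{code}
val df = spark.read
  .option("fs.s3a.endpoint", "s3.example.com")   // propagated into the Hadoop conf
  .parquet("s3a://some-bucket/path/to/data")
{code}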






[jira] [Updated] (SPARK-26352) Join reordering should not change the order of output attributes

2020-05-29 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26352:
---
Summary: Join reordering should not change the order of output attributes  
(was: join reordering should not change the order of output attributes)

> Join reordering should not change the order of output attributes
> 
>
> Key: SPARK-26352
> URL: https://issues.apache.org/jira/browse/SPARK-26352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> The optimizer rule {{org.apache.spark.sql.catalyst.optimizer.ReorderJoin}} 
> performs join reordering on inner joins. It was introduced by SPARK-12032 in 
> December 2015.
> After reordering the joins, however, the rule did not check whether the 
> column order (in terms of the {{output}} attribute list) was still the same 
> as before. Thus, it's possible to have a mismatch between the reordered 
> column order and the schema that the DataFrame thinks it has.
> This can be demonstrated with the following example:
> {code:none}
> spark.sql("create table table_a (x int, y int) using parquet")
> spark.sql("create table table_b (i int, j int) using parquet")
> spark.sql("create table table_c (a int, b int) using parquet")
> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i")
> {code}
> here's what the DataFrame thinks:
> {code:none}
> scala> df.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: integer (nullable = true)
>  |-- i: integer (nullable = true)
>  |-- j: integer (nullable = true)
>  |-- a: integer (nullable = true)
>  |-- b: integer (nullable = true)
> {code}
> here's what the optimized plan thinks, after join reordering:
> {code:none}
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- a: integer
> |-- b: integer
> |-- i: integer
> |-- j: integer
> {code}
> If we exclude the {{ReorderJoin}} rule (using Spark 2.4's optimizer rule 
> exclusion feature), it's back to normal:
> {code:none}
> scala> spark.conf.set("spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.ReorderJoin")
> scala> val df = spark.sql("with df1 as (select * from table_a cross join table_b) select * from df1 join table_c on a = x and b = i")
> df: org.apache.spark.sql.DataFrame = [x: int, y: int ... 4 more fields]
> scala> df.queryExecution.optimizedPlan.output.foreach(a => println(s"|-- ${a.name}: ${a.dataType.typeName}"))
> |-- x: integer
> |-- y: integer
> |-- i: integer
> |-- j: integer
> |-- a: integer
> |-- b: integer
> {code}
> Note that this column ordering problem leads to data corruption and can 
> manifest itself in various symptoms:
> * Silent data corruption: if the reordered columns happen to have matching or 
> sufficiently compatible types (e.g. all fixed-length primitive types are 
> considered "sufficiently compatible" in an UnsafeRow), only the resulting 
> data is wrong, which might not trigger any alarms immediately. Or
> * Weird Java-level exceptions like {{java.lang.NegativeArraySizeException}}, 
> or even SIGSEGVs.
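A hedged sketch (not necessarily the actual patch) of the kind of guard a
reordering rule can apply: wrap the reordered plan in a {{Project}} that
restores the original output order whenever the attribute order has changed.
{{original}} and {{reordered}} are assumed inputs:
{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

def restoreOutputOrder(original: LogicalPlan, reordered: LogicalPlan): LogicalPlan =
  if (reordered.output != original.output) {
    Project(original.output, reordered)  // keep the column order the DataFrame expects
  } else {
    reordered
  }
{code}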






[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-10-30 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-29667:
---
Environment: (was: spark-2.4.3-bin-dbr-5.5-snapshot-9833d0f)

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))]
> The AND clause of the SQL:
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the SQL ran just 
> fine. Can the SQL engine cast implicitly in this case?






[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-10-30 Thread Cheng Lian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963305#comment-16963305
 ] 

Cheng Lian commented on SPARK-29667:


Reproduced this with the following snippet:
{code}
spark.range(10).select($"id" cast DecimalType(18, 0)).createOrReplaceTempView("t1")
spark.range(10).select($"id" cast DecimalType(28, 0)).createOrReplaceTempView("t2")
sql("SELECT * FROM t1 WHERE t1.id IN (SELECT id FROM t2)").explain(true)
{code}
Exception:
{noformat}
The data type of one or more elements in the left hand side of an IN subquery
is not compatible with the data type of the output of the subquery
Mismatched columns:
[(t1.`id`:decimal(18,0), t2.`id`:decimal(28,0))]
Left side:
[decimal(18,0)].
Right side:
[decimal(28,0)].; line 1 pos 29;
'Project [*]
+- 'Filter id#16 IN (list#22 [])
   :  +- Project [id#20]
   : +- SubqueryAlias `t2`
   :+- Project [cast(id#18L as decimal(28,0)) AS id#20]
   :   +- Range (0, 10, step=1, splits=Some(8))
   +- SubqueryAlias `t1`
  +- Project [cast(id#14L as decimal(18,0)) AS id#16]
 +- Range (0, 10, step=1, splits=Some(8))
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:123)
...
{noformat}
It seems that Postgres does support this kind of implicit casting:
{noformat}
postgres=# SELECT CAST(1 AS BIGINT) IN (CAST(1 AS INT));
 ?column?
----------
 t
(1 row)
{noformat}
I believe the problem in Spark is that 
{{o.a.s.s.c.expressions.In#checkInputDataTypes()}} is too strict.
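As a hedged workaround sketch based on the reporter's observation, an explicit
cast of the left-hand side to the wider decimal type sidesteps the strict
check; {{t1}} and {{t2}} are the temp views from the snippet above:
{code}
sql("""
  SELECT * FROM t1
  WHERE CAST(t1.id AS DECIMAL(28, 0)) IN (SELECT id FROM t2)
""").explain(true)
{code}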

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: spark-2.4.3-bin-dbr-5.5-snapshot-9833d0f
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))]
> The AND clause of the SQL:
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the SQL ran just 
> fine. Can the SQL engine cast implicitly in this case?






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Reporter: Cheng Lian  (was: liancheng)

> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}
> This issue was reported by [~liancheng]
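An illustrative, hedged sketch of a merge guard that avoids the NaN: treat a
zero-count side as the identity element instead of folding its 0/0 average
into the weighted mean. Field names mirror EventTimeStats, but this is not the
actual class:
{code}
case class Stats(var max: Long, var min: Long, var avg: Double, var count: Long) {
  def merge(that: Stats): Unit = {
    if (that.count == 0) {
      // nothing to merge from an empty side; keep this side unchanged
    } else if (this.count == 0) {
      // this side is the zero element; take the other side wholesale
      max = that.max; min = that.min; avg = that.avg; count = that.count
    } else {
      max = math.max(max, that.max)
      min = math.min(min, that.min)
      count += that.count
      avg += (that.avg - avg) * that.count / count  // incremental weighted average
    }
  }
}
{code}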






[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly

2019-10-10 Thread Cheng Lian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-26806:
---
Description: 
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:
{code:java}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

  was:
Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
make "avg" become "NaN". And whatever gets merged with the result of 
"zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong will 
return "0" and the user will see the following incorrect report:

{code}
"eventTime" : {
"avg" : "1970-01-01T00:00:00.000Z",
"max" : "2019-01-31T12:57:00.000Z",
"min" : "2019-01-30T18:44:04.000Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  }
{code}

This issue was reported by [~liancheng]


> EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
> 
>
> Key: SPARK-26806
> URL: https://issues.apache.org/jira/browse/SPARK-26806
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Cheng Lian
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0
>
>
> Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will 
> make "avg" become "NaN". And whatever gets merged with the result of 
> "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong 
> will return "0" and the user will see the following incorrect report:
> {code:java}
> "eventTime" : {
> "avg" : "1970-01-01T00:00:00.000Z",
> "max" : "2019-01-31T12:57:00.000Z",
> "min" : "2019-01-30T18:44:04.000Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   }
> {code}






[jira] [Assigned] (SPARK-27369) Standalone worker can load resource conf and discover resources

2019-06-11 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-27369:
--

Assignee: wuyi

> Standalone worker can load resource conf and discover resources
> ---
>
> Key: SPARK-27369
> URL: https://issues.apache.org/jira/browse/SPARK-27369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: wuyi
>Priority: Major
>







[jira] [Assigned] (SPARK-27611) Redundant javax.activation dependencies in the Maven build

2019-05-01 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-27611:
--

Assignee: Cheng Lian

> Redundant javax.activation dependencies in the Maven build
> --
>
> Key: SPARK-27611
> URL: https://issues.apache.org/jira/browse/SPARK-27611
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> [PR #23890|https://github.com/apache/spark/pull/23890] introduced 
> {{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an 
> unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} 
> was also pulled in as a transitive dependency. As a result, for the Maven 
> build, both of the following two jars can be found under 
> {{assembly/target/scala-2.12/jars}}:
> {noformat}
> activation-1.1.1.jar
> jakarta.activation-api-1.2.1.jar
> {noformat}
> Discussed this with [~srowen] offline and we agreed that we should probably 
> exclude the Jakarta one.






[jira] [Created] (SPARK-27611) Redundant javax.activation dependencies in the Maven build

2019-04-30 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-27611:
--

 Summary: Redundant javax.activation dependencies in the Maven build
 Key: SPARK-27611
 URL: https://issues.apache.org/jira/browse/SPARK-27611
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 3.0.0
Reporter: Cheng Lian


[PR #23890|https://github.com/apache/spark/pull/23890] introduced 
{{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an 
unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} was 
also pulled in as a transitive dependency. As a result, for the Maven build, 
both of the following two jars can be found under 
{{assembly/target/scala-2.12/jars}}:
{noformat}
activation-1.1.1.jar
jakarta.activation-api-1.2.1.jar
{noformat}
Discussed this with [~srowen] offline and we agreed that we should probably 
exclude the Jakarta one.







[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678595#comment-16678595
 ] 

Cheng Lian commented on SPARK-25966:


[~andrioni], just realized that I might have misunderstood this part of your 
statement:
{quote}
This job used to work fine with Spark 2.2.1
[...]
{quote}
I thought you meant that you could read the same problematic files with Spark 
2.2.1. Now I guess you probably only meant that the same job worked fine with 
Spark 2.2.1 previously (with different sets of historical files).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   

[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at 

[jira] [Comment Edited] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian edited comment on SPARK-25966 at 11/7/18 5:34 PM:
-

Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, and those extra 
columns/row groups happened to contain some corrupted data, which would also 
indicate an optimizer side issue (predicate push-down and column pruning).


was (Author: lian cheng):
Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
column(s)/row group(s) than Spark 2.2.1 for the same job, and those extra 
column(s)/row group(s) happened to contain some corrupted data, which would 
also indicate an optimizer side issue (predicate push-down and column pruning).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> 

[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678542#comment-16678542
 ] 

Cheng Lian commented on SPARK-25966:


Hey, [~andrioni], if you still have the original (potentially) corrupted 
Parquet files at hand, could you please try reading them again with Spark 2.4 
but with {{spark.sql.parquet.enableVectorizedReader}} set to {{false}}? In this 
way, we fall back to the vanilla {{parquet-mr}} 1.10 Parquet reader. If it 
works fine, it might be an issue in the vectorized reader.

Also, any chances that you can share a sample problematic file?

Since the same workload worked fine with Spark 2.2.1, I doubt whether this is 
really a file corruption issue. Unless somehow Spark 2.4 is reading more 
columns/row groups than Spark 2.2.1 for the same job, which would also indicate 
an optimizer side issue (predicate push-down and column pruning).
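For reference, a minimal spark-shell sketch of the suggested check: disable
the vectorized Parquet reader so reads fall back to the parquet-mr code path,
then re-read the suspect files (the input path below is a placeholder):
{code}
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.parquet("s3a://some-bucket/path/to/suspect-files").count()
{code}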

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> 

[jira] [Assigned] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-24927:
--

Assignee: Cheng Lian

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
>   at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
>   at 

[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24927:
---
Description: 
Reproduction:
{noformat}
wget 
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
wget 
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xzf spark-2.3.1-bin-without-hadoop.tgz
tar xzf hadoop-2.7.3.tar.gz

export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
...
scala> 
spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
{noformat}
Exception:
{noformat}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
  ... 69 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
  at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
  at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
  at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
  at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
  at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
  at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
  at 

[jira] [Commented] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16557603#comment-16557603
 ] 

Cheng Lian commented on SPARK-24927:


Downgraded from blocker to major, since it's not a regression. Just realized 
that this issue has existed since at least 1.6.
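As a hedged diagnostic sketch (not the eventual fix): since the
UnsatisfiedLinkError points at {{org.xerial.snappy.SnappyNative}}, one quick
check in the hadoop-provided spark-shell is to see which jar the Snappy
classes are actually loaded from:
{code}
val snappySource = Class.forName("org.xerial.snappy.Snappy")
  .getProtectionDomain.getCodeSource.getLocation
println(s"snappy-java loaded from: $snappySource")
{code}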

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
>   at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
>   at 
> 

[jira] [Updated] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24927:
---
Priority: Major  (was: Blocker)

> The hadoop-provided profile doesn't play well with Snappy-compressed Parquet 
> files
> --
>
> Key: SPARK-24927
> URL: https://issues.apache.org/jira/browse/SPARK-24927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Cheng Lian
>Priority: Major
>
> Reproduction:
> {noformat}
> wget 
> https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
> wget 
> https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
> tar xzf spark-2.3.1-bin-without-hadoop.tgz
> tar xzf hadoop-2.7.3.tar.gz
> export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
> ./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
> ...
> scala> 
> spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
> {noformat}
> Exception:
> {noformat}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>   ... 69 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>   at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>   at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>   at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
>   at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
>   at 
> org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
>   at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
>   at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
>   at 
> 

[jira] [Created] (SPARK-24927) The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files

2018-07-26 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-24927:
--

 Summary: The hadoop-provided profile doesn't play well with 
Snappy-compressed Parquet files
 Key: SPARK-24927
 URL: https://issues.apache.org/jira/browse/SPARK-24927
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.1, 2.3.2
Reporter: Cheng Lian


Reproduction:
{noformat}
wget 
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
wget 
https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xzf spark-2.3.1-bin-without-hadoop.tgz
tar xzf hadoop-2.7.3.tar.gz

export SPARK_DIST_CLASSPATH=$(hadoop-2.7.3/bin/hadoop classpath)
./spark-2.3.1-bin-without-hadoop/bin/spark-shell --master local
...
scala> 
spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
{noformat}
Exception:
{noformat}
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
  ... 69 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
  at 
org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
  at 
org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
  at 
org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
  at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
  at 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
  at 
org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
  at 
org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
  at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
  at 
org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
  at 
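
Until the snappy-java/classpath mismatch is sorted out, one way to sidestep the
crash is to avoid the Snappy code path entirely. A hedged workaround sketch (it
assumes a different Parquet codec is acceptable for the data being written):
{code}
// In the same spark-shell session started with SPARK_DIST_CLASSPATH set:
// write Parquet with a codec other than snappy so SnappyCompressor is never used.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")  // or "uncompressed"
spark.range(1).repartition(1).write.mode("overwrite").parquet("file:///tmp/test.parquet")
{code}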

[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-24895:
--

Assignee: Eric Chang

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24895:
---
Description: 
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:
{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{noformat}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.

  was:
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
repo have mismatched filenames:

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{code}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib-local_2.11</artifactId>
<version>2.4.0-SNAPSHOT</version>
<versioning>
<snapshot>
<timestamp>20180723.232411</timestamp>
<buildNumber>177</buildNumber>
</snapshot>
<lastUpdated>20180723232411</lastUpdated>
<snapshotVersions>
<snapshotVersion>
<extension>jar</extension>
<value>2.4.0-20180723.232411-177</value>
<updated>20180723232411</updated>
</snapshotVersion>
<snapshotVersion>
<extension>pom</extension>
<value>2.4.0-20180723.232411-177</value>
<updated>20180723232411</updated>
</snapshotVersion>
<snapshotVersion>
<classifier>tests</classifier>
<extension>jar</extension>
<value>2.4.0-20180723.232410-177</value>
<updated>20180723232411</updated>
</snapshotVersion>
<snapshotVersion>
<classifier>sources</classifier>
<extension>jar</extension>
<value>2.4.0-20180723.232410-177</value>
<updated>20180723232411</updated>
</snapshotVersion>
<snapshotVersion>
<classifier>test-sources</classifier>
<extension>jar</extension>
<value>2.4.0-20180723.232410-177</value>
<updated>20180723232411</updated>
</snapshotVersion>
</snapshotVersions>
</versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.


> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> 

[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372289#comment-16372289
 ] 

Cheng Lian commented on SPARK-19737:


[~LANDAIS Christophe], I filed SPARK-23486 for this. It should be relatively 
straightforward to fix, and I'd like to have a new contributor try it as a 
starter task. Thanks for reporting!

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.
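
A minimal sketch of the shape such a rule could take (simplified; the class name,
error handling, and registry callback below are illustrative assumptions, not the
final implementation):
{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Fails fast on unregistered function names without resolving relations,
// so no partition discovery is triggered for the referenced tables.
class LookupFunctionsSketch(isRegistered: String => Boolean) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case f: UnresolvedFunction if !isRegistered(f.name.funcName) =>
      sys.error(s"Undefined function: '${f.name.funcName}'")
  }
}
{code}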



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-23486:
---
Labels: starter  (was: )

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372285#comment-16372285
 ] 

Cheng Lian commented on SPARK-23486:


Please refer to [this 
comment|https://issues.apache.org/jira/browse/SPARK-19737?focusedCommentId=16371377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16371377]
 for more details.

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-21 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-23486:
--

 Summary: LookupFunctions should not check the same function name 
more than once
 Key: SPARK-23486
 URL: https://issues.apache.org/jira/browse/SPARK-23486
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.3.0
Reporter: Cheng Lian


For a query invoking the same function multiple times, the current 
{{LookupFunctions}} rule performs a check for each invocation. For users using 
Hive metastore as external catalog, this issues unnecessary metastore accesses 
and can slow down the analysis phase quite a bit.
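
A rough sketch of the intended behavior (a simplified helper, not the actual rule;
the {{functionExists}} callback stands in for the catalog/metastore lookup):
{code}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def checkFunctionsOnce(plan: LogicalPlan, functionExists: FunctionIdentifier => Boolean): Unit = {
  // Collect every unresolved function invocation in the plan...
  val invoked = plan.collect { case node =>
    node.expressions.flatMap(_.collect { case f: UnresolvedFunction => f.name })
  }.flatten
  // ...but hit the catalog only once per distinct function name.
  val missing = invoked.distinct.filterNot(functionExists)
  if (missing.nonEmpty) {
    sys.error(s"Undefined function(s): ${missing.map(_.funcName).mkString(", ")}")
  }
}
{code}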



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-22951.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20174
[https://github.com/apache/spark/pull/20174]

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Here is a minimal Spark application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}
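
A likely explanation (a sketch, not necessarily the final fix): {{dropDuplicates}}
on a zero-column DataFrame is planned as a global aggregate, and a global aggregate
over empty input always produces exactly one row. The spark-shell snippet below
illustrates that aggregate behavior in isolation:
{code}
// A global aggregate (no grouping columns) over an empty Dataset yields one row.
val empty = spark.range(0)          // an empty Dataset[java.lang.Long]
empty.count()                       // 0, as expected
empty.groupBy().count().show()      // prints a single row containing 0
{code}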



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-22951:
--

Assignee: Feng Liu

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
>
> Here is a minimal Spark application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Target Version/s: 2.3.0

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> Here is a minimal Spark application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Labels: correctness  (was: )

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> Here is a minimal Spark application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2017-07-17 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-9686:
-

Assignee: (was: Cheng Lian)

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5. Fetch the table list through JDBC metadata:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>       null, null, null);
> Problem:
>   No tables are returned by this API, which worked in Spark 1.3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043478#comment-16043478
 ] 

Cheng Lian commented on SPARK-20958:


[~marmbrus], here is the draft release note entry:
{quote}
SPARK-20958: For users who use parquet-avro together with Spark 2.2, please use 
parquet-avro 1.8.1 instead of parquet-avro 1.8.2. This is because parquet-avro 
1.8.2 upgrades avro from 1.7.6 to 1.8.1, which is backward incompatible with 
1.7.6.
{quote}
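
For users who manage Parquet/Avro dependencies explicitly, the pin might look like
this in an sbt build (a hedged sketch; the exact version overrides depend on the
rest of the build, and Maven users would add equivalent <dependency> entries):
{code}
// build.sbt fragment: keep parquet-avro on 1.8.1 so its Avro 1.7.x lineage
// stays compatible with the Avro 1.7.7 that spark-core 2.2.0 ships with.
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"
dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"
{code}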

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Labels: release-notes release_notes releasenotes  (was: release-notes)

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035149#comment-16035149
 ] 

Cheng Lian commented on SPARK-20958:


Thanks [~rdblue]! I'm also reluctant to roll it back considering those fixes we 
wanted so badly... We decided to give this a try because, from the perspective 
of release management, we'd like to avoid cutting a release with known 
conflicting dependencies, even transitive ones. For a Spark 2.2 user, it's 
quite natural to choose parquet-avro 1.8.2, which is part of parquet-mr 1.8.2, 
which in turn, is a direct dependency of Spark 2.2.0.

However, due to PARQUET-389, rolling back is already not an option. Two options 
I can see here are:

# Release Spark 2.2.0 as is with a statement in the release notes saying that 
users should use parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro 
compatibility issue.
# Wait for parquet-mr 1.8.3, which hopefully resolves this dependency issue 
(e.g., by reverting PARQUET-358).

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034310#comment-16034310
 ] 

Cheng Lian commented on SPARK-20958:


[~rdblue] I think the root cause here is we cherry-picked parquet-mr [PR 
#318|https://github.com/apache/parquet-mr/pull/318] to parquet-mr 1.8.2, and 
introduced this avro upgrade.

I tried to roll parquet-mr back to 1.8.1, but it doesn't work well because 
this brings back 
[PARQUET-389|https://issues.apache.org/jira/browse/PARQUET-389] and breaks some 
test cases involving schema evolution. 

It would be nice if we could have a parquet-mr 1.8.3 or 1.8.2.1 release that has 
[PR #318|https://github.com/apache/parquet-mr/pull/318] reverted from 1.8.2. I 
think cherry-picking that PR is also problematic for parquet-mr because it 
introduces a backward-incompatible dependency change in a maintenance release.

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Description: 
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 
and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.

  was:
We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.


> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-20958:
--

 Summary: Roll back parquet-mr 1.8.2 to parquet-1.8.1
 Key: SPARK-20958
 URL: https://issues.apache.org/jira/browse/SPARK-20958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20132) Add documentation for column string functions

2017-05-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20132:
---
Fix Version/s: 2.2.0

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 2.2.0, 2.3.0
>
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20246) Should check determinism when pushing predicates down through aggregation

2017-04-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20246:
---
Labels: correctness  (was: )

> Should check determinism when pushing predicates down through aggregation
> -
>
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weiluo Ren
>  Labels: correctness
>
> {code}import org.apache.spark.sql.functions._
> spark.range(1,1000).distinct.withColumn("random", 
> rand()).filter(col("random") > 0.3).orderBy("random").show{code}
> gives a wrong result.
>  In the optimized logical plan, it shows that the filter with the 
> non-deterministic predicate is pushed beneath the aggregate operator, which 
> should not happen.
> cc [~lian cheng]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20246) Should check determinism when pushing predicates down through aggregation

2017-04-06 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959946#comment-15959946
 ] 

Cheng Lian commented on SPARK-20246:


[This 
line|https://github.com/apache/spark/blob/a4491626ed8169f0162a0dfb78736c9b9e7fb434/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L795]
 should be the root cause. We didn't check determinism of the predicates before 
pushing them down.

The same thing also applies when pushing predicates through union and window 
operators.

cc [~cloud_fan]
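
Until the optimizer adds that determinism check, one way for users to sidestep the
wrong result is to materialize the non-deterministic column before filtering. A
hedged workaround sketch (it assumes caching the intermediate result is acceptable):
{code}
import org.apache.spark.sql.functions._

// Cache right after rand() is added so the later filter operates on the
// materialized values rather than re-evaluating rand() after a pushed-down filter.
val withRandom = spark.range(1, 1000).distinct.withColumn("random", rand()).cache()
withRandom.filter(col("random") > 0.3).orderBy("random").show()
{code}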

> Should check determinism when pushing predicates down through aggregation
> -
>
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weiluo Ren
>
> {code}import org.apache.spark.sql.functions._
> spark.range(1,1000).distinct.withColumn("random", 
> rand()).filter(col("random") > 0.3).orderBy("random").show{code}
> gives a wrong result.
>  In the optimized logical plan, it shows that the filter with the 
> non-deterministic predicate is pushed beneath the aggregate operator, which 
> should not happen.
> cc [~lian cheng]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19716:
---
Fix Version/s: (was: 2.3.0)
   2.2.0

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except struct type 
> fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19716.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17398
[https://github.com/apache/spark/pull/17398]

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except struct type 
> fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-04-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-19716:
--

Assignee: Wenchen Fan

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<a: int, b: int, c: int>>}}, and we want to convert it to a 
> Dataset with {{case class ComplexData(arr: Seq[Data])}}, we will fail. The 
> reason is that, to allow compatible types, e.g. converting {{a: int}} to 
> {{case class A(a: Long)}}, we add a cast for each field, except struct type 
> fields, because struct types are flexible and the number of columns can 
> mismatch. We should probably also skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing Hive metastore level partition pruning

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Summary: String literals are not escaped while performing Hive metastore 
level partition pruning  (was: String literals are not escaped while performing 
partition pruning at Hive metastore level)

> String literals are not escaped while performing Hive metastore level 
> partition pruning
> ---
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [unresolvedalias('a, None)]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   a: int
> [info]   Project [a#26]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Project [a#26]
> [info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
> [info]  +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Physical Plan ==
> [info]   *Project [a#26]
> [info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
> true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 
> 0, PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], 
> PushedFilters: [], ReadSchema: struct
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
> [info]struct<>   struct<>
> [info]   ![2]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Description: 
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("SPARK-19912") {
withTable("spark_19912") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", 
"q").saveAsTable("spark_19912")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}
The above test case fails like this:
{noformat}
[info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
[info]   Results do not match for query:
[info]   Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
[info]   Timezone Env:
[info]
[info]   == Parsed Logical Plan ==
[info]   'Project [unresolvedalias('a, None)]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Analyzed Logical Plan ==
[info]   a: int
[info]   Project [a#26]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Optimized Logical Plan ==
[info]   Project [a#26]
[info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
[info]  +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Physical Plan ==
[info]   *Project [a#26]
[info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], PushedFilters: 
[], ReadSchema: struct
[info]   == Results ==
[info]
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
[info]struct<>   struct<>
[info]   ![2]
{noformat}

  was:
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}


> String literals are not escaped while performing partition pruning at Hive 
> metastore level
> --
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [unresolvedalias('a, None)]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   a: int
> 

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Labels: correctness  (was: )

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.
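
Until this is fixed, reads from the persisted table can map the sentinel back to a
real NULL explicitly. A hedged workaround sketch (it assumes downstream code wants
NULL rather than the magic string):
{code}
import org.apache.spark.sql.functions._

// Replace the Hive default-partition sentinel with NULL after reading the table.
val cleaned = spark.table("test_null").withColumn(
  "a",
  when(col("a") === "__HIVE_DEFAULT_PARTITION__", lit(null)).otherwise(col("a")))

cleaned.filter(col("a").isNotNull).show(truncate = false)
{code}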



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19912:
--

 Summary: String literals are not escaped while performing 
partition pruning at Hive metastore level
 Key: SPARK-19912
 URL: https://issues.apache.org/jira/browse/SPARK-19912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.2.0
Reporter: Cheng Lian


{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive-style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}
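
For illustration only, here is a minimal, self-contained sketch of the kind of 
escaping that is missing; {{quoteStringLiteral}} is a hypothetical helper, not 
the actual change made to {{Shim_v0_13.convertFilters()}}:
{code}
// Hypothetical helper: escape backslashes and double quotes, then wrap the
// value in double quotes so it stays a single literal inside the predicate.
object PartitionFilterEscaping {
  def quoteStringLiteral(value: String): String =
    "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  def main(args: Array[String]): Unit = {
    val injected = "p1\" and q=\"q1"
    // Unescaped, the embedded quote terminates the literal early:
    println("p = \"" + injected + "\"")              // p = "p1" and q="q1"
    // Escaped, the whole value remains a single literal:
    println("p = " + quoteStringLiteral(injected))   // p = "p1\" and q=\"q1"
  }
}
{code}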



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Affects Version/s: 2.2.0

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19905:
--

 Summary: Dataset.inputFiles is broken for Hive SerDe tables
 Key: SPARK-19905
 URL: https://issues.apache.org/jira/browse/SPARK-19905
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The following snippet reproduces this issue:
{code}
spark.range(10).createOrReplaceTempView("t")
spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
spark.table("u").inputFiles.foreach(println)
{code}
In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
{noformat}
file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
{noformat}
on my laptop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned table, 
this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned table, 
this magic string is not interpreted as {{NULL}} but as a regular string.


> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, 

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned table, 
this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case of a persisted partitioned table, this magic string is not interpreted as 
{{NULL}} but as a regular string.



> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Summary: __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition 
value in partitioned persisted tables  (was: __HIVE_DEFAULT_PARTITION__ not 
interpreted as NULL partition value in partitioned persisted tables)

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned table uses magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19887:
--

 Summary: __HIVE_DEFAULT_PARTITION__ not interpreted as NULL 
partition value in partitioned persisted tables
 Key: SPARK-19887
 URL: https://issues.apache.org/jira/browse/SPARK-19887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Cheng Lian


The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case of a persisted partitioned table, this magic string is not interpreted as 
{{NULL}} but as a regular string.
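
For illustration, a minimal self-contained sketch (using a hypothetical 
{{decodePartitionValue}} helper, not Spark's actual partition-parsing code) of 
how the magic string would be mapped back to {{NULL}} when partition directory 
names are interpreted:
{code}
// Hypothetical helper: map a raw partition directory value back to the logical
// partition value, turning the magic string into NULL (None) as expected.
object PartitionValueDecoding {
  val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

  def decodePartitionValue(raw: String): Option[String] =
    if (raw == HiveDefaultPartition) None else Some(raw)

  def main(args: Array[String]): Unit = {
    // A directory name like a=__HIVE_DEFAULT_PARTITION__/b=3 should yield a = NULL.
    println(decodePartitionValue("__HIVE_DEFAULT_PARTITION__")) // None, i.e. NULL
    println(decodePartitionValue("p1"))                         // Some(p1)
  }
}
{code}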




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903749#comment-15903749
 ] 

Cheng Lian commented on SPARK-19887:


cc [~cloud_fan]

> __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in 
> partitioned persisted tables
> --
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned table uses magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19737.

Resolution: Fixed

Issue resolved by pull request 17168
[https://github.com/apache/spark/pull/17168]

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-06 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-19737:
--

Assignee: Cheng Lian

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, 

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-03-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an undefined 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, then it may take the analyzer a long time 
> before realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocation
> # Look up the function name from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Created] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19737:
--

 Summary: New analysis rule for reporting unregistered functions 
without relying on relation resolution
 Key: SPARK-19737
 URL: https://issues.apache.org/jira/browse/SPARK-19737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
 Fix For: 2.2.0


Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned  temporary view consisting of a large 
number of files stored on S3, then it may take the analyzer a long time before 
realizing that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.
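
For illustration only, a simplified self-contained sketch of the idea behind 
{{LookupFunctions}}, using toy {{UnresolvedFunction}} and registry stand-ins 
rather than the actual Catalyst rule: the check needs only function names, so 
it never resolves columns or relations.
{code}
// Toy model of the proposed rule: collect unresolved function names from a
// plan's expressions and fail fast if any name is missing from the registry.
object LookupFunctionsSketch {
  sealed trait Expr
  case class Column(name: String) extends Expr
  case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr

  def functionNames(e: Expr): Seq[String] = e match {
    case UnresolvedFunction(name, args) => name +: args.flatMap(functionNames)
    case _                              => Seq.empty
  }

  // Only the *names* are needed, so no column or relation resolution (and thus
  // no partition discovery) has to happen before the error is reported.
  def checkFunctions(expressions: Seq[Expr], registry: Set[String]): Unit = {
    val unknown = expressions.flatMap(functionNames).filterNot(registry)
    if (unknown.nonEmpty) {
      throw new IllegalArgumentException(s"Undefined function(s): ${unknown.mkString(", ")}")
    }
  }

  def main(args: Array[String]): Unit = {
    val registry = Set("upper", "lower")
    // SELECT foo(a) FROM t  ~~>  a single unresolved invocation of "foo"
    checkFunctions(Seq(UnresolvedFunction("foo", Seq(Column("a")))), registry)
  }
}
{code}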



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19529:
---
Target Version/s: 1.6.3, 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.1, 2.2.0)

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> that will eventually time out). This has a bad impact on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.
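
A minimal sketch of the suggested direction, written against Netty's public 
{{ChannelFuture}} API (illustrative only, not the actual Spark patch):
{code}
import java.io.IOException
import java.net.SocketAddress

import io.netty.bootstrap.Bootstrap
import io.netty.channel.{Channel, ChannelFuture}

object InterruptibleConnect {
  // await(timeout) throws InterruptedException when the calling thread is
  // interrupted, unlike awaitUninterruptibly(), so task cancellation propagates.
  @throws[IOException]
  @throws[InterruptedException]
  def connect(bootstrap: Bootstrap, address: SocketAddress, timeoutMs: Long): Channel = {
    val cf: ChannelFuture = bootstrap.connect(address)
    if (!cf.await(timeoutMs)) {
      throw new IOException(s"Connecting to $address timed out ($timeoutMs ms)")
    }
    if (!cf.isSuccess) {
      throw new IOException(s"Failed to connect to $address", cf.cause())
    }
    cf.channel()
  }
}
{code}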



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19529:
---
Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.0.3, 2.1.1)

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> that will eventually time out). This has a bad impact on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Fix Version/s: 2.1.1

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}
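
To make the reported compile error concrete, a small self-contained 
illustration (outside Spark) of the underlying type mismatch: a 
{{scala.collection.Map}} is not a subtype of the default (immutable) {{Map}} 
expected by the case class constructor.
{code}
// Self-contained illustration of the mismatch in the error message above: the
// case class parameter is the default (immutable) Map, while the value being
// passed is only a scala.collection.Map, which is not a subtype of it.
object MapTypeMismatch {
  case class Test(id: String, map_test: Map[Long, String]) // immutable.Map

  def main(args: Array[String]): Unit = {
    val general: scala.collection.Map[Long, String] = scala.collection.Map(1L -> "a")
    // Test("x", general)            // does not compile: found collection.Map, required immutable.Map
    val ok = Test("x", general.toMap) // an explicit conversion is required
    println(ok)
  }
}
{code}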



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2017-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18717:
---
Affects Version/s: 2.1.0

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17714) ClassCircularityError is thrown when using org.apache.spark.util.Utils.classForName 

2017-02-07 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856638#comment-15856638
 ] 

Cheng Lian commented on SPARK-17714:


Although I've no idea why this error occurs, it seems that setting the system 
property {{io.netty.noJavassist}} to {{true}} can work around this issue by 
disabling Javassist usage in Netty.
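
A minimal sketch of that workaround (illustrative only; the property has to be 
set before any Netty classes are loaded):
{code}
object NoJavassistWorkaround {
  def main(args: Array[String]): Unit = {
    // Disable Netty's Javassist-based TypeParameterMatcher generation, which is
    // where the ClassCircularityError in the stack trace below originates.
    System.setProperty("io.netty.noJavassist", "true")
    // ... then start the REPL / Spark code that previously failed ...
  }
}
{code}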

> ClassCircularityError is thrown when using 
> org.apache.spark.util.Utils.classForName 
> 
>
> Key: SPARK-17714
> URL: https://issues.apache.org/jira/browse/SPARK-17714
> Project: Spark
>  Issue Type: Bug
>Reporter: Weiqing Yang
>
> This jira is a follow-up to [SPARK-15857| 
> https://issues.apache.org/jira/browse/SPARK-15857] .
> Task invokes CallerContext.setCurrentContext() to set its callerContext to 
> HDFS. In setCurrentContext(), it looks up the class 
> {{org.apache.hadoop.ipc.CallerContext}} by using 
> {{org.apache.spark.util.Utils.classForName}}. This causes 
> ClassCircularityError to be thrown when running ReplSuite in master Maven 
> builds (the same tests pass in the SBT build). A hotfix 
> [SPARK-17710|https://issues.apache.org/jira/browse/SPARK-17710] has been made 
> by using Class.forName instead, but it needs further investigation.
> Error:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.3/2000/testReport/junit/org.apache.spark.repl/ReplSuite/simple_foreach_with_accumulator/
> {code}
> scala> accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, 
> name: None, value: 0)
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.ClassCircularityError: 
> io/netty/util/internal/_matchers_/org/apache/spark/network/protocol/MessageMatcher
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator... er
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
> at 
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:62)
> at 
> io.netty.util.internal.JavassistTypeParameterMatcherGenerator.generate(JavassistTypeParameterMatcherGenerator.java:54)
> at 
> io.netty.util.internal.TypeParameterMatcher.get(TypeParameterMatcher.java:42)
> at 
> io.netty.util.internal.TypeParameterMatcher.find(TypeParameterMatcher.java:78)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.<init>(MessageToMessageEncoder.java:59)
> at 
> org.apache.spark.network.protocol.MessageEncoder.<init>(MessageEncoder.java:34)
> at org.apache.spark.network.TransportContext.<init>(TransportContext.java:78)
> at 
> org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:354)
> at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:324)
> at 
> org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
> at 
> 

[jira] [Resolved] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-02-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18539.

  Resolution: Fixed
Assignee: Dongjoon Hyun
Target Version/s: 2.2.0

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Assignee: Dongjoon Hyun
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-02-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851965#comment-15851965
 ] 

Cheng Lian commented on SPARK-18539:


SPARK-19409 upgraded parquet-mr to 1.8.2, which fixed this issue.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2017-01-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840186#comment-15840186
 ] 

Cheng Lian commented on SPARK-18539:


[~viirya], sorry for the (super) late reply. What I mentioned was a *nullable* 
column instead of a *null* column. To be more specific, say we have two Parquet 
files:

- File {{A}} has a given set of columns
- File {{B}} has all of {{A}}'s columns plus an extra column {{c}}, which is 
marked as nullable (or {{optional}} in Parquet terms)

Then it should be fine to treat these two files as a single dataset whose merged 
schema includes {{c}}, and you should be able to push down predicates 
involving {{c}}.

BTW, the Parquet community just made a patch release, 1.8.2, that includes a fix 
for PARQUET-389, and we will probably upgrade to 1.8.2 in 2.2.0. Then we'll have 
a proper fix for this issue and can remove the workaround we did while doing 
schema merging.
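
A small illustration of the scenario above (a sketch only, run in a plain 
{{spark-shell}}; paths and column names are made up):
{code}
// File A: columns without c; file B: the same columns plus a nullable c.
val df = spark.range(3)

df.selectExpr("id AS a", "id AS b").write.parquet("/tmp/merge_demo")
df.selectExpr("id AS a", "id AS b", "id AS c").write.mode("append").parquet("/tmp/merge_demo")

// With schema merging enabled, the two files read as one dataset whose merged schema
// includes c, and predicates on c behave as described above.
spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/merge_demo")
  .filter("c IS NOT NULL")
  .show()
{code}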

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> 

[jira] [Resolved] (SPARK-19016) Document scalable partition handling feature in the programming guide

2016-12-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-19016.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16424
[https://github.com/apache/spark/pull/16424]

> Document scalable partition handling feature in the programming guide
> -
>
> Key: SPARK-19016
> URL: https://issues.apache.org/jira/browse/SPARK-19016
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, we only mention this in the migration guide. We should also 
> document it in the programming guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19016) Document scalable partition handling feature in the programming guide

2016-12-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19016:
--

 Summary: Document scalable partition handling feature in the 
programming guide
 Key: SPARK-19016
 URL: https://issues.apache.org/jira/browse/SPARK-19016
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.0, 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


Currently, we only mention this in the migration guide. We should also document 
it in the programming guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18956) Python API should reuse existing SparkSession while creating new SQLContext instances

2016-12-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18956:
--

 Summary: Python API should reuse existing SparkSession while 
creating new SQLContext instances
 Key: SPARK-18956
 URL: https://issues.apache.org/jira/browse/SPARK-18956
 Project: Spark
  Issue Type: Bug
Reporter: Cheng Lian


We did this for the Scala API in Spark 2.0 but didn't update the Python API 
accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18950:
---
Labels: starter  (was: )

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18950:
--

 Summary: Report conflicting fields when merging two StructTypes.
 Key: SPARK-18950
 URL: https://issues.apache.org/jira/browse/SPARK-18950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Priority: Minor


Currently, {{StructType.merge()}} only reports data types of conflicting fields 
when merging two incompatible schemas. It would be nice to also report the 
field names for easier debugging.
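
A sketch of how such a merge conflict typically surfaces to users (my own example, not 
from the ticket; run in a plain {{spark-shell}} with made-up paths):
{code}
val df = spark.range(1)

// Two Parquet files whose column "v" has incompatible types.
df.selectExpr("CAST(id AS INT) AS v").write.parquet("/tmp/merge_conflict")
df.selectExpr("CAST(id AS STRING) AS v").write.mode("append").parquet("/tmp/merge_conflict")

// At the time of this ticket, the failure message names only the conflicting data types
// (e.g. "Failed to merge incompatible data types IntegerType and StringType"), not the
// field "v" -- which is exactly what this ticket asks to improve.
spark.read.option("mergeSchema", "true").parquet("/tmp/merge_conflict").printSchema()
{code}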



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18753:
---
Fix Version/s: 2.2.0

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.1.0, 2.2.0
>
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing because 
> "true" is converted to null, so the filter expression is always null and all 
> rows are dropped.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` returns non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18753:
---
Assignee: Hyukjin Kwon

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0, 2.2.0
>
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing because 
> "true" is converted to null, so the filter expression is always null and all 
> rows are dropped.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` returns non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18753) Inconsistent behavior after writing to parquet files

2016-12-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18753.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16184
[https://github.com/apache/spark/pull/16184]

> Inconsistent behavior after writing to parquet files
> 
>
> Key: SPARK-18753
> URL: https://issues.apache.org/jira/browse/SPARK-18753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.1.0
>
>
> Found an inconsistent behavior when using parquet.
> {code}
> scala> val ds = Seq[java.lang.Boolean](new java.lang.Boolean(true), null: 
> java.lang.Boolean, new java.lang.Boolean(false)).toDS
> ds: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]
> scala> ds.filter('value === "true").show
> +-+
> |value|
> +-+
> +-+
> {code}
> In the above example, `ds.filter('value === "true")` returns nothing because 
> "true" is converted to null, so the filter expression is always null and all 
> rows are dropped.
> However, if I store `ds` in a parquet file and read it back, `filter('value 
> === "true")` returns non-null values.
> {code}
> scala> ds.write.parquet("testfile")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> scala> val ds2 = spark.read.parquet("testfile")
> ds2: org.apache.spark.sql.DataFrame = [value: boolean]
> scala> ds2.filter('value === "true").show
> +-+
> |value|
> +-+
> | true|
> |false|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724381#comment-15724381
 ] 

Cheng Lian edited comment on SPARK-18712 at 12/6/16 5:10 AM:
-

I think the contract here is that for a DataFrame {{df}} and one or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a total function over the output of {{df}}. Only then can we guarantee that the 
execution order of the filter predicates is irrelevant. This contract is 
important for optimizations like filter push-down: if we had to preserve the 
execution order of all filter predicates, we couldn't push down {{a}} in 
{{a AND b}} and would lose many optimization opportunities.

In the case of the snippet in the JIRA description, the first predicate is a 
total function while the second is only a partial function of the output of the 
original {{df}}.


was (Author: lian cheng):
I think the contract here is that for a DataFrame {{df}} and 1 or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a full function over the output of {{df}}. Only in this way, we can ensure that 
the execution order of all the filter predicates can be irrelevant. This 
contract is important for optimizations like filter push-down. If we have to 
preserve execution order of all filter predicates, you won't be able to push 
down {{a}} in {{a AND b}}, and lose lots of optimization opportunities.

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one 

[jira] [Commented] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724381#comment-15724381
 ] 

Cheng Lian commented on SPARK-18712:


I think the contract here is that for a DataFrame {{df}} and one or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a total function over the output of {{df}}. Only then can we guarantee that the 
execution order of the filter predicates is irrelevant. This contract is 
important for optimizations like filter push-down: if we had to preserve the 
execution order of all filter predicates, we couldn't push down {{a}} in 
{{a AND b}} and would lose many optimization opportunities.
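
A sketch of that contract in Scala (the ticket's snippet is PySpark, but the idea is 
the same; the names below are made up):
{code}
import org.apache.spark.sql.functions.udf

val table = Map(5L -> true, 6L -> false)

// Partial: throws NoSuchElementException for ids outside the table, so it is only safe
// if the isin() filter is guaranteed to run first -- which the optimizer does not promise.
val partialF = udf((id: Long) => table(id))

// Total: defined for every id, so the two filters can be evaluated in any order.
val totalF = udf((id: Long) => table.getOrElse(id, false))

val df = spark.range(10).filter('id.isin(5, 6))

df.filter(totalF('id)).count()   // safe regardless of push-down / reordering
{code}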

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one element in dataframe df, but the function f has been called 
> twice, so I guess no short circuit.
> the result seems to conflict with #SPARK-1461



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724013#comment-15724013
 ] 

Cheng Lian commented on SPARK-18539:


[~xwu0226], thanks for the new use case!

[~viirya], I do think this is a valid use case as long as all the missing 
columns are nullable. The only reason that this use case doesn't work right now 
is PARQUET-389.

I have a vague idea of a possible cleaner fix for this issue and will post it 
later.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Updated] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18730:
---
Priority: Minor  (was: Major)

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18730:
--

 Summary: Ask the build script to link to Jenkins test report page 
instead of full console output page when posting to GitHub
 Key: SPARK-18730
 URL: https://issues.apache.org/jira/browse/SPARK-18730
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, the full console output page of a Spark Jenkins PR build can be as 
large as several megabytes. It takes a relatively long time to load and may 
even freeze the browser for quite a while.

I'd suggest posting the test report page link to GitHub instead, which is way 
more concise and is usually the first page I'd like to check when investigating 
a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723781#comment-15723781
 ] 

Cheng Lian commented on SPARK-18539:


Please remind me if I missed anything important; otherwise, we can resolve this 
ticket as "Not a Problem".

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian edited comment on SPARK-18539 at 12/5/16 11:43 PM:
--

[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I no 
longer think this is a bug.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 
employs happens during the schema-merging phase. On the other hand, you can't 
expect missing columns to be read properly when schema merging is turned off. 
Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to the read to enable schema merging.


was (Author: lian cheng):
[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I 
don't think this is a bug now.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: because the trick 
PR #9940 employs happens during schema merging phase. On the other hand, you 
can't expect missing columns can be properly read when schema merging is turned 
off. Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy, just add 
{{.option("mergeSchema", "true")}} to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian commented on SPARK-18539:


[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I no 
longer think this is a bug.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 
employs happens during the schema-merging phase. On the other hand, you can't 
expect missing columns to be read properly when schema merging is turned off. 
Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to the read to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723718#comment-15723718
 ] 

Cheng Lian commented on SPARK-18539:


As commented on GitHub, there are two issues right now:
# This bug also affects the normal Parquet reader code path, where 
{{ParquetRecordReader}} is a 3rd-party class closed for modification, so we 
can't capture the exception there.
# [PR #9940|https://github.com/apache/spark/pull/9940] should have already 
fixed this issue, but somehow it is broken right now.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722891#comment-15722891
 ] 

Cheng Lian commented on SPARK-18539:


Haven't looked deeply into this issue, but my hunch is that this is related to 
https://issues.apache.org/jira/browse/PARQUET-389, which was fixed in 
parquet-mr 1.9.0, while Spark is still using 1.8 (in 2.1) and 1.7 (in 2.0).

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)

[jira] [Assigned] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-17213:
--

Assignee: Cheng Lian

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712707#comment-15712707
 ] 

Cheng Lian commented on SPARK-17213:


Agreed that we should disable string and binary filter pushdown for now until 
PARQUET-686 gets fixed.

We turned off Parquet filter pushdown for string and binary columns in 1.6 due 
to PARQUET-251 (see SPARK-11153). In Spark 2.1, we upgraded to Parquet 1.8.1 to 
get PARQUET-251 fixed, and then this issue popped up due to PARQUET-686. I think 
this also affects Spark 1.5.1 and prior versions.
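
Until that lands, here is a sketch of a per-session workaround a user could apply (assuming the reproduction from the description, run in a Spark shell; this is not the proposed fix):
{code}
// Hypothetical workaround only: disable Parquet filter pushdown so row groups
// are not skipped based on the (signed, hence broken) min/max statistics for
// string columns.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

spark.read.parquet("/tmp/bad").where("name > 'a'").count()  // expected: 1
{code}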

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9876) Upgrade parquet-mr to 1.8.1

2016-12-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-9876.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Upgrade parquet-mr to 1.8.1
> ---
>
> Key: SPARK-9876
> URL: https://issues.apache.org/jira/browse/SPARK-9876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Ryan Blue
> Fix For: 2.1.0
>
>
> {{parquet-mr}} 1.8.1 fixed several issues that affect Spark. For example 
> PARQUET-201 (SPARK-9407).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709869#comment-15709869
 ] 

Cheng Lian commented on SPARK-18251:


One more comment about why we shouldn't allow an {{Option\[T <: Product\]}} to 
be used as a top-level Dataset type: one way to think about this more intuitively 
is to make an analogy to databases. In a database table, you cannot mark a row 
itself as null; you are only allowed to mark a field of a row as null.

Instead of using {{Option\[T <: Product\]}}, the user should resort to 
{{Tuple1\[T <: Product\]}}. You then have a row consisting of a single field, 
which can hold either a null or a struct.
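
A minimal sketch of that suggestion, assuming the {{DataRow}} case class from the report and a Spark shell with {{spark.implicits._}} in scope:
{code}
// Sketch only: wrap the optional struct in Tuple1 instead of making Option
// the top-level Dataset type.
case class DataRow(id: Int, value: String)

import spark.implicits._

val ds = Seq(
  Tuple1(Option(DataRow(1, "a"))),
  Tuple1(None: Option[DataRow])
).toDS()

// The Dataset has a single nullable struct field `_1`; the absent value shows
// up as a null struct rather than a null row.
ds.show()
{code}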

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18251:
---
Assignee: Wenchen Fan

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-30 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-18251.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15979
[https://github.com/apache/spark/pull/15979]

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
> Fix For: 2.2.0
>
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684659#comment-15684659
 ] 

Cheng Lian edited comment on SPARK-18403 at 11/22/16 6:54 AM:
--

Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue steadily:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
this {{ArrayData}} at a later phase.

[~sameer] and [~davies], would you mind to have a look at this issue? Thanks!


was (Author: lian cheng):
Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue steadily:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
of this {{ArrayData}} at a later phase.

[~sameer] and [~davies], would you mind to have a look at this issue? Thanks!

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685389#comment-15685389
 ] 

Cheng Lian commented on SPARK-18403:


Figured it out. It's caused by a false sharing issue inside 
{{ObjectAggregationIterator}}. In short, after an {{UnsafeArrayData}} is set 
into an aggregation buffer (which is a safe row), the underlying buffer of the 
{{UnsafeArrayData}} gets overwritten when the iterator steps forward.

I have to say that this issue is pretty hard to debug. The large array allocation 
blows up the JVM right away, and you can't really find the large array in the 
heap dump since the allocation itself fails. As a result, all the heap dumps are 
super small (~70MB) compared to the heap size (3GB for default SBT tests), and 
nothing useful shows up in them.

I'm opening a PR to fix this issue.
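
For context, here is a self-contained sketch (plain Scala, no Spark internals) of the aliasing pattern behind this kind of bug: a value backed by a reused buffer has to be deep-copied before it is retained across iterator steps, otherwise later steps silently overwrite what was "stored" earlier.
{code}
// Illustrative only; names and setup are made up for this sketch.
val buffer = Array.fill(4)(0)

def rows: Iterator[Array[Int]] = Iterator(1, 2, 3).map { i =>
  java.util.Arrays.fill(buffer, i)  // the iterator reuses one backing buffer
  buffer
}

val leaked = rows.toList                 // every element aliases the same buffer
val copied = rows.map(_.clone()).toList  // defensive copy at each step

println(leaked.map(_.toSeq))  // all three rows show the last value written
println(copied.map(_.toSeq))  // rows with 1s, 2s, and 3s, as expected
{code}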

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18403) ObjectHashAggregateSuite is being flaky (occasional OOM errors)

2016-11-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684659#comment-15684659
 ] 

Cheng Lian commented on SPARK-18403:


Here is a minimal test case (add it to {{ObjectHashAggregateSuite}}) that can 
be used to reproduce this issue steadily:
{code}
test("oom") {
  withSQLConf(
SQLConf.USE_OBJECT_HASH_AGG.key -> "true",
SQLConf.OBJECT_AGG_SORT_BASED_FALLBACK_THRESHOLD.key -> "1"
  ) {
Seq(Tuple1(Seq.empty[Int]))
  .toDF("c0")
  .groupBy(lit(1))
  .agg(typed_count($"c0"), max($"c0"))
  .show()
  }
}
{code}
What I observed is that the partial aggregation phase produces a malformed 
{{UnsafeRow}} after applying the {{resultProjection}} 
[here|https://github.com/apache/spark/blob/07beb5d21c6803e80733149f1560c71cd3cacc86/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggregationIterator.scala#L254].

When printed, the malformed {{UnsafeRow}} is always
{noformat}
[0,0,28,280008,100,5a5a5a5a5a5a5a5a]
{noformat}
The {{5a5a5a5a5a5a5a5a}} is interpreted as the length of an {{ArrayData}}. 
Therefore, the JVM blows up when trying to allocate a huge array to deep copy 
of this {{ArrayData}} at a later phase.

[~sameer] and [~davies], would you mind to have a look at this issue? Thanks!

> ObjectHashAggregateSuite is being flaky (occasional OOM errors)
> ---
>
> Key: SPARK-18403
> URL: https://issues.apache.org/jira/browse/SPARK-18403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> This test suite fails occasionally on Jenkins due to OOM errors. I've already 
> reproduced it locally but haven't figured out the root cause.
> We should probably disable it temporarily before getting it fixed so that it 
> doesn't break the PR build too often.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677469#comment-15677469
 ] 

Cheng Lian commented on SPARK-11785:


I'm not sure which PR fixed this issue, though.

> When deployed against remote Hive metastore with lower versions, JDBC 
> metadata calls throws exception
> -
>
> Key: SPARK-11785
> URL: https://issues.apache.org/jira/browse/SPARK-11785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> To reproduce this issue with 1.7-SNAPSHOT
> # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service 
> metastore}}
> # Configures remote Hive metastore in {{conf/hive-site.xml}} by pointing 
> {{hive.metastore.uris}} to metastore endpoint (e.g. 
> {{thrift://localhost:9083}})
> # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and 
> {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}}
> # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}}
> # Run the testing JDBC client program attached at the end
> Exception thrown from client side:
> {noformat}
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: org.apache.thrift.protocol.TProtocolException: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018)
> at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> {noformat}
> Exception thrown from server side:
> {noformat}
> 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost 
> connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_schema_with_environment_context'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> 

[jira] [Commented] (SPARK-11785) When deployed against remote Hive metastore with lower versions, JDBC metadata calls throws exception

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677468#comment-15677468
 ] 

Cheng Lian commented on SPARK-11785:


Confirmed that this is no longer an issue in 2.1.

> When deployed against remote Hive metastore with lower versions, JDBC 
> metadata calls throws exception
> -
>
> Key: SPARK-11785
> URL: https://issues.apache.org/jira/browse/SPARK-11785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> To reproduce this issue with 1.7-SNAPSHOT
> # Start Hive 0.13.1 metastore service using {{$HIVE_HOME/bin/hive --service 
> metastore}}
> # Configures remote Hive metastore in {{conf/hive-site.xml}} by pointing 
> {{hive.metastore.uris}} to metastore endpoint (e.g. 
> {{thrift://localhost:9083}})
> # Set {{spark.sql.hive.metastore.version}} to {{0.13.1}} and 
> {{spark.sql.hive.metastore.jars}} to {{maven}} in {{conf/spark-defaults.conf}}
> # Start Thrift server using {{$SPARK_HOME/sbin/start-thriftserver.sh}}
> # Run the testing JDBC client program attached at the end
> Exception thrown from client side:
> {noformat}
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> java.sql.SQLException: Could not create ResultSet: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:273)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: org.apache.thrift.protocol.TProtocolException: Required field 
> 'operationHandle' is unset! 
> Struct:TGetResultSetMetadataReq(operationHandle:null)
> at 
> org.apache.hive.service.cli.thrift.TGetResultSetMetadataReq.validate(TGetResultSetMetadataReq.java:290)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.validate(TCLIService.java:12041)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12098)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args$GetResultSetMetadata_argsStandardScheme.write(TCLIService.java:12067)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$GetResultSetMetadata_args.write(TCLIService.java:12018)
> at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.send_GetResultSetMetadata(TCLIService.java:472)
> at 
> org.apache.hive.service.cli.thrift.TCLIService$Client.GetResultSetMetadata(TCLIService.java:464)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:242)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet.(HiveQueryResultSet.java:188)
> at 
> org.apache.hive.jdbc.HiveQueryResultSet$Builder.build(HiveQueryResultSet.java:170)
> at 
> org.apache.hive.jdbc.HiveDatabaseMetaData.getColumns(HiveDatabaseMetaData.java:222)
> at JDBCExperiments$.main(JDBCExperiments.scala:28)
> at JDBCExperiments.main(JDBCExperiments.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> {noformat}
> Exception thrown from server side:
> {noformat}
> 15/11/18 02:27:01 WARN RetryingMetaStoreClient: MetaStoreClient lost 
> connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'get_schema_with_environment_context'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> 

[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian edited comment on SPARK-18251 at 11/18/16 6:38 PM:
--

I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduces inconsistency or requires us to further 
special case schema inference for top level {{Dataset}} types.
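
To make the schema inference point concrete, a small sketch showing what Catalyst currently infers for such a case class (assuming a Spark shell; not taken from the Spark test suite):
{code}
import org.apache.spark.sql.Encoders

case class A(i: Option[Int])

println(Encoders.product[A].schema.treeString)
// root
//  |-- i: integer (nullable = true)
{code}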



was (Author: lian cheng):
I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> 

[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian edited comment on SPARK-18251 at 11/18/16 6:37 PM:
--

I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always translates {{Option\[T\]}} to the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.



was (Author: lian cheng):
I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always treats {{Option\[T\]}} as the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 

[jira] [Commented] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-18 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677396#comment-15677396
 ] 

Cheng Lian commented on SPARK-18251:


I'd prefer option 1 because of consistency of the semantics, and I don't think 
this is really a bug since {{Option\[T\]}} shouldn't be used as top level 
{{Dataset}} types anyway.

While doing schema inference, Catalyst always treats {{Option\[T\]}} as the 
nullable version of {{T'}}, where {{T'}} is the inferred data type of {{T}}. 
Take {{case class A(i: Option\[Int\])}} as an example, if we go for option 2, 
then what should the inferred schema of {{A}} be? To keep the original 
semantics, it should be
{noformat}
new StructType()
  .add("i", IntegerType, nullable = true)
{noformat}
while option 2 requires
{noformat}
new StructType()
  .add("i", new StructType()
.add("value", IntegerType, nullable = true), nullable = true)
{noformat}
since now {{Option\[T\]}} is treated as a single field struct.

Option 1 keeps the current semantics, which is pretty clear and easy to reason 
about, while option 2 either introduce inconsistency or requires further 
special casing schema inference for top level {{Dataset}} types.


> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an Empty 
> object instance for an Option type that is holding non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then, DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold Empty. If it does so, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


