[jira] [Created] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4
Pei-Lun Lee created SPARK-6581:
----------------------------------

             Summary: Metadata is missing when saving parquet file using hadoop 1.0.4
                 Key: SPARK-6581
                 URL: https://issues.apache.org/jira/browse/SPARK-6581
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
         Environment: hadoop 1.0.4
            Reporter: Pei-Lun Lee


When saving a parquet file with
{code}
df.save("foo", "parquet")
{code}
only _common_metadata is generated, while _metadata is missing:
{noformat}
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}
When saving with
{code}
df.save("foo", "parquet", SaveMode.Overwrite)
{code}
both _metadata and _common_metadata are missing:
{noformat}
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}
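A minimal self-contained repro sketch, assuming a Spark 1.3 spark-shell with the default sqlContext; the output path /tmp/foo and the sample rows are hypothetical:
{code}
import org.apache.spark.sql.SaveMode

// Build a small DataFrame and save it as parquet twice; under hadoop 1.0.4
// the first call produces _common_metadata but no _metadata, and the
// overwrite call produces neither summary file.
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
df.save("/tmp/foo", "parquet")
df.save("/tmp/foo", "parquet", SaveMode.Overwrite)
{code}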
[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377264#comment-14377264 ]

Pei-Lun Lee commented on SPARK-6352:
------------------------------------

The above PR adds a new hadoop config value, spark.sql.parquet.output.committer.class, to let the user select the output committer class used when saving a parquet file, similar to SPARK-3595. The base class is ParquetOutputCommitter. A DirectParquetOutputCommitter is also added for file systems like S3.

Supporting non-default OutputCommitter when using saveAsParquetFile
-------------------------------------------------------------------

                 Key: SPARK-6352
                 URL: https://issues.apache.org/jira/browse/SPARK-6352
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.1, 1.2.1, 1.3.0
            Reporter: Pei-Lun Lee

SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. While a fully customizable OutputCommitter solution may be difficult, adding something like a DirectParquetOutputCommitter and letting users choose between it and the default should be enough.
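A hedged sketch of how the new config value might be used once the PR lands; the DirectParquetOutputCommitter package and class name are taken from the discussion above and may differ in the merged version:
{code}
// Select a non-default committer for parquet output via the hadoop conf.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

// Subsequent parquet saves would then go through the chosen committer,
// e.g. writing directly to an object store (bucket name hypothetical).
df.saveAsParquetFile("s3n://my-bucket/out")
{code}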
[jira] [Created] (SPARK-6408) JDBCRDD fails on where clause with string literal
Pei-Lun Lee created SPARK-6408:
----------------------------------

             Summary: JDBCRDD fails on where clause with string literal
                 Key: SPARK-6408
                 URL: https://issues.apache.org/jira/browse/SPARK-6408
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Pei-Lun Lee
            Priority: Critical


The generated SQL query string is incorrect when filtering on string literals:
{code}
where foo='bar'
{code}
results in
{code}
where foo=bar
{code}
The following snippet reproduces the bug:
{code}
$ SPARK_CLASSPATH=h2-1.4.186.jar spark/bin/spark-shell

import java.sql.DriverManager
val url = "jdbc:h2:mem:testdb0"
Class.forName("org.h2.Driver")
val conn = DriverManager.getConnection(url)
conn.prepareStatement("create schema test").executeUpdate()
conn.prepareStatement("create table test.people (name TEXT(32) NOT NULL, theid INTEGER NOT NULL)").executeUpdate()
conn.prepareStatement("insert into test.people values ('fred', 1)").executeUpdate()
conn.commit()

sql(s"""
  CREATE TEMPORARY TABLE foobar
  USING org.apache.spark.sql.jdbc
  OPTIONS (url '$url', dbtable 'TEST.PEOPLE')
""")
sql("select * from foobar where NAME='fred'").collect

15/03/19 06:34:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.h2.jdbc.JdbcSQLException: Column "FRED" not found; SQL statement:
SELECT NAME,THEID FROM TEST.PEOPLE WHERE NAME = fred [42122-186]
{code}
Other data types likely have a similar problem.
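For context, a hedged sketch of the kind of fix this needs in JDBCRDD's filter compilation: string values must be quoted (with embedded quotes escaped) before being spliced into the WHERE clause. The helper below is illustrative, not Spark's actual code:
{code}
// Render a Scala filter value as a SQL literal: quote strings and escape
// embedded single quotes; pass numbers and other values through unchanged.
def compileValue(value: Any): Any = value match {
  case s: String => "'" + s.replace("'", "''") + "'"
  case other     => other
}

// compileValue("fred") == "'fred'", so the generated clause becomes
// WHERE NAME = 'fred' instead of WHERE NAME = fred.
{code}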
[jira] [Created] (SPARK-6351) ParquetRelation2
Pei-Lun Lee created SPARK-6351:
----------------------------------

             Summary: ParquetRelation2
                 Key: SPARK-6351
                 URL: https://issues.apache.org/jira/browse/SPARK-6351
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Pei-Lun Lee


With the new parquet API, if multiple parquet paths on different file systems (e.g. s3n and hdfs) are given, an "IllegalArgumentException: Wrong FS" error is raised. This is because the FileSystem is created from the hadoop configuration, which only allows one default file system. This worked in the old API.
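A hedged sketch of the underlying problem and the usual remedy: FileSystem.get(conf) resolves only the configured default scheme, whereas resolving the file system from each Path handles mixed schemes. The paths below are hypothetical:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
val paths = Seq(new Path("hdfs://namenode/data/a"),
                new Path("s3n://bucket/data/b"))

// Resolve the file system per path instead of using the single default
// file system; FileSystem.get(conf) would throw "Wrong FS" for whichever
// scheme is not the configured default.
val statuses = paths.flatMap { p =>
  val fs = p.getFileSystem(conf)
  fs.listStatus(p)
}
{code}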
[jira] [Updated] (SPARK-6351) ParquetRelation2 does not support paths for different file systems
[ https://issues.apache.org/jira/browse/SPARK-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei-Lun Lee updated SPARK-6351:
-------------------------------
    Summary: ParquetRelation2 does not support paths for different file systems  (was: ParquetRelation2)

ParquetRelation2 does not support paths for different file systems
-------------------------------------------------------------------

                 Key: SPARK-6351
                 URL: https://issues.apache.org/jira/browse/SPARK-6351
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Pei-Lun Lee

With the new parquet API, if multiple parquet paths on different file systems (e.g. s3n and hdfs) are given, an "IllegalArgumentException: Wrong FS" error is raised. This is because the FileSystem is created from the hadoop configuration, which only allows one default file system. This worked in the old API.
[jira] [Created] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
Pei-Lun Lee created SPARK-6352:
----------------------------------

             Summary: Supporting non-default OutputCommitter when using saveAsParquetFile
                 Key: SPARK-6352
                 URL: https://issues.apache.org/jira/browse/SPARK-6352
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.1, 1.1.1, 1.3.0
            Reporter: Pei-Lun Lee


SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. While a fully customizable OutputCommitter solution may be difficult, adding something like a DirectParquetOutputCommitter and letting users choose between it and the default should be enough.
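A hedged sketch of what a DirectParquetOutputCommitter could look like: tasks write straight to the final output path, with no _temporary directory and no rename on commit, which suits object stores like S3 where renames are slow copies. This is illustrative only, not Spark's actual implementation, and omits job-level summary-file handling:
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import parquet.hadoop.ParquetOutputCommitter

class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)
  extends ParquetOutputCommitter(outputPath, context) {

  // Tasks write directly to the destination instead of a temp directory.
  override def getWorkPath: Path = outputPath

  // With no temp output to promote, task commit becomes a no-op.
  override def setupTask(ctx: TaskAttemptContext): Unit = ()
  override def needsTaskCommit(ctx: TaskAttemptContext): Boolean = false
  override def commitTask(ctx: TaskAttemptContext): Unit = ()
  override def abortTask(ctx: TaskAttemptContext): Unit = ()
}
{code}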
[jira] [Created] (SPARK-3947) [Spark SQL] UDAF Support
Pei-Lun Lee created SPARK-3947:
----------------------------------

             Summary: [Spark SQL] UDAF Support
                 Key: SPARK-3947
                 URL: https://issues.apache.org/jira/browse/SPARK-3947
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Pei-Lun Lee


Right now only Hive UDAFs are supported. It would be nice to have UDAF support similar to UDFs, through SQLContext.registerFunction.
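For comparison, this is how a scalar UDF is registered today through SQLContext.registerFunction; a UDAF API could mirror this shape. The table people and the registerAggregateFunction call are hypothetical, sketched for illustration only:
{code}
// Existing scalar UDF registration (Spark 1.x).
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(name) from people")

// Hypothetical UDAF counterpart, not an existing API:
// sqlContext.registerAggregateFunction("myAvg", new MyAverageAggregator)
{code}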
[jira] [Created] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
Pei-Lun Lee created SPARK-3371:
----------------------------------

             Summary: Spark SQL: Renaming a function expression with group by gives error
                 Key: SPARK-3371
                 URL: https://issues.apache.org/jira/browse/SPARK-3371
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Pei-Lun Lee


{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
{code}
Running the above code in spark-shell gives the following error:
{noformat}
14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{noformat}
Removing "as a" from the query causes no error.
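A hedged workaround sketch, untested against this Spark version: alias the expression inside a subquery and group by the alias, so the outer aggregation never has to re-resolve len(foo):
{code}
// t1 and len are from the repro above; tmp is an arbitrary subquery alias.
sqlContext.sql(
  "select a, count(1) from (select len(foo) as a from t1) tmp group by a"
).collect()
{code}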
[jira] [Created] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array
Pei-Lun Lee created SPARK-2489:
----------------------------------

             Summary: Unsupported parquet datatype optional fixed_len_byte_array
                 Key: SPARK-2489
                 URL: https://issues.apache.org/jira/browse/SPARK-2489
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Pei-Lun Lee


Tested against commit 9fe693b5:
{noformat}
scala> sqlContext.parquetFile("/tmp/foo")
java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b
  at scala.sys.package$.error(package.scala:27)
  at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
  at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
  at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
  at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
{noformat}
Example avro schema:
{noformat}
protocol Test {
  fixed Bytes4(4);

  record Foo {
    union {null, Bytes4} b;
  }
}
{noformat}
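A hedged sketch of the direction a fix could take: map parquet's fixed-length byte arrays to Spark SQL's BinaryType, as for plain binary, instead of raising an error. The function below is a simplified stand-in for ParquetTypesConverter.toPrimitiveDataType, matching on type names rather than parquet's PrimitiveTypeName enum:
{code}
import org.apache.spark.sql.catalyst.types._  // DataType location in Spark 1.x

def toPrimitiveDataType(parquetType: String): DataType = parquetType match {
  case "boolean"                         => BooleanType
  case "int32"                           => IntegerType
  case "int64"                           => LongType
  case "float"                           => FloatType
  case "double"                          => DoubleType
  // Treat fixed_len_byte_array like binary instead of erroring out.
  case "binary" | "fixed_len_byte_array" => BinaryType
  case other => sys.error(s"Unsupported parquet datatype $other")
}
{code}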