[jira] [Created] (SPARK-6581) Metadata is missing when saving parquet file using hadoop 1.0.4

2015-03-27 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-6581:
--

 Summary: Metadata is missing when saving parquet file using hadoop 
1.0.4
 Key: SPARK-6581
 URL: https://issues.apache.org/jira/browse/SPARK-6581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: hadoop 1.0.4
Reporter: Pei-Lun Lee


When saving a parquet file with {code}df.save("foo", "parquet"){code}
it generates only _common_metadata, while _metadata is missing:
{noformat}
-rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}

If saving with {code}df.save("foo", "parquet", SaveMode.Overwrite){code}, both
_metadata and _common_metadata are missing:
{noformat}
-rwxrwxrwx  1 peilunlee  staff0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
{noformat}
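
For completeness, a minimal end-to-end sketch of the reproduction (a hedged sketch: the sample DataFrame is an assumption, only the save calls and hadoop 1.0.4 matter):
{code}
// Hedged reproduction sketch on hadoop 1.0.4; the sample data is illustrative.
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
df.save("foo", "parquet")                      // writes _common_metadata but no _metadata
df.save("foo", "parquet", SaveMode.Overwrite)  // writes neither summary file
{code}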




[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-03-23 Thread Pei-Lun Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377264#comment-14377264
 ] 

Pei-Lun Lee commented on SPARK-6352:


The above PR adds a new hadoop config value 
spark.sql.parquet.output.committer.class that lets users select the output 
committer class used when saving parquet files, similar to SPARK-3595. The base 
class is ParquetOutputCommitter. There is also a DirectParquetOutputCommitter 
added for file systems like S3.
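
For illustration, a minimal sketch of how the new config might be used (a hedged sketch: the committer's fully qualified class name and the S3 path are assumptions based on the description above):
{code}
// Hedged usage sketch: select a non-default committer via the new config key.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")  // assumed FQCN

val df = sqlContext.table("some_table")       // any DataFrame
df.saveAsParquetFile("s3n://bucket/output")   // assumed S3 output path
{code}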

 Supporting non-default OutputCommitter when using saveAsParquetFile
 ---

 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.2.1, 1.3.0
Reporter: Pei-Lun Lee

 SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would 
 be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
 have a fully customizable OutputCommitter solution, but at least adding 
 something like a DirectParquetOutputCommitter and letting users choose between 
 this and the default should be enough.




[jira] [Created] (SPARK-6408) JDBCRDD fails on where clause with string literal

2015-03-19 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-6408:
--

 Summary: JDBCRDD fails on where clause with string literal
 Key: SPARK-6408
 URL: https://issues.apache.org/jira/browse/SPARK-6408
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Pei-Lun Lee
Priority: Critical


The generated SQL query string is incorrect when filtering on string literals.

{code}where foo='bar'{code} results in {code}where foo=bar{code}

The following snippet reproduces the bug:
{code}
$ SPARK_CLASSPATH=h2-1.4.186.jar spark/bin/spark-shell

import java.sql.DriverManager
val url = "jdbc:h2:mem:testdb0"
Class.forName("org.h2.Driver")
val conn = DriverManager.getConnection(url)
conn.prepareStatement("create schema test").executeUpdate()
conn.prepareStatement("create table test.people (name TEXT(32) NOT NULL, theid INTEGER NOT NULL)").executeUpdate()
conn.prepareStatement("insert into test.people values ('fred', 1)").executeUpdate()
conn.commit()
sql(s"""
CREATE TEMPORARY TABLE foobar
USING org.apache.spark.sql.jdbc
OPTIONS (url '$url', dbtable 'TEST.PEOPLE')
""")
sql("select * from foobar where NAME='fred'").collect

15/03/19 06:34:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.h2.jdbc.JdbcSQLException: Column "FRED" not found; SQL statement:
SELECT NAME,THEID FROM TEST.PEOPLE WHERE NAME = fred [42122-186]
{code}

Note that other data types likely have a similar problem.
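
A sketch of the kind of fix this needs: when the filter is compiled into the SQL string, string values should be quoted and escaped before being embedded. The helpers below are illustrative names, not confirmed JDBCRDD internals:
{code}
// Illustrative only: quote and escape string literals when building the WHERE
// clause; compileValue and escapeSql are hypothetical names.
def escapeSql(value: String): String = value.replace("'", "''")

def compileValue(value: Any): Any = value match {
  case s: String => s"'${escapeSql(s)}'"  // yields foo='bar' instead of foo=bar
  case other     => other
}
{code}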





[jira] [Created] (SPARK-6351) ParquetRelation2

2015-03-16 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-6351:
--

 Summary: ParquetRelation2 
 Key: SPARK-6351
 URL: https://issues.apache.org/jira/browse/SPARK-6351
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Pei-Lun Lee


With the new parquet API, if multiple parquet paths on different file systems 
(e.g. s3n and hdfs) are given, an IllegalArgumentException: Wrong FS error is 
raised. This happens because the FileSystem is created from the hadoop 
configuration, which only allows one default file system. This worked in the 
old API.
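
A hedged sketch of a likely remedy (not the actual Spark code): qualify each input path with the FileSystem matching its own scheme instead of the configuration's single default FS. The example paths are assumptions:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative: resolve every path against its own scheme's FileSystem,
// so s3n:// and hdfs:// inputs can coexist in one relation.
val conf = new Configuration()
val paths = Seq("s3n://bucket/data", "hdfs://namenode/data")  // assumed examples
val qualified = paths.map { p =>
  val path = new Path(p)
  val fs = path.getFileSystem(conf)  // FS chosen by the path's scheme, not the default
  path.makeQualified(fs.getUri, fs.getWorkingDirectory)
}
{code}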




[jira] [Updated] (SPARK-6351) ParquetRelation2 does not support paths for different file systems

2015-03-16 Thread Pei-Lun Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei-Lun Lee updated SPARK-6351:
---
Summary: ParquetRelation2 does not support paths for different file systems 
 (was: ParquetRelation2 )

 ParquetRelation2 does not support paths for different file systems
 --

 Key: SPARK-6351
 URL: https://issues.apache.org/jira/browse/SPARK-6351
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Pei-Lun Lee

 With the new parquet API, if multiple parquet paths on different file systems 
 (e.g. s3n and hdfs) are given, an IllegalArgumentException: Wrong FS error is 
 raised. This happens because the FileSystem is created from the hadoop 
 configuration, which only allows one default file system. This worked in the 
 old API.




[jira] [Created] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-03-16 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-6352:
--

 Summary: Supporting non-default OutputCommitter when using 
saveAsParquetFile
 Key: SPARK-6352
 URL: https://issues.apache.org/jira/browse/SPARK-6352
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1, 1.1.1, 1.3.0
Reporter: Pei-Lun Lee


SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would be 
nice to have similar behavior in saveAsParquetFile. It may be difficult to have 
a fully customizable OutputCommitter solution, but at least adding something 
like a DirectParquetOutputCommitter and letting users choose between this and 
the default should be enough.




[jira] [Created] (SPARK-3947) [Spark SQL] UDAF Support

2014-10-14 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-3947:
--

 Summary: [Spark SQL] UDAF Support
 Key: SPARK-3947
 URL: https://issues.apache.org/jira/browse/SPARK-3947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Pei-Lun Lee


Right now only Hive UDAFs are supported. It would be nice to have UDAF support 
similar to UDF support, through SQLContext.registerFunction.
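
For comparison, this is how a scalar UDF registers today, with a hypothetical UDAF registration sketched alongside (the aggregate call does not exist; it is illustrative only):
{code}
// Existing scalar UDF registration:
sqlContext.registerFunction("len", (s: String) => s.length)

// Hypothetical UDAF registration in the same spirit -- illustrative only,
// no such method exists today:
// sqlContext.registerAggregateFunction("strLenSum", new StrLenSumUDAF())
{code}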




[jira] [Created] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error

2014-09-03 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-3371:
--

 Summary: Spark SQL: Renaming a function expression with group by 
gives error
 Key: SPARK-3371
 URL: https://issues.apache.org/jira/browse/SPARK-3371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee


{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
{code}

Running the above code in spark-shell gives the following error:

{noformat}
14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: foo#0
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{noformat}

Removing the "as a" alias from the query causes no error.
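
That is, the following variant of the query runs without the exception:
{code}
sqlContext.sql("select len(foo), count(1) from t1 group by len(foo)").collect()
{code}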




[jira] [Created] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2014-07-15 Thread Pei-Lun Lee (JIRA)
Pei-Lun Lee created SPARK-2489:
--

 Summary: Unsupported parquet datatype optional fixed_len_byte_array
 Key: SPARK-2489
 URL: https://issues.apache.org/jira/browse/SPARK-2489
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee


tested against commit 9fe693b5

{noformat}
scala> sqlContext.parquetFile("/tmp/foo")
java.lang.RuntimeException: Unsupported parquet datatype optional 
fixed_len_byte_array(4) b
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
{noformat}

Example Avro schema:
{noformat}
protocol Test {
fixed Bytes4(4);
record Foo {
union {null, Bytes4} b;
}
}
{noformat}
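
For reference, a hedged sketch of how such a file could be produced from that schema with parquet-avro (old parquet.* packaging; the output path and byte values are assumptions):
{code}
// Hedged reproduction sketch; assumes parquet-avro on the classpath.
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter

val schema = new Schema.Parser().parse(
  """{"type": "record", "name": "Foo", "fields": [
    |  {"name": "b", "type": ["null", {"type": "fixed", "name": "Bytes4", "size": 4}]}
    |]}""".stripMargin)

val fixedSchema = schema.getField("b").schema.getTypes.get(1)  // the Bytes4 branch
val record = new GenericData.Record(schema)
record.put("b", new GenericData.Fixed(fixedSchema, Array[Byte](1, 2, 3, 4)))

val writer = new AvroParquetWriter[GenericData.Record](new Path("/tmp/foo"), schema)
writer.write(record)
writer.close()
// Reading /tmp/foo back with sqlContext.parquetFile then hits the error above.
{code}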


