[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1

2015-11-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Fix Version/s: 1.6.0 > Parquet writer version fixed as versi

[jira] [Resolved] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)

2015-11-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11692. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9658 [https

[jira] [Updated] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)

2015-11-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11692: --- Assignee: Hyukjin Kwon > Support for Parquet logical types, JSON and BSON (embedded ty

[jira] [Resolved] (SPARK-11044) Parquet writer version fixed as version1

2015-11-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11044. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9060 [https

[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1

2015-11-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Assignee: Hyukjin Kwon > Parquet writer version fixed as versi

[jira] [Comment Edited] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-15 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005023#comment-15005023 ] Cheng Lian edited comment on SPARK-11153 at 11/15/15 11:4

[jira] [Updated] (SPARK-11694) Parquet logical types are not being tested properly

2015-11-14 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11694: --- Assignee: Hyukjin Kwon > Parquet logical types are not being tested prope

[jira] [Resolved] (SPARK-11694) Parquet logical types are not being tested properly

2015-11-14 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11694. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9660 [https

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005023#comment-15005023 ] Cheng Lian commented on SPARK-11153: Good question. We tried, see [PR #9225|h

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005005#comment-15005005 ] Cheng Lian commented on SPARK-11153: Yes. > Turns off Parquet filter push-d

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005004#comment-15005004 ] Cheng Lian commented on SPARK-11153: Yes. > Turns off Parquet filter push-d

[jira] [Issue Comment Deleted] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-11-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Comment: was deleted (was: Yes.) > Turns off Parquet filter push-down for string and binary colu

[jira] [Resolved] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11678. Resolution: Fixed Fix Version/s: 1.6.0 1.7.0 Issue resolved by pull

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003578#comment-15003578 ] Cheng Lian commented on SPARK-11191: What error message/exception stacktrace did

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003549#comment-15003549 ] Cheng Lian commented on SPARK-11191: Sorry that I wasn't clear enough in my

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002385#comment-15002385 ] Cheng Lian commented on SPARK-11191: Spark SQL hasn't supported persisted

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002047#comment-15002047 ] Cheng Lian commented on SPARK-11191: This issue consists of two bugs. One of the

[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001885#comment-15001885 ] Cheng Lian commented on SPARK-5968: --- As explained in the JIRA description, this i

[jira] [Resolved] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-11-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11661. Resolution: Fixed Fix Version/s: 1.6.0 1.7.0 Issue resolved by pull

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000222#comment-15000222 ] Cheng Lian commented on SPARK-10954: Figured out the reason why {{created_by}

[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000215#comment-15000215 ] Cheng Lian commented on SPARK-9686: --- [~navis] [~bugg_tb] [~pin_zhang] May I ask

[jira] [Commented] (SPARK-10113) Support for unsigned Parquet logical types

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000203#comment-15000203 ] Cheng Lian commented on SPARK-10113: I think emitting a clear error message is

[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000200#comment-15000200 ] Cheng Lian commented on SPARK-5968: --- It had once been fixed via a quite hacky t

[jira] [Commented] (SPARK-11089) Add a option for thrift-server to share a single session across all connections

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000195#comment-15000195 ] Cheng Lian commented on SPARK-11089: OK, I'm taking this. > Add a optio

[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11500: --- Fix Version/s: 1.6.0 > Not deterministic order of columns when using merging sche

[jira] [Resolved] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-11 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11500. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9517 [https

Re: Unwanted SysOuts in Spark Parquet

2015-11-10 Thread Cheng Lian
This is because of PARQUET-369, which prevents users or other libraries from overriding Parquet's JUL logging settings via SLF4J. It has been fixed in the most recent parquet-format master (PR #32

[jira] [Updated] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11595: --- Description: When handling {{ADD JAR}}, Spark constructs a {{java.io.File}} first using the input

[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service

2015-11-09 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996519#comment-14996519 ] Cheng Lian commented on SPARK-11191: One of the problems here is SPARK-11595. How

[jira] [Created] (SPARK-11595) "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/"

2015-11-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11595: -- Summary: "ADD JAR" doesn't work if the given path contains URL scheme like "file:/" and "hdfs:/" Key: SPARK-11595 URL: https://issues

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
none of your responses are there either. I am definitely subscribed to the list though (I get daily digests). Any clue how to fix it? Sorry, no idea :-/ On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs@gmail.com> wrote: I'd expect writing Parquet files to be slower than

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files, since Parquet involves more complicated encoders, but maybe not that slow. Would you mind profiling one Spark executor using a tool like YJP to see where the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if thi

[jira] [Commented] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-05 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992834#comment-14992834 ] Cheng Lian commented on SPARK-11500: [~hyukjin.kwon] Thanks for reporting. Would

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-04 Thread Cheng Lian
Is there any chance that "spark.sql.hive.convertMetastoreParquet" is turned off? Cheng On 11/4/15 5:15 PM, Rex Xiong wrote: Thanks Cheng Lian. I found in 1.5, if I use Spark to create this table with partition discovery, the partition pruning can be performed, but for my

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-03 Thread Cheng Lian
SPARK-11153 should be irrelevant because you are filtering on a partition key while SPARK-11153 is about Parquet filter push-down and doesn't affect partition pruning. Cheng On 11/3/15 7:14 PM, Rex Xiong wrote: We found the query performance is very poor due to this issue https://issues.apac

[jira] [Updated] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10533: --- Description: In the DataFrame filter operation, when giving a float comparison with e (2.0e2), it is not

[jira] [Resolved] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10533. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9085 [https

[jira] [Updated] (SPARK-10533) DataFrame filter is not handling float/double with Scientific Notation 'e' / 'E'

2015-11-03 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10533: --- Assignee: Adrian Wang > DataFrame filter is not handling float/double with Scientific Notation

[jira] [Resolved] (SPARK-11311) spark cannot describe temporary functions

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11311. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9277 [https

[jira] [Assigned] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-10978: -- Assignee: Cheng Lian > Allow PrunedFilterScan to eliminate predicates from further evaluat

[jira] [Updated] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10786: --- Assignee: SaintBacchus > SparkSQLCLIDriver should take the whole statement to generate

[jira] [Resolved] (SPARK-10786) SparkSQLCLIDriver should take the whole statement to generate the CommandProcessor

2015-11-02 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10786. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8895 [https

[jira] [Resolved] (SPARK-11103) Parquet filters push-down may cause exception when schema merging is turned on

2015-10-30 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11103. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull

[jira] [Updated] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always list file status for all data files

2015-10-30 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-7673: -- Summary: DataSourceStrategy's buildPartitionedTableScan always list file status for all data

[jira] [Updated] (SPARK-11103) Parquet filters push-down may cause exception when schema merging is turned on

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Summary: Parquet filters push-down may cause exception when schema merging is turned on (was

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979950#comment-14979950 ] Cheng Lian commented on SPARK-11103: [~rxin] I think this should be a blocker

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Priority: Blocker (was: Major) > Filter applied on Merged Parquet shema with new column fail w

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Target Version/s: 1.5.2, 1.6.0 (was: 1.5.3, 1.6.0) > Filter applied on Merged Parquet shema w

[jira] [Commented] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979935#comment-14979935 ] Cheng Lian commented on SPARK-11376: No, {{GenerateColumnAccessor}} only exis

[jira] [Resolved] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11376. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9335 [https

[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Description: There are two {{mutableRow}} fields in the generated code within

[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Priority: Major (was: Minor) > Invalid generated Java code in GenerateColumnAcces

[jira] [Created] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11376: -- Summary: Invalid generated Java code in GenerateColumnAccessor Key: SPARK-11376 URL: https://issues.apache.org/jira/browse/SPARK-11376 Project: Spark Issue Type

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Target Version/s: 1.5.3, 1.6.0 (was: 1.6.0) > Filter applied on Merged Parquet shema with

[jira] [Comment Edited] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961 ] Cheng Lian edited comment on SPARK-11103 at 10/28/15 8:3

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103: --- Assignee: Hyukjin Kwon > Filter applied on Merged Parquet shema with new column fail w

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961 ] Cheng Lian commented on SPARK-11103: Quoted from my reply on the user list: F

[jira] [Created] (PARQUET-389) Filter predicates should work with missing columns

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created PARQUET-389: -- Summary: Filter predicates should work with missing columns Key: PARQUET-389 URL: https://issues.apache.org/jira/browse/PARQUET-389 Project: Parquet Issue Type

Re: Filter applied on merged Parquet shemsa with new column fails.

2015-10-28 Thread Cheng Lian
Hey Hyukjin, Sorry that I missed the JIRA ticket. Thanks for bringing this issue up here, and for your detailed investigation. From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet, if a column exists in the requested schema but missin

[jira] [Created] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11345: -- Summary: Make HadoopFsRelation always outputs UnsafeRow Key: SPARK-11345 URL: https://issues.apache.org/jira/browse/SPARK-11345 Project: Spark Issue Type: Bug

[jira] [Updated] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted t

2015-10-25 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10562: --- Description: When using DataFrame.write.partitionBy().saveAsTable() it creates the partition by

[jira] [Resolved] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-20 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11153. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull

Re: Projection pushdown with nested data type

2015-10-16 Thread Cheng Lian
good pointer to start with. --Mohammad On Thursday, October 15, 2015 10:04 AM, Cheng Lian wrote: At its core, Parquet definitely supports reading selected fields of nested structs, and that's actually one of the initial motivations of Parquet. However, not all upper

[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961318#comment-14961318 ] Cheng Lian commented on SPARK-6859: --- This issue was left unresolved because Par

Re: List type to parquet schema

2015-10-16 Thread Cheng Lian
One of the benefits of a 3-level LIST structure is that it's able to represent arrays with null elements, which is the case in Hive. That's why the current parquet-format spec is using it. Cheng On 10/16/15 11:13 AM, Ryan Blue wrote: Mohammad, The spec for storing lists in Parquet is here:
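
To illustrate the point about null elements made in the message above, here is a minimal Python sketch. It is not parquet-mr code: the function names and the nested-dict layout are assumptions made purely for illustration, with "list"/"element" standing in for the repeated group and optional field of the 3-level structure, and "array" standing in for the repeated field of a 2-level layout.

```python
from typing import Any, Dict, List, Optional


def encode_three_level(values: Optional[List[Optional[int]]]) -> Optional[Dict[str, Any]]:
    """Model: optional group (LIST) > repeated group 'list' > optional 'element'."""
    if values is None:
        return None  # the whole list is null
    # Each repeated 'list' entry wraps an optional 'element', so None survives.
    return {"list": [{"element": v} for v in values]}


def encode_two_level(values: List[Optional[int]]) -> Dict[str, Any]:
    """Model: the repeated field is the element itself (hypothetical name 'array')."""
    if any(v is None for v in values):
        # A repeated field has no slot that marks a single entry as null.
        raise ValueError("2-level layout cannot represent null elements")
    return {"array": list(values)}
```

In this toy model, encode_three_level([1, None, 3]) keeps the null in place, while encode_two_level([1, None, 3]) has to reject the input.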

[jira] [Commented] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961281#comment-14961281 ] Cheng Lian commented on SPARK-11153: Yes, it's the statistics informatio

[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Description: Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written with

[jira] [Updated] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11153: --- Priority: Blocker (was: Critical) > Turns off Parquet filter push-down for string and bin

[jira] [Created] (SPARK-11153) Turns off Parquet filter push-down for string and binary columns

2015-10-16 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11153: -- Summary: Turns off Parquet filter push-down for string and binary columns Key: SPARK-11153 URL: https://issues.apache.org/jira/browse/SPARK-11153 Project: Spark

[jira] [Updated] (SPARK-10895) Add pushdown string filters for Parquet

2015-10-16 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10895: --- Assignee: Liang-Chi Hsieh > Add pushdown string filters for Parq

Re: Projection pushdown with nested data type

2015-10-15 Thread Cheng Lian
At its core, Parquet definitely supports reading selected fields of nested structs, and that's actually one of the initial motivations of Parquet. However, not all upper level Parquet data models enable it. For example, parquet-avro and parquet-thrift work fine, while parquet-hive has to read

Re: requestedSchema vs fileSchema

2015-10-15 Thread Cheng Lian
Actually, the requested schema doesn't have to be a subset of the file schema. If a field in the requested schema doesn't exist in the file schema, Parquet fills that field with nulls, as long as the field is optional. Cheng On 10/14/15 6:25 PM, Alex Levenson wrote: It's always a cooperation
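
The null-filling behaviour described in the message above can be sketched in plain Python. This is only a toy model, not the Parquet reader: the schema is represented as hypothetical (name, is_optional) pairs, the keys of the record stand in for the file schema, and raising an error for a missing required field is an assumption of this sketch rather than something stated in the thread.

```python
from typing import Any, Dict, List, Tuple

# (field name, is_optional) pairs: a hypothetical stand-in for a Parquet schema.
Schema = List[Tuple[str, bool]]


def read_with_requested_schema(file_record: Dict[str, Any], requested: Schema) -> Dict[str, Any]:
    """Project a record onto the requested schema, null-filling absent optional fields."""
    result: Dict[str, Any] = {}
    for name, is_optional in requested:
        if name in file_record:
            result[name] = file_record[name]  # field exists in the file
        elif is_optional:
            result[name] = None  # absent optional field is filled with null
        else:
            # Assumption of this sketch: a missing required field is an error.
            raise KeyError(f"required field {name!r} is missing from the file")
    return result
```

For example, requesting [("id", False), ("age", True)] from a record that only has "id" and "name" yields the "id" value plus a null "age", and the unrequested "name" column is dropped.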

[jira] [Resolved] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-10-14 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10829. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8916 [https

[jira] [Created] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11117: -- Summary: PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows Key: SPARK-11117 URL: https://issues.apache.org/jira/browse/SPARK-11117

[jira] [Created] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-13 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11088: -- Summary: Optimize DataSourceStrategy.mergeWithPartitionValues Key: SPARK-11088 URL: https://issues.apache.org/jira/browse/SPARK-11088 Project: Spark Issue Type

[jira] [Resolved] (SPARK-6561) Add partition support in saveAsParquet

2015-10-13 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6561. --- Resolution: Duplicate Assignee: Cheng Lian Fix Version/s: 1.4.0 > Add partit

[jira] [Resolved] (SPARK-11018) Support UDT in codegen and unsafe projection

2015-10-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11018. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9016 [https

[jira] [Resolved] (SPARK-10990) Avoid the serialization multiple times during unrolling of complex types

2015-10-12 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10990. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9016 [https

Re: Fixed writer version as version1 for Parquet as wring a Parquet file.

2015-10-09 Thread Cheng Lian
Hi Hyukjin, Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option. Cheng On 10/8/15 11:04 PM, Hyuk

[jira] [Commented] (PARQUET-387) TwoLevelListWriter does not handle null values in array

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949647#comment-14949647 ] Cheng Lian commented on PARQUET-387: I believe setting {{parquet.avro.write-old-

[jira] [Resolved] (SPARK-6774) Implement Parquet complex types backwards-compatiblity rules

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6774. --- Resolution: Fixed Finally, fixed all the Parquet compatibility issues after 6 months! > Implem

[jira] [Resolved] (SPARK-8848) Write Parquet LISTs and MAPs conforming to Parquet format spec

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8848. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8988 [https://github.com

[jira] [Created] (SPARK-11007) Add dictionary support for CatalystDecimalConverter

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11007: -- Summary: Add dictionary support for CatalystDecimalConverter Key: SPARK-11007 URL: https://issues.apache.org/jira/browse/SPARK-11007 Project: Spark Issue Type

Re: Question about slight performance regression in 1.8.1 and 1.8.2 release schedule

2015-10-08 Thread Cheng Lian
good idea to fix a performance regression. I'd really like to find out what caused it and what fixed it, though. Is it possible for you to bisect the Parquet tree and run the test? rb On 10/06/2015 10:09 AM, Cheng Lian wrote: Could anybody help elaborate on the 1.8.2 release plan? T

[jira] [Created] (HIVE-12069) Hive cannot read Parquet decimals backed by INT32 or INT64

2015-10-08 Thread Cheng Lian (JIRA)
Cheng Lian created HIVE-12069: - Summary: Hive cannot read Parquet decimals backed by INT32 or INT64 Key: HIVE-12069 URL: https://issues.apache.org/jira/browse/HIVE-12069 Project: Hive Issue Type

[jira] [Resolved] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-08 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10999. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9024 [https

Re: Parquet file size

2015-10-08 Thread Cheng Lian
l.com<mailto:younes.nag...@streamtheworld.com>** *From:* odeach...@gmail.com [odeach...@gmail.com] on behalf of Deng Ching-Mallete [och...@apache.org] *Sent:* Wednesday, October 07, 2015 9:14 PM *To:* Younes Naguib *Cc:* Cheng

[jira] [Created] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow

2015-10-07 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10999: -- Summary: Physical plan node Coalesce should be able to handle UnsafeRow Key: SPARK-10999 URL: https://issues.apache.org/jira/browse/SPARK-10999 Project: Spark

Re: Parquet file size

2015-10-07 Thread Cheng Lian
, without month and day). Cheng So you want to dump all data into a single large Parquet file? On 10/7/15 1:55 PM, Younes Naguib wrote: The original TSV files total 600GB and generated 40k files of 15-25MB. y *From:*Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October-07-15 3:18 PM *To

Re: Parquet file size

2015-10-07 Thread Cheng Lian
Why do you want larger files? Doesn't the resulting Parquet file contain all the data in the original TSV file? Cheng On 10/7/15 11:07 AM, Younes Naguib wrote: Hi, I’m reading a large TSV file and creating Parquet files using Spark SQL: insert overwrite table tbl partition(year, month, day)..

[jira] [Updated] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-10954: --- Assignee: Gayathri Murali > Parquet version in the "created_by" metadata field of

[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946032#comment-14946032 ] Cheng Lian commented on SPARK-10954: Thanks, I'm assigning this o

[jira] [Created] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-10-06 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10954: -- Summary: Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong Key: SPARK-10954 URL: https://issues.apache.org/jira/br

Re: Question about slight performance regression in 1.8.1 and 1.8.2 release schedule

2015-10-06 Thread Cheng Lian
Could anybody help elaborate on the 1.8.2 release plan? Thanks :) Cheng On 9/30/15 2:42 PM, Cheng Lian wrote: Hey all, We (the Spark team) have been considering upgrading parquet-mr in Spark to 1.8.1 to fix PARQUET-251 <https://issues.apache.org/jira/browse/PARQUET-251>. However, my

Question about slight performance regression in 1.8.1 and 1.8.2 release schedule

2015-09-30 Thread Cheng Lian
Hey all, We (the Spark team) have been considering upgrading parquet-mr in Spark to 1.8.1 to fix PARQUET-251. However, my micro-benchmark shows that 1.8.1 seems to suffer a slight performance regression (5% ~ 10%) compared to 1.7.0 (

Re: Metadata in Parquet

2015-09-30 Thread Cheng Lian
Unfortunately, this isn't supported at the moment: https://issues.apache.org/jira/browse/SPARK-10803 Cheng On 9/30/15 10:54 AM, Philip Weaver wrote: Hi, I am using org.apache.spark.sql.types.Metadata to store extra information along with each of my fields. I'd also like to store Metadata for th

[jira] [Resolved] (SPARK-10811) Minimize array copying cost in Parquet converters

2015-09-29 Thread Cheng Lian (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10811. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8907 [https

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
430-L431>, which reads the actual Parquet footers and probably takes most of the time). Cheng On 9/28/15 6:51 PM, Cheng Lian wrote: Oh I see, then probably this one, basically the parallel Spark version of my last script, using ParquetFileReader: import org.apache.parquet.hadoop.ParquetFile

Re: Performance when iterating over many parquet files

2015-09-28 Thread Cheng Lian
g very similar this weekend. It works but is very slow. The Spark method I included in my original post is about 5-6 times faster. Just wondering if there is something even faster than that. I see this as being a recurring problem over the next few months. *From:*Cheng Lian [mailto:l
