[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Thank YOU for your PR and open discussion on this, @seancxmao . Let's see in another PRs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Sure, close this PR. Thank you all for your time and insights. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Could you close this PR and JIRA, @seancxmao ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 I agree that correctness is more important. If we should not make behaviors consistent when do the convertion, I will close this PR. @cloud-fan @gatorsmile what do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Compatibility is not a gold rule if it sacrifices correctness. Fast and **wrong** result doesn't looks like benefits to me. Do you think the customer want to get a wrong result like Hive? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 It keeps Hive compatibility but loses performance benefit by setting spark.sql.hive.convertMetastoreParquet=false. We can do better by enabling the conversion and still keeping Hive compatibility. Though this makes our implementation more complex, I guess most end users may keep `spark.sql.hive.convertMetastoreParquet=true` and `spark.sql.caseSensitive=false` which are default values, this brings benefits to end users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 @seancxmao . For Hive compatibility, `spark.sql.hive.convertMetastoreParquet=false` looks enough to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Could we see this as a behavior change? We can add a legacy conf (e.g. `spark.sql.hive.legacy.convertMetastoreParquet`, may be defined in HiveUtils) to enable users to revert back to the previous behavior for backward compatibility. If this legacy conf is set to true, behaviors will be reverted both in case-sensitive and case-insensitive mode. |caseSensitive|legacy behavior|new behavior| |-|-|-| |true|convert anyway|skip conversion, log warning message| |false|convert, fail if there's ambiguity|convert, first match if there's ambiguity| --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Thank you for the pointer, @seancxmao . And thank you for clarification, @cloud-fan . It looks like we are re-creating correctness issue somewhat in this PR when `caseSensitive=true`. **BEFORE THIS PR (master)** ```scala scala> sql("INSERT OVERWRITE DIRECTORY '/tmp/hive_t' STORED AS PARQUET SELECT 'A', 'a'") scala> sql("CREATE TABLE hive_t(a STRING) STORED AS PARQUET LOCATION '/tmp/hive_t'") scala> sql("CREATE TABLE spark_t(a STRING) USING PARQUET LOCATION '/tmp/hive_t'") scala> sql("set spark.sql.caseSensitive=true") scala> spark.table("hive_t").show +---+ | a| +---+ | a| +---+ scala> spark.table("spark_t").show +---+ | a| +---+ | a| +---+ ``` **AFTER THIS PR** ```scala scala> sql("set spark.sql.caseSensitive=true") scala> spark.table("hive_t").show +---+ | a| +---+ | A| +---+ scala> spark.table("spark_t").show +---+ | a| +---+ | a| +---+ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22343 To clarify: this is just a workaround when we hit a problematic(having case-insensitive duplicated filed names in the parquet file) hive parquet tables and we want to read it with the native parquet reader. The hive behaivor is weird but we need to follow it as we are reading a hive table. Personally I think it's not a big deal. If the hive table is malformed, I think we don't have to follow hive's bugy behavior. If people are confused by this patch and think this doesn't worth, I'm ok to just leave it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @dongjoon-hyun It is a little complicated. There has been a discussion about this in #22184. Below are some key comments from @cloud-fan and @gatorsmile, just FYI. * https://github.com/apache/spark/pull/22184#discussion_r212834477 * https://github.com/apache/spark/pull/22184#issuecomment-416885728 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 What I asked was the following, wasn't it? > In case-insensitive mode, when converting hive parquet table to parquet data source, we switch the duplicated fields resolution mode to ask parquet data source to pick the first matched field - the same behavior as hive parquet table - to keep behaviors consistent. Spark should not pick up the first matched field in any cases because it's considered as a correctness behavior in previous PR which is backported to `branch-2.3` https://github.com/apache/spark/pull/22183. I don't think we need to follow incorrect Hive behavior. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Hi, @dongjoon-hyun When we find duplicated field names in the case of convertMetastoreXXX, we have 2 options (1) raise exception as parquet data source. To most of end users, they do not know the difference between hive parquet table and parquet data source. If the conversion leads to different behaviors, they may be confused, and in some cases even lead to tricky data issues silently. (2) Adjust behaviors of parquet data source to keep behaviors consistent. This seems more friendly to end users, and avoid any potential issues introduced by the conversion. BTW, for parquet data source which is not converted from hive parquet table, we raise exception when there is ambiguity, sine this is more intuitive and reasonable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22343 Hi, @seancxmao . Should we be consistent? IIRC, all the previous PR raises Exception to prevent any potential issues. In this case, I have a feeling that `convertMetastoreXXX` should be used to prevent the problem of Hive behavior by raising Exception, not hiding the problem of Hive behavior. > In case-insensitive mode, when converting hive parquet table to parquet data source, we switch the duplicated fields resolution mode to ask parquet data source to pick the first matched field - the same behavior as hive parquet table - to keep behaviors consistent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22343 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95864/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22343 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22343 **[Test build #95864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)** for PR 22343 at commit [`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22343 **[Test build #95864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95864/testReport)** for PR 22343 at commit [`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22343 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22343 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22343 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95857/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22343 **[Test build #95857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95857/testReport)** for PR 22343 at commit [`95673cd`](https://github.com/apache/spark/commit/95673cdacb1da65d7ac45c212a365e167ae0d713). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and try to use a more complete and clear title for this PR. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org