[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-49138529 @yhuai @concretevitamin @rxin I've created another PR for this follow-up; we can discuss this more at: https://github.com/apache/spark/pull/1439 --- If your

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48930414 Thanks for reviewing this everyone. I'm all for commenting and cleaning things up here, but if possible I'd like to merge this in today. There are a couple of people

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin closed the pull request at: https://github.com/apache/spark/pull/1390

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48935494 @yhuai suggested a much simpler fix -- I benchmarked this and it gave the same performance boost. I am closing this and opening a new PR.

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
GitHub user concretevitamin opened a pull request: https://github.com/apache/spark/pull/1408 [SPARK-2443][SQL] Fix slow read from partitioned tables This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48936743 New PR here: #1408

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48936856 Jenkins, test this please.

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48937213 QA tests have started for PR 1408. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16631/consoleFull

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14894946 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread marmbrus
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14898638 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48951901 QA results for PR 1408: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48954454 Thanks! I've merged this into both master and 1.0. Are there other follow-up things we want to fix from the discussion on the other PR? Or should I consider

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1408

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48954674 I think we should ask the users who reported the performance issue if this fix solves their problems. Otherwise the comments in the previous PR seem to only

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14902569 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14902570 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48978522 I think this will work in most cases. But it may raise exceptions if the Table's Deserializer differs from the partition's Deserializer, since they may have

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48979538 @chenghao-intel Can you ping me after you create the PR or the JIRA? Thanks:)

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48832990 @yhuai can you take a look?

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14856885 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48845188 I am reviewing it. Will comment on it later today.

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859675 The code looks good to me. However, I think we can avoid the workaround solution (de-serializing (with partition serde) and then serializing (with table serde)

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859842 Also, because the Hive SerDe provides `lazy` parsing, during the conversion of `raw object` to `Row` we need to support column pruning

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48860018 @chenghao-intel I am not sure I understand your comment on column pruning. I think for a Hive table, we should use `ColumnProjectionUtils` to set needed columns. So,

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862289 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862300 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862338 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862941 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-12 Thread concretevitamin
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48830080 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48830138 QA tests have started for PR 1390. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16599/consoleFull

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48831466 QA results for PR 1390: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test