[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48832990 @yhuai can you take a look?

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14856885 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857013 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala --- @@ -21,16 +21,27 @@ import java.util.Properties import

[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857017 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala --- @@ -21,16 +21,27 @@ import java.util.Properties import

[GitHub] spark pull request: Delete the useless import

2014-07-13 Thread XuTingjun
Github user XuTingjun closed the pull request at: https://github.com/apache/spark/pull/1284

[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857026 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala --- @@ -21,16 +21,27 @@ import java.util.Properties import

[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857044 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -47,6 +47,13 @@ private[sql] abstract class SparkStrategies

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48834061 I brought this up to date. @andrewor14 can you take a look at this? I'd want to merge this quickly so I can submit my other scheduler patches too.

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48834107 QA tests have started for PR 1259. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16600/consoleFull

[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-13 Thread aarondav
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1385#issuecomment-48834426 Is this related to the other conf-related concurrency issue that was fixed recently? https://github.com/apache/spark/pull/1273

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835312 We have gone over this in the past .. it is suboptimal to make it a linear function of executor/driver memory. Overhead is a function of number of executors,

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835447 That makes sense, but then it doesn't explain why a constant amount works for a given job when executor memory is low, and then doesn't work when it is high. This has

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835560 Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB executor. Less than 400MB for a 2GB executor.

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835566 The default constant is actually a lower bound to account for other overheads (since yarn will aggressively kill tasks)... Unfortunately we have not sized this

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48835596 QA results for PR 1259: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): class

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835618 That would be a function of your jobs. Other apps would have drastically different characteristics ... Which is why we can't generalize to a simple fraction of

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835656 The basic issue is you are trying to model overhead using the wrong variable... It has no correlation with executor memory actually (other than vm overheads as heap

Re: [GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread Mridul Muralidharan
You are lucky :-) for some of our jobs, in a 8gb container, overhead is 1.8gb ! On 13-Jul-2014 2:40 pm, nishkamravi2 g...@git.apache.org wrote: Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835560 Sean, the

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835727 Yes of course, lots of settings' best or even usable values are ultimately app-specific. Ideally, defaults work for lots of cases. A flat value is the simplest of models,
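
[Editor's note: to make the two positions in this thread concrete, here is a minimal sketch of the flat model versus the proportional model being debated. The 384 MB constant and the 0.07 fraction are illustrative stand-ins, not the values proposed in PR 1391.]

```scala
// Sketch only: two candidate defaults for YARN container memory overhead (MB).
// The constant and the fraction below are illustrative, not the PR's numbers.

// Flat model: the same overhead regardless of executor size.
def flatOverheadMb(executorMemoryMb: Int, constantMb: Int = 384): Int = constantMb

// Proportional model: overhead grows with executor memory, with a floor.
def proportionalOverheadMb(executorMemoryMb: Int,
                           fraction: Double = 0.07,
                           floorMb: Int = 384): Int =
  math.max((executorMemoryMb * fraction).toInt, floorMb)

// For a 30 GB (30720 MB) executor: flat -> 384 MB, proportional -> ~2150 MB,
// roughly the ">2 GB for a 30 GB executor" scale mentioned earlier in the thread.
```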

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835769 You are lucky :-) for some of our jobs, in a 8gb container, overhead is 1.8gb ! On 13-Jul-2014 2:41 pm, nishkamravi2 notificati...@github.com wrote:

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835852 Experimented with three different workloads and noticed common patterns of proportionality. Other parameters were left unchanged and only executor size was

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835881 That's why the parameter is configurable. If you have jobs that cause 20-25% memory_overhead, default values will not help.

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836123 You are missing my point I think ... To give an unscientific anecdotal example: our gbdt experiments, which run on about 22 nodes, need no tuning. While our

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836220 Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836408 On Jul 13, 2014 3:16 PM, nishkamravi2 notificati...@github.com wrote: Mridul, I think you are missing the point. We understand that this parameter will in a

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836619 Hmm, looks like some of my responses to Sean via mail reply have not shown up here ... Maybe mail gateway delays ?

[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836879 Since this is a recurring nightmare for our users, let me try to list down the factors which influence overhead given current spark codebase state in the jira when

[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread ScrapCodes
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48837932 @rxin done !

[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48837923 QA tests have started for PR 1354. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16601/consoleFull

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1392 [SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor Worker should directly use its own sparkHome instead of appDesc.sparkHome when

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839494 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839557 #1244

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839668 fix #1244

[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48839833 QA results for PR 1354: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1244#issuecomment-48839912 I've fixed the compile problem. Please review and test again. Thanks very much.

[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1281#issuecomment-48840373 Hi @ash211, I think this change is needed. Since the method Utils.getLocalDir is used by some function such as HttpBroadcast, which is different from

[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1281#issuecomment-48840378 Hi @ash211, I think this change is needed. Since the method Utils.getLocalDir is used by some function such as HttpBroadcast, which is different from

[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1281#issuecomment-48840401 Hi @ash211, I think this change is needed. Since the method Utils.getLocalDir is used by some function such as HttpBroadcast, which is different from

[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48840541 QA tests have started for PR 1313. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16602/consoleFull

[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread witgo
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48841019 Currently, `SparkContext.cleaner` does not take executor memory usage into account. This can cause Spark to fail when memory runs short.

[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread witgo
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48841151 @srowen [Executor.scala#L253](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L253) handles exceptions. But the

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread srowen
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1393 SPARK-2465. Use long as user / item ID for ALS I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason for this is

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48842883 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull

[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48843020 QA results for PR 1313: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread witgo
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48843123 How much does this increase memory usage overall? Is there a detailed comparison?

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48843270 I think the most significant change is the Rating object. It goes from 8 (ref) + 8 (object) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object) + 4
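
[Editor's note: a sketch of the change being sized up here. `Rating` with Int IDs is MLlib's existing case class; the Long-ID variant's name and the byte counts are back-of-envelope 64-bit JVM illustrations, not measurements or the PR's code.]

```scala
// Existing MLlib shape: Int user/product IDs.
case class Rating(user: Int, product: Int, rating: Double)
// Rough 64-bit JVM estimate: 8 (reference) + 8 (object header)
//   + 4 (Int) + 4 (Int) + 8 (Double) = 32 bytes per rating.

// The proposal floated in this PR: Long IDs (this name is hypothetical).
case class LongIdRating(user: Long, product: Long, rating: Double)
// Rough estimate: 8 (reference) + 8 (object header)
//   + 8 (Long) + 8 (Long) + 8 (Double) = 40 bytes per rating,
// i.e. about a 25% increase per Rating object before alignment/padding.
```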

[GitHub] spark pull request: [SPARK-2253] Aggregator: Disable partial aggre...

2014-07-13 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1191#issuecomment-48845073 A few comments on this: - We probably can't break the existing combineByKey through a config setting. If people want to use this directly, they'll need to use another

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48845188 I am reviewing it. Will comment it later today.

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48845420 QA results for PR 1393: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): case class

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48846073 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull

[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-07-13 Thread srowen
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/906

[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-07-13 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-48846227 Obsoleted by SBT build changes.

[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread CodingCat
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48846392 @mridulm I updated the patch; now the order is PROCESS_LOCAL -> NODE_LOCAL -> noPref / Speculative -> RACK_LOCAL -> NON_LOCAL
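
[Editor's note: a minimal sketch of how an ordered Enumeration can encode the preference order described above, in the spirit of Spark's TaskLocality. The level names follow the comment and may not match the enum values in the actual patch.]

```scala
// Sketch only: scheduling preference encoded by declaration order, most
// preferred first. Spark's real enum is TaskLocality; the names here mirror
// the comment above and are not guaranteed to match the patch.
object LocalityLevel extends Enumeration {
  val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, NON_LOCAL = Value

  // An offer at level `offered` is acceptable if it is at least as specific
  // as the loosest level the scheduler currently allows (`maxAllowed`).
  def acceptable(maxAllowed: Value, offered: Value): Boolean = offered <= maxAllowed
}

// LocalityLevel.acceptable(LocalityLevel.NODE_LOCAL, LocalityLevel.PROCESS_LOCAL) // true
// LocalityLevel.acceptable(LocalityLevel.NODE_LOCAL, LocalityLevel.RACK_LOCAL)    // false
```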

[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread srowen
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1394 SPARK-2363. Clean MLlib's sample data files (Just made a PR for this, @mengxr was the reporter of:) MLlib has sample data under several folders: 1) data/mllib 2) data/ 3)

[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48846547 QA tests have started for PR 1394. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16605/consoleFull

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859597 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -490,19 +488,19 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859634 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -429,9 +425,11 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859639 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -521,14 +519,13 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859645 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -429,9 +425,11 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859653 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -570,19 +561,17 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859650 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- @@ -536,23 +533,17 @@ private[spark] class TaskSetManager(

[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1360#issuecomment-48848235 QA tests have started for PR 1360. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48848292 Hi @rxin, I took a pass over the patch and the changes mostly look good. On a higher level point, I notice that we log this pattern `0.0:4.0 (TID 4 ...)` quite often,

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48848503 QA results for PR 1393: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): case class

[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48849034 Hi @CodingCat looks good to me. My only doubt, which we discussed last, was whether we want to differentiate between tasks which have no locations at all vs tasks

[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48849070 QA results for PR 1394: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1360#issuecomment-48850708 QA results for PR 1360: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48850704 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull

[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-13 Thread falaki
Github user falaki commented on the pull request: https://github.com/apache/spark/pull/1351#issuecomment-48850882 This is not a bad idea, especially considering that a file can be split across partitions. @marmbrus you suggested this feature. What do you think about Reynold's

[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-13 Thread staple
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1395 [SPARK-546] Add full outer join to RDD and DStream. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-546 Alternatively you
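
[Editor's note: for readers unfamiliar with the operation being added, a full outer join keeps every key that appears on either side. The sketch below expresses those semantics via cogroup; it is an illustration of the idea, not the PR's implementation, and the method name fullOuterJoinSketch is hypothetical.]

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Illustrative semantics: every key from either side appears in the result,
// with None on the side that has no value for that key, and the usual
// cross-product of values when both sides match.
def fullOuterJoinSketch[K: ClassTag, V: ClassTag, W: ClassTag](
    left: RDD[(K, V)], right: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))] = {
  left.cogroup(right).flatMap { case (k, (vs, ws)) =>
    if (vs.isEmpty) ws.iterator.map(w => (k, (None, Some(w))))
    else if (ws.isEmpty) vs.iterator.map(v => (k, (Some(v), None)))
    else for (v <- vs.iterator; w <- ws.iterator) yield (k, (Some(v), Some(w)))
  }
}
// e.g. left = {("a",1)}, right = {("a",true), ("b",false)} yields
//      ("a",(Some(1),Some(true))) and ("b",(None,Some(false))).
```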

[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1395#issuecomment-48851025 Can one of the admins verify this patch?

[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-13 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1351#issuecomment-48851059 Note that there are multiple problems. We can solve the problem of out of memory by simply limiting the length of a record. Ideally, csvRDD(RDD[String]) should just be one

[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread marmbrus
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/1396 [SQL] Whitelist more Hive tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark moreTests Alternatively you can review and

[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1396#issuecomment-48851429 QA tests have started for PR 1396. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16608/consoleFull

[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48852700 QA results for PR 1393: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): case class

[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1396#issuecomment-48853489 QA results for PR 1396: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread andrewor14
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48853888 Jenkins, test this please

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48854002 QA tests have started for PR 1392. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16609/consoleFull

[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48855862 QA results for PR 1392: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread miccagiann
Github user miccagiann commented on the pull request: https://github.com/apache/spark/pull/1311#issuecomment-48855958 Hello guys, I have provided Java examples for the following documentation files: mllib-clustering.md mllib-collaborative-filtering.md

[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48857036 test this please

[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48857093 QA tests have started for PR 1377. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16610/consoleFull

[GitHub] spark pull request: [SPARK-2125] Add sort flag and move sort into ...

2014-07-13 Thread jerryshao
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/1210#issuecomment-48859519 Hi Matei, thanks a lot for your review, I will change the code according to your comments.

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859675 The code looks good to me. However, I think we can avoid the work around solution (de-serializing (with partition serde) and then serialize (with table serde)

[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread aarondav
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48859674 If we actually want people to get information out of all those numbers, can we consider using a human readable format such as `Task(stageId = 1, taskId = 5, attempt =
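
[Editor's note: a toy illustration of the two log formats being contrasted in these comments. Both the field set and the compact layout are approximations of the quoted examples, not Spark's actual logging code.]

```scala
// Hypothetical helpers contrasting the two styles discussed above.
case class TaskRef(stageId: Int, stageAttempt: Int, index: Int, attempt: Int, tid: Long)

// Compact style, roughly "0.0:4.0 (TID 4)"
def compact(t: TaskRef): String =
  s"${t.stageId}.${t.stageAttempt}:${t.index}.${t.attempt} (TID ${t.tid})"

// Self-describing style, roughly "Task(stageId = 0, taskId = 4, attempt = 0)"
def verbose(t: TaskRef): String =
  s"Task(stageId = ${t.stageId}, taskId = ${t.tid}, attempt = ${t.attempt})"
```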

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859842 And since the Hive SerDe actually provides `lazy` parsing, during the conversion of the `raw object` to a `Row` we need to support the column pruning

[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48859861 QA tests have started for PR 1387. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16611/consoleFull

[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread lirui-intel
Github user lirui-intel commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48859854 This looks good to me :) Just a reminder that when TaskSchedulerImpl calls TaskSetManager.resourceOffer, the maxLocality (changed to preferredLocality in this

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48860018 @chenghao-intel I am not sure I understand your comment on column pruning. I think for a Hive table, we should use `ColumnProjectionUtils` to set needed columns. So,
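
[Editor's note: for context on the column-pruning point, Hive's `ColumnProjectionUtils` works by recording the needed column ids/names in the Hadoop job conf so that column-aware SerDes and readers can skip the rest. The sketch below sets the conventional Hive property keys directly; the helper name is hypothetical and the keys should be checked against the Hive version actually bundled with Spark SQL.]

```scala
import org.apache.hadoop.conf.Configuration

// Sketch of what "use ColumnProjectionUtils to set needed columns" amounts to:
// the Hive helper writes these conf properties, and column-oriented readers
// consult them to avoid materializing unneeded columns.
def pruneColumns(conf: Configuration, columnIds: Seq[Int], columnNames: Seq[String]): Unit = {
  conf.set("hive.io.file.readcolumn.ids", columnIds.mkString(","))
  conf.set("hive.io.file.readcolumn.names", columnNames.mkString(","))
}
```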

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862289 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862300 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862338 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc:

[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48860407 @srowen This looks good to me and thank you for updating the docs as well!

[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1394

[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48860618 QA results for PR 1377: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test

[GitHub] spark pull request: remove not used test in src/main

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1397#issuecomment-48861087 QA tests have started for PR 1397. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16612/consoleFull

[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-13 Thread scwf
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/1385#issuecomment-48861711 @rxin and @aarondav, yeah, the master branch deadlocks; it seems the locks of #1273 and Hadoop-10456 lead to the problem. When running a HiveQL self-join sql --- hql(SELECT t1.a,
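
[Editor's note: the report above describes a classic lock-ordering deadlock shape: two locks, which the comment attributes to #1273 and HADOOP-10456, acquired in opposite orders by different threads. The sketch below shows only that general shape; the lock objects are placeholders, not Spark's or Hadoop's actual fields.]

```scala
// Generic shape of a lock-ordering deadlock: thread 1 runs pathOne and holds
// lockA while waiting for lockB; thread 2 runs pathTwo and holds lockB while
// waiting for lockA. Neither can proceed.
object DeadlockShape {
  private val lockA = new Object
  private val lockB = new Object

  def pathOne(): Unit = lockA.synchronized { lockB.synchronized { /* work */ } }
  def pathTwo(): Unit = lockB.synchronized { lockA.synchronized { /* work */ } }
}
```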

[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1311#discussion_r14862906 --- Diff: docs/mllib-clustering.md --- @@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE) All of MLlib's methods use Java-friendly

[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1311#discussion_r14862907 --- Diff: docs/mllib-clustering.md --- @@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE) All of MLlib's methods use Java-friendly

[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1311#discussion_r14862911 --- Diff: docs/mllib-collaborative-filtering.md --- @@ -99,7 +99,88 @@ val model = ALS.trainImplicit(ratings, rank, numIterations, alpha) All of MLlib's
