[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48832990 @yhuai can you take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14856885
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     // Create local references so that the outer object isn't serialized.
     val tableDesc = _tableDesc
+    val tableSerDeClass = tableDesc.getDeserializerClass
+    val broadcastedHiveConf = _broadcastedHiveConf
     val localDeserializer = partDeserializer
     val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-    hivePartitionRDD.mapPartitions { iter =>
+    hivePartitionRDD.mapPartitions { case iter =>
--- End diff --
Is the pattern matching here necessary?
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857013
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 import scala.collection.JavaConverters._
+private object SQLConf {
+  @transient protected[spark] val confSettings = java.util.Collections.synchronizedMap(
--- End diff --
Why singleton?
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857017
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 import scala.collection.JavaConverters._
+private object SQLConf {
+  @transient protected[spark] val confSettings = java.util.Collections.synchronizedMap(
--- End diff --
In one of the earlier commits, I tried making an operator mix in SQLConf, which necessitated a singleton object to hold the settings. Now that operator just takes a SQLContext in order to get access to the conf, so if you want I could remove this singleton.
[GitHub] spark pull request: Delete the useless import
Github user XuTingjun closed the pull request at: https://github.com/apache/spark/pull/1284
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857026
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 import scala.collection.JavaConverters._
+private object SQLConf {
+  @transient protected[spark] val confSettings = java.util.Collections.synchronizedMap(
--- End diff --
Let's remove singletons.
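For readers following the singleton discussion, here is a minimal sketch of the alternative described above, where an operator receives a context and reads settings through it instead of through a process-wide object. It is written in Python purely for illustration and is not the code in the pull request: `SQLConfSketch`, `SQLContextSketch`, `JoinOperatorSketch`, and the `"0"` fallback are all hypothetical names and values. Only the setting key is taken from the thread.

```python
class SQLConfSketch:
    """Per-context settings store (hypothetical stand-in for SQLConf)."""

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value

    def get(self, key, default=None):
        return self._settings.get(key, default)


class SQLContextSketch:
    """Each context owns its own conf, so nothing is shared globally."""

    def __init__(self):
        self.conf = SQLConfSketch()


class JoinOperatorSketch:
    """Operator that is handed a context rather than reading a singleton."""

    def __init__(self, sql_context):
        self._sql_context = sql_context

    def broadcast_threshold(self):
        # The "0" fallback is an assumption made for this sketch only.
        raw = self._sql_context.conf.get("spark.sql.auto.convert.join.size", "0")
        return int(raw)


ctx_a = SQLContextSketch()
ctx_b = SQLContextSketch()
ctx_a.conf.set("spark.sql.auto.convert.join.size", "10000")
```

With this shape, a setting changed on `ctx_a` is not visible through `ctx_b`, which is the isolation a shared singleton map would not give.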
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857044
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -47,6 +47,13 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
   /**
    * Uses the ExtractEquiJoinKeys pattern to find joins where at least some of the predicates can be
    * evaluated by matching hash keys.
+   *
+   * This strategy applies a simple optimization based on the estimates of the physical sizes of
+   * the two join sides. When planning a [[execution.BroadcastHashJoin]], if one side has an
+   * estimated physical size smaller than the user-settable threshold
+   * `spark.sql.auto.convert.join.size`, the planner would mark it as the ''build'' relation and
--- End diff --
Document more explicitly that a table will get broadcast (instead of just saying build; you can still build in a shuffle).
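The size-based rule in the doc comment above can be sketched roughly as follows. This is an illustrative Python sketch, not the planner code in the pull request: `Relation`, `choose_broadcast_side`, and the tie-breaking rule (prefer the smaller side when both fall under the threshold) are assumptions introduced here. Only the idea of comparing an estimated size against a user-settable threshold comes from the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    """Hypothetical join side with an estimated physical size."""
    name: str
    estimated_size_bytes: int


def choose_broadcast_side(left: Relation, right: Relation,
                          threshold_bytes: int) -> Optional[Relation]:
    """Return the side to broadcast, or None if neither side qualifies.

    A side qualifies when its estimated size is below the threshold.
    Picking the smaller of two qualifying sides is an assumption of
    this sketch, not something stated in the thread.
    """
    candidates = [r for r in (left, right)
                  if r.estimated_size_bytes < threshold_bytes]
    if not candidates:
        return None  # caller would fall back to a shuffle-based join
    return min(candidates, key=lambda r: r.estimated_size_bytes)
```

Returning `None` makes the fallback explicit, which mirrors the review comment: a build side exists in a shuffle join too, so the documentation should say when a table is actually broadcast.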
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48834061
I brought this up to date. @andrewor14 can you take a look at this? I'd want to merge this quickly so I can submit my other scheduler patches too.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48834107
QA tests have started for PR 1259. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16600/consoleFull
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1385#issuecomment-48834426
Is this related to the other conf-related concurrency issue that was fixed recently? https://github.com/apache/spark/pull/1273
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835312
We have gone over this in the past. It is suboptimal to make it a linear function of executor/driver memory. Overhead is a function of the number of executors, number of open files, shuffle VM pressure, etc. It is NOT a function of executor memory, which is why it is separately configured.
On 13-Jul-2014 11:16 am, UCB AMPLab notificati...@github.com wrote:
> Can one of the admins verify this patch?
> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48832590.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835447
That makes sense, but then it doesn't explain why a constant amount works for a given job when executor memory is low, and then doesn't work when it is high. This has also been my experience and I don't have a great grasp on why it would be. More threads and open files in a busy executor? It goes indirectly with how big you need your executor to be, but not directly. Nishkam, do you have a sense of how much extra memory you had to configure to get it to work when executor memory increased? Is it pretty marginal, or quite substantial?
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835560
Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB executor. Less than 400MB for a 2GB executor.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835566
The default constant is actually a lower bound to account for other overheads (since YARN will aggressively kill tasks)... Unfortunately we have not sized this properly, and don't have a good recommendation on how to set it. This is compounded by magic constants in Spark for various IO ops, non-deterministic network behaviour (we should be able to estimate upper bound here = 2x number of workers), VM memory use (shuffle output is mmap'ed whole ... running afoul of YARN virtual mem limits) and so on. Hence sizing this is, unfortunately, app specific.
On 13-Jul-2014 2:34 pm, Sean Owen notificati...@github.com wrote:
> That makes sense, but then it doesn't explain why a constant amount works for a given job when executor memory is low, and then doesn't work when it is high. This has also been my experience and I don't have a great grasp on why it would be. More threads and open files in a busy executor? It goes indirectly with how big you need your executor to be, but not directly. Nishkam do you have a sense of how much extra memory you had to configure to get it to work when executor memory increased? is it pretty marginal, or quite substantial?
> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48835447.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48835596
QA results for PR 1259:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class TaskRunner(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16600/consoleFull
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835618
That would be a function of your jobs. Other apps would have drastically different characteristics, which is why we can't generalize to a simple fraction of executor memory. It actually buys us nothing in the general case: jobs will continue to fail when it is incorrect, while wasting a lot of memory.
On 13-Jul-2014 2:38 pm, nishkamravi2 notificati...@github.com wrote:
> Yes, I'm aware of the discussion on this issue in the past. Experiments confirm that overhead is a function of executor memory. Why and how can be figured out with due diligence and analysis. It may be a function of other parameters and the function may be fairly complex. However, the proportionality is undeniable. Besides, we are only adjusting the default value and making it a bit more resilient. The memory_overhead parameter can still be configured by the developer separately. The constant additive factor makes little sense (empirically).
> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48835500.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835656
The basic issue is that you are trying to model overhead using the wrong variable... It actually has no correlation with executor memory (other than VM overheads as the heap increases).
On 13-Jul-2014 2:44 pm, Mridul Muralidharan mri...@gmail.com wrote:
> That would be a function of your jobs. Other apps would have drastically different characteristics ... Which is why we can't generalize to a simple fraction of executor memory. It actually buys us nothing in general case ... Jobs will continue to fail when it is incorrect : while wasting a lot of memory
> On 13-Jul-2014 2:38 pm, nishkamravi2 notificati...@github.com wrote:
>> Yes, I'm aware of the discussion on this issue in the past. Experiments confirm that overhead is a function of executor memory. Why and how can be figured out with due diligence and analysis. It may be a function of other parameters and the function may be fairly complex. However, the proportionality is undeniable. Besides, we are only adjusting the default value and making it a bit more resilient. The memory_overhead parameter can still be configured by the developer separately. The constant additive factor makes little sense (empirically).
>> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48835500.
Re: [GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
You are lucky :-) For some of our jobs, in an 8gb container, overhead is 1.8gb!
On 13-Jul-2014 2:40 pm, nishkamravi2 g...@git.apache.org wrote:
> Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835560
> Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB executor. Less than 400MB for a 2GB executor.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835727
Yes of course, lots of settings' best or even usable values are ultimately app-specific. Ideally, defaults work for lots of cases. A flat value is the simplest of models, and anecdotally, the current default value does not work in medium- to large-memory YARN jobs. You can increase the default, but then the overhead gets silly for small jobs -- 1GB? And all of these are not-uncommon use cases. None of that implies the overhead logically scales with container memory. Empirically, it may do, and that's useful. Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases? That said, it is kind of a developer API change and feels like something to not keep reimagining. Nishkam, can you share any anecdotal evidence about how the overhead changes? If executor memory is the only variable changing, that seems to be evidence against it being driven by other factors, but I don't know if that's what we know.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835769
You are lucky :-) For some of our jobs, in an 8gb container, overhead is 1.8gb!
On 13-Jul-2014 2:41 pm, nishkamravi2 notificati...@github.com wrote:
> Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB executor. Less than 400MB for a 2GB executor.
> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48835560.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835852
Experimented with three different workloads and noticed common patterns of proportionality. Other parameters were left unchanged and only executor size was increased. The memory overhead ranges between 0.05 and 0.08 * executor_memory size.
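To make the two candidate defaults in this thread concrete, here is a small Python sketch comparing a flat constant with a proportional model that keeps the constant as a floor. It is an illustration only, not the change in the pull request: the function names, the 384 MB constant, and the 0.07 fraction are assumptions (0.07 was simply picked from inside the 0.05 to 0.08 range reported above).

```python
def flat_overhead_mb(executor_memory_mb: int, constant_mb: int = 384) -> int:
    """Flat default: the same overhead regardless of executor size.

    The 384 MB value is an illustrative placeholder, not from the thread.
    """
    return constant_mb


def proportional_overhead_mb(executor_memory_mb: int,
                             fraction: float = 0.07,
                             floor_mb: int = 384) -> int:
    """Proportional default with a lower bound.

    fraction is an assumed value inside the reported 0.05 to 0.08 range;
    floor_mb keeps small executors from dropping below the flat constant.
    """
    return max(floor_mb, int(executor_memory_mb * fraction))


# Compare the two models for a 2 GB and a 30 GB executor.
for memory_mb in (2 * 1024, 30 * 1024):
    print(memory_mb,
          flat_overhead_mb(memory_mb),
          proportional_overhead_mb(memory_mb))
```

Under these assumed values, the proportional model stays at the floor for the 2 GB executor and rises above 2 GB of overhead for the 30 GB executor, which is the shape of the numbers reported earlier in the thread. Neither model captures the app-specific cases raised by mridulm, so the setting would still need to be configurable.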
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48835881
That's why the parameter is configurable. If you have jobs that cause 20-25% memory_overhead, default values will not help.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836123
You are missing my point, I think... To give an unscientific anecdotal example: our GBDT experiments, which run on about 22 nodes, need no tuning, while our collaborative filtering experiments, running on 300 nodes, require much higher overhead. But QR factorization on the same 300 nodes needs much lower overhead. The values are all over the place and very app specific. In an effort to ensure jobs always run to completion, setting overhead to a high fraction of executor memory might ensure successful completion, but at high performance loss and substandard scaling. I would like a good default estimate of overhead ... But that is not a fraction of executor memory. Instead of trying to model the overhead using executor memory, it would be better to look at the actual parameters which influence it (as in, look at the code and figure it out, followed by validation and tuning of course) and use that as the estimate.
On 13-Jul-2014 2:58 pm, nishkamravi2 notificati...@github.com wrote:
> That's why the parameter is configurable. If you have jobs that cause 20-25% memory_overhead, default values will not help.
> -- Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1391#issuecomment-48835881.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user nishkamravi2 commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836220
Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy way to model it (that's why we are retaining it as a configurable parameter). However, the question is what a good default value would be.
> I would like a good default estimate of overhead ... But that is not a fraction of executor memory.
You are mistaken. It may not be a directly correlated variable, but it is most certainly indirectly correlated. And it is probably correlated to other app-specific parameters as well.
> Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?
This is the right point of view.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836408
On Jul 13, 2014 3:16 PM, nishkamravi2 notificati...@github.com wrote:
> Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy way to model it (that's why we are retaining it as a configurable parameter). However, the question is what a good default value would be.
It does not help to estimate using the wrong variable. Any correlations which exist are incidental and app specific, as I elaborated before. The only actual correlation between executor memory and overhead is Java VM overheads in managing very large heaps (and that is very high as a fraction). Other factors in Spark have far higher impact than this.
>> I would like a good default estimate of overhead ... But that is not a fraction of executor memory.
> You are mistaken. It may not be a directly correlated variable, but it is most certainly indirectly correlated. And it is probably correlated to other app-specific parameters as well.
Please see above.
>> Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?
> This is the right point of view.
Which has been our view even in previous discussions :-) It is unfortunate that we did not approximate this better from the start and went with the constant from the prototypical impl. Note that this estimation would be very volatile to Spark internals.
> -- Reply to this email directly or view it on GitHub.
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836619
Hmm, looks like some of my responses to Sean via mail reply have not shown up here ... Maybe mail gateway delays?
[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1391#issuecomment-48836879
Since this is a recurring nightmare for our users, let me try to list in the JIRA the factors which influence overhead, given the current state of the Spark codebase, when I am back at my desk ... We can add to that and model from there (I won't be able to lead the effort though, unfortunately, so it would be great if you or Sean can). If it so happens that at the end of the exercise it is a linear function of memory, I am fine with it, as long as we decide based on actual data :-)
On 13-Jul-2014 3:26 pm, Mridul Muralidharan mri...@gmail.com wrote:
> On Jul 13, 2014 3:16 PM, nishkamravi2 notificati...@github.com wrote:
>> Mridul, I think you are missing the point. We understand that this parameter will in a lot of cases have to be specified by the developer, since there is no easy way to model it (that's why we are retaining it as a configurable parameter). However, the question is what a good default value would be.
> It does not help to estimate using the wrong variable. Any correlations which exist are incidental and app specific, as I elaborated before. The only actual correlation between executor memory and overhead is Java VM overheads in managing very large heaps (and that is very high as a fraction). Other factors in Spark have far higher impact than this.
>>> I would like a good default estimate of overhead ... But that is not a fraction of executor memory.
>> You are mistaken. It may not be a directly correlated variable, but it is most certainly indirectly correlated. And it is probably correlated to other app-specific parameters as well.
> Please see above.
>>> Until the magic explanatory variable is found, which one is less problematic for end users -- a flat constant that frequently has to be tuned, or an imperfect model that could get it right in more cases?
>> This is the right point of view.
> Which has been our view even in previous discussions :-) It is unfortunate that we did not approximate this better from the start and went with the constant from the prototypical impl. Note that this estimation would be very volatile to Spark internals.
>> -- Reply to this email directly or view it on GitHub.
[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48837932
@rxin done!
[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48837923
QA tests have started for PR 1354. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16601/consoleFull
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
GitHub user YanTangZhai opened a pull request: https://github.com/apache/spark/pull/1392
[SPARK-2290] Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor
Worker should directly use its own sparkHome instead of appDesc.sparkHome when LaunchExecutor
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/YanTangZhai/spark SPARK-2290
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/1392.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #1392
commit d3072fc05c7c20ec9d90732db2b9b26a4d27e290
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:50:14Z
    Update ApplicationDescription.scala
commit 78ec6bc8c5d1af64ca21e1a231b47911df6d4f90
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:52:34Z
    Update JsonProtocol.scala
commit 95e6ccc354167117430ce4cb7b2f5063a454ff1d
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:54:55Z
    Update TestClient.scala
commit 508dcb65d04e3f12f99e03572a1cc277e7f1aeca
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T11:58:01Z
    Update SparkDeploySchedulerBackend.scala
commit 6d6700aaad941779485eee2c35c4ab0cd278529e
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:01:40Z
    Update Worker.scala
commit c360154ae5b03e7854d63573494fc6113295a7ec
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:04:16Z
    Update JsonProtocolSuite.scala
commit 6febb215fb73735760fae957a4e71e2a61c17c77
Author: YanTangZhai tyz0...@163.com
Date: 2014-07-13T12:07:35Z
    Update ExecutorRunnerTest.scala
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839494 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839557 #1244
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48839668 fix #1244
[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1354#issuecomment-48839833
QA results for PR 1354:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16601/consoleFull
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1244#issuecomment-48839912 I've fixed the compile problem. Please review and test again. Thanks very much.
[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...
Github user YanTangZhai commented on the pull request: https://github.com/apache/spark/pull/1281#issuecomment-48840373 Hi @ash211, I think this change is needed, since the method Utils.getLocalDir is used by other code such as HttpBroadcast, which is different from DiskBlockManager. The two problems are different. Even though #1274 has been merged, the problem still exists. Please review again. Thanks.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48840541 QA tests have started for PR 1313. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16602/consoleFull
[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48841019 Currently, `SparkContext.cleaner` does not take executor memory usage into account. This can cause Spark to fail when memory runs short.
[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48841151 @srowen [Executor.scala#L253](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L253) handles exceptions, but the out-of-memory case does not seem to be handled correctly.
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1393
SPARK-2465. Use long as user / item ID for ALS
I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions. It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers. On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating. Thoughts?
You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-2465
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1393.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1393
commit d4082ad7aa7468b63605d141f6d68e278983678a Author: Sean Owen so...@cloudera.com Date: 2014-07-13T14:57:15Z Use long instead of int for user/product IDs
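The collision claim in the description above can be checked with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)). The sketch below is only an illustration: it assumes the hash is uniformly distributed, and the object and method names are made up for this note and are not part of the pull request.

```scala
object HashCollisionSketch {
  /** Approximate probability that at least two of n distinct keys collide
    * when hashed uniformly into 2^bits values (birthday bound). */
  def collisionProbability(n: Long, bits: Int): Double = {
    val space = math.pow(2.0, bits)
    val exponent = -(n.toDouble * (n - 1).toDouble) / (2.0 * space)
    // 1 - e^exponent, computed via expm1 so tiny probabilities do not round to 0
    -java.lang.Math.expm1(exponent)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical population sizes, chosen only to show the trend.
    for (n <- Seq(100000L, 1000000L, 1000000000L)) {
      val p32 = collisionProbability(n, 32)
      val p64 = collisionProbability(n, 64)
      println(f"n=$n%d  32-bit: $p32%.4f  64-bit: $p64%.3e")
    }
  }
}
```

Under these assumptions a 32-bit hash already makes a collision more likely than not at around a hundred thousand keys, while a 64-bit hash keeps the probability small until the key count reaches the billions.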
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48842883 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48843020
QA results for PR 1313:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16602/consoleFull
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48843123 How much does memory usage increase overall? Is there a detailed comparison?
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48843270 I think the most significant change is the Rating object. It goes from 8 (ref) + 8 (object) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 (object) + 8 (long) + 8 (long) + 8 (double) = 40 bytes.
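The per-object arithmetic in the comment above can be written out as a small calculation. This is a rough sketch using the same assumed byte counts as the comment (8-byte reference, 8-byte object header, 8-byte double), with each ID counted as 4 bytes for an Int and 8 bytes for a Long. Real JVM layouts depend on header size, compressed references and alignment, and the dataset size below is hypothetical.

```scala
object RatingSizeSketch {
  // Byte counts taken from the estimate in the comment; they are assumptions,
  // not measurements of an actual JVM object layout.
  val refBytes = 8
  val headerBytes = 8
  val doubleBytes = 8

  /** Estimated bytes per Rating for a given ID width (two IDs per Rating). */
  def bytesPerRating(idBytes: Int): Int =
    refBytes + headerBytes + 2 * idBytes + doubleBytes

  def main(args: Array[String]): Unit = {
    val withInt = bytesPerRating(4)  // 8 + 8 + 4 + 4 + 8
    val withLong = bytesPerRating(8) // 8 + 8 + 8 + 8 + 8
    val ratings = 100000000L         // hypothetical number of ratings
    val extraMiB = (withLong - withInt) * ratings / (1024 * 1024)
    println(s"Int IDs: $withInt bytes, Long IDs: $withLong bytes per Rating")
    println(s"Extra memory for $ratings ratings: about $extraMiB MiB")
  }
}
```

On these figures the change adds 8 bytes per Rating, a 25% increase for that object.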
[GitHub] spark pull request: [SPARK-2253] Aggregator: Disable partial aggre...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1191#issuecomment-48845073
A few comments on this:
- We probably can't break the existing combineByKey through a config setting. If people want to use this directly, they'll need to use another interface. Otherwise we can have combineByKey do this on the map side but not on the reduce side.
- Since spilling is on by default now, we should probably add an implementation for that too before merging.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48845188 I am reviewing it. Will comment on it later today.
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48845420
QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48846073 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/906
[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/906#issuecomment-48846227 Obsoleted by SBT build changes.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48846392 @mridulm I updated the patch; the order is now PROCESS_LOCAL -> NODE_LOCAL -> noPref / Speculative -> RACK_LOCAL -> NON_LOCAL
[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1394
SPARK-2363. Clean MLlib's sample data files
(Just made a PR for this, @mengxr was the reporter of:)
MLlib has sample data under several folders:
1) data/mllib
2) data/
3) mllib/data/*
Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-2363
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1394.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1394
commit 54313ddec6daa02c17ffd8d2b6425ab94e0a778a Author: Sean Owen so...@cloudera.com Date: 2014-07-13T17:19:22Z Move ML example data from /mllib/data/ and /data/ into /data/mllib/
[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48846547 QA tests have started for PR 1394. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16605/consoleFull
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859597
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -490,19 +488,19 @@ private[spark] class TaskSetManager(
 info.markSuccessful()
 removeRunningTask(tid)
 sched.dagScheduler.taskEnded(
-  tasks(index), Success, result.value, result.accumUpdates, info, result.metrics)
+  tasks(index), Success, result.value(), result.accumUpdates, info, result.metrics)
 if (!successful(index)) {
   tasksSuccessful += 1
-  logInfo("Finished TID %s in %d ms on %s (progress: %d/%d)".format(
-    tid, info.duration, info.host, tasksSuccessful, numTasks))
+  logInfo("Finished %s:%s (TID %d) in %d ms on %s (%d/%d)".format(
+    taskSet.id, info.id, info.taskId, info.duration, info.host, tasksSuccessful, numTasks))
   // Mark successful and stop if all the tasks have succeeded.
   successful(index) = true
   if (tasksSuccessful == numTasks) {
     isZombie = true
   }
 } else {
-  logInfo("Ignorning task-finished event for TID " + tid + " because task " +
-    index + " has already completed successfully")
+  logInfo("Ignorning task-finished event for " + taskSet.id + ":" + info.id +
--- End diff --
Ignorning
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859634
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -429,9 +425,11 @@ private[spark] class TaskSetManager(
 }
 val timeTaken = clock.getTime() - startTime
 addRunningTask(taskId)
- logInfo("Serialized task %s:%d as %d bytes in %d ms".format(
-   taskSet.id, index, serializedTask.limit, timeTaken))
- val taskName = "task %s:%d".format(taskSet.id, index)
+
+ val taskName = "task %s:%d.%d".format(taskSet.id, index, attemptNum)
--- End diff --
You can just use `"task %s:%s".format(taskSet.id, info.id)` to be consistent with other places like L494. This could become a little hard to keep track of if we decide to change the notation.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859639
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -521,14 +519,13 @@ private[spark] class TaskSetManager(
 info.markFailed()
 val index = info.index
 copiesRunning(index) -= 1
-if (!isZombie) {
-  logWarning("Lost TID %s (task %s:%d)".format(tid, taskSet.id, index))
-}
 var taskMetrics : TaskMetrics = null
-var failureReason: String = null
+
+val failureReason = s"Lost task ${taskSet.id}:$index (TID $tid) on executor ${info.host}: " +
--- End diff --
I think instead of `$index` you meant `${info.id}`
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859645
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -429,9 +425,11 @@ private[spark] class TaskSetManager(
 }
 val timeTaken = clock.getTime() - startTime
 addRunningTask(taskId)
- logInfo("Serialized task %s:%d as %d bytes in %d ms".format(
-   taskSet.id, index, serializedTask.limit, timeTaken))
- val taskName = "task %s:%d".format(taskSet.id, index)
+
+ val taskName = "task %s:%d.%d".format(taskSet.id, index, attemptNum)
--- End diff --
Another option is to just add a field `taskName` in either `TaskInfo` or `Task` itself, and just use that everywhere.
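One way to read the suggestion above is to define the task-name notation in a single place, so that log statements cannot drift apart. The sketch below uses a hypothetical stand-in class rather than Spark's actual `TaskInfo`. It assumes that `info.id` renders as "index.attempt", which is what the review comments imply but which is not shown in the quoted diff.

```scala
// Hypothetical stand-in for Spark's TaskInfo, reduced to the fields used here.
case class TaskInfoSketch(taskId: Long, index: Int, attempt: Int, host: String) {
  /** Assumed "index.attempt" form, e.g. "4.0". */
  def id: String = s"$index.$attempt"
}

object TaskNameSketch {
  /** The only place that knows the "taskSetId:index.attempt" notation. */
  def taskName(taskSetId: String, info: TaskInfoSketch): String =
    s"task $taskSetId:${info.id}"

  def main(args: Array[String]): Unit = {
    val info = TaskInfoSketch(taskId = 4L, index = 4, attempt = 0, host = "localhost")
    // Every log line reuses taskName instead of re-formatting the pieces.
    println(s"Finished ${taskName("0.0", info)} (TID ${info.taskId}) on ${info.host}")
  }
}
```

If the notation later changes, only `taskName` needs to be edited, which addresses the reviewer's concern about keeping the format consistent across log statements.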
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859653
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -570,19 +561,17 @@ private[spark] class TaskSetManager(
 }
 }
 if (printFull) {
-  val locs = ef.stackTrace.map(loc => "\tat %s".format(loc.toString))
-  logWarning("Loss was due to %s\n%s\n%s".format(
-    ef.className, ef.description, locs.mkString("\n")))
+  logWarning(failureReason)
 } else {
-  logInfo("Loss was due to %s [duplicate %d]".format(ef.description, dupCount))
+  logInfo(s"Lost task ${taskSet.id}:$index (TID $tid) on executor ${info.host}: " +
--- End diff --
same here
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1259#discussion_r14859650
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -536,23 +533,17 @@ private[spark] class TaskSetManager(
 // Not adding to failed executors for FetchFailed.
 isZombie = true
- case TaskKilled =>
-   // Not adding to failed executors for TaskKilled.
-   logWarning("Task %d was killed.".format(tid))
-
 case ef: ExceptionFailure =>
-   taskMetrics = ef.metrics.getOrElse(null)
+   taskMetrics = ef.metrics.orNull
   if (ef.className == classOf[NotSerializableException].getName()) {
     // If the task result wasn't serializable, there's no point in trying to re-execute it.
-    logError("Task %s:%s had a not serializable result: %s; not retrying".format(
-      taskSet.id, index, ef.description))
-    abort("Task %s:%s had a not serializable result: %s".format(
-      taskSet.id, index, ef.description))
+    logError("Task %s:%s (TID %d) had a not serializable result: %s; not retrying".format(
+      taskSet.id, index, tid, ef.description))
+    abort("Task %s:%s (TID %d) had a not serializable result: %s".format(
+      taskSet.id, index, tid, ef.description))
--- End diff --
same here
[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1360#issuecomment-48848235 QA tests have started for PR 1360. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48848292 Hi @rxin, I took a pass over the patch and the changes mostly look good. On a higher level point, I notice that we log this pattern `0.0:4.0 (TID 4 ...)` quite often, where the task ID is logged twice. If the user has no idea what the colon notation means, they might try to (as I did) interpret `4.0` as something else other than the task ID because `TID 4` is already there.
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48848503
QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48849034 Hi @CodingCat, looks good to me. My only doubt, which we discussed last, was whether we want to differentiate between tasks which have no locations at all and tasks which have a preferred location but none available. Currently both of these are held in the same data structure. @lirui-intel do you have any thoughts on this PR? You changed, and probably tested, some of this most recently.
[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48849070
QA results for PR 1394:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16605/consoleFull
[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1360#issuecomment-48850708
QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48850704 QA tests have started for PR 1393. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull
[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs
Github user falaki commented on the pull request: https://github.com/apache/spark/pull/1351#issuecomment-48850882 This is not a bad idea, especially considering that a file can be split across partitions. @marmbrus you suggested this feature. What do you think about Reynold's suggestion?
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1395 [SPARK-546] Add full outer join to RDD and DStream. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-546 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1395.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1395 commit 0575e2f45ec26eddf7470cc9fc67f52fc53120ff Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z Fix left outer join documentation comments. commit 217614dbe4540bb8155e3fbb6eae56f70795b28d Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions. commit 084b2d5edda993975ae2c1794ef8bf4dea2ae11b Author: Aaron Staple aaron.sta...@gmail.com Date: 2014-07-13T18:19:03Z [SPARK-546] Add full outer join to RDD and DStream.
[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1395#issuecomment-48851025 Can one of the admins verify this patch?
[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1351#issuecomment-48851059 Note that there are multiple problems. We can solve the problem of out of memory by simply limiting the length of a record. Ideally, csvRDD(RDD[String]) should just be one element per record, while csvFile should properly parse the file.
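To make the record-length limit mentioned above concrete, here is a minimal plain-Java sketch with no Spark dependency. `BoundedRecordReader`, `readRecord`, and `maxRecordLength` are hypothetical names chosen for illustration; they are not part of the pull request or of Spark's API, and the PR may well enforce the limit differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, for illustration only.
public class BoundedRecordReader {
    /**
     * Reads one newline-terminated record from the reader and returns it
     * without the newline. Returns null at end of input. Throws
     * IllegalStateException as soon as a record grows past maxRecordLength,
     * so an unterminated record cannot be buffered without bound.
     */
    public static String readRecord(Reader in, int maxRecordLength) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') {
                    return sb.toString();
                }
                if (sb.length() >= maxRecordLength) {
                    throw new IllegalStateException(
                        "Record longer than " + maxRecordLength + " characters");
                }
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: hand back a trailing unterminated record, if any.
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("a,b,c\nd,e,f\n");
        System.out.println(readRecord(in, 16)); // first record
        System.out.println(readRecord(in, 16)); // second record
    }
}
```

Failing fast is only one possible policy; truncating or skipping the oversized record would bound memory in the same way.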
[GitHub] spark pull request: [SQL] Whitelist more Hive tests.
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/1396 [SQL] Whitelist more Hive tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark moreTests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1396.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1396 commit ccd8f972e3b7b3e37e11f9bc0cf33edce4267a0d Author: Michael Armbrust mich...@databricks.com Date: 2014-07-11T03:16:46Z Whitelist more tests. commit 8b6001cedd936eaab707ca3d65ba2afa9ccf2edc Author: Michael Armbrust mich...@databricks.com Date: 2014-07-13T20:37:16Z Add golden files.
[GitHub] spark pull request: [SQL] Whitelist more Hive tests.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1396#issuecomment-48851429 QA tests have started for PR 1396. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16608/consoleFull
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-48852700 QA results for PR 1393: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): case class Rating(user: Long, product: Long, rating: Double) For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull
[GitHub] spark pull request: [SQL] Whitelist more Hive tests.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1396#issuecomment-48853489 QA results for PR 1396: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16608/consoleFull
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48853888 Jenkins, test this please
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48854002 QA tests have started for PR 1392. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16609/consoleFull
[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1392#issuecomment-48855862 QA results for PR 1392: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16609/consoleFull
[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...
Github user miccagiann commented on the pull request: https://github.com/apache/spark/pull/1311#issuecomment-48855958 Hello guys, I have provided Java examples for the following documentation files: mllib-clustering.md, mllib-collaborative-filtering.md, mllib-dimensionality-reduction.md, mllib-linear-methods.md, mllib-optimization.md. Enjoy, and do not hesitate to contact me with any remarks or corrections. Thanks, Michael
[GitHub] spark pull request: [SQL][CORE] SPARK-2102
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48857036 test this please
[GitHub] spark pull request: [SQL][CORE] SPARK-2102
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48857093 QA tests have started for PR 1377. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16610/consoleFull
[GitHub] spark pull request: [SPARK-2125] Add sort flag and move sort into ...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/1210#issuecomment-48859519 Hi Matei, thanks a lot for your review, I will change the code according to your comments.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859675 The code looks good to me. However, I think we can avoid the workaround (deserializing with the partition serde and then serializing with the table serde again). That workaround exists to fit the higher-level table scan (`TableScanOperator` in Shark), which has to provide a unique `ObjectInspector` for the downstream operators. Unlike `TableScanOperator`, `HiveTableScan` in `Spark-Hive` doesn't rely on `ObjectInspector`, and its output type is `GenericMutableRow`, so I think we could do the object conversion (from the raw type to a `Row` object) directly.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-48859674 If we actually want people to get information out of all those numbers, can we consider using a human readable format such as `Task(stageId = 1, taskId = 5, attempt = 0)` or something along those lines? (Do we actually need both task id and task index in general, by the way?)
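As a small illustration of the human-readable format suggested above, the following plain-Java sketch renders the three identifiers with their field names. `TaskLabel` and `format` are hypothetical names; the sketch does not reflect how Spark's task logging is actually implemented.

```java
// Hypothetical helper, for illustration only.
public class TaskLabel {
    // Renders task identifiers as named fields rather than a bare run of numbers.
    public static String format(int stageId, long taskId, int attempt) {
        return "Task(stageId = " + stageId
            + ", taskId = " + taskId
            + ", attempt = " + attempt + ")";
    }

    public static void main(String[] args) {
        // Same values as in the comment above.
        System.out.println(format(1, 5L, 0));
    }
}
```

Naming each field costs a few extra characters per log line but removes the need to remember the positional order of the numbers.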
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48859842 Also, since the Hive SerDe provides `lazy` parsing, we need to support column pruning when converting the `raw object` to a `Row`. Sorry, these are just some high-level comments.
[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1387#issuecomment-48859861 QA tests have started for PR 1387. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16611/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user lirui-intel commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-48859854 This looks good to me :) Just a reminder that when TaskSchedulerImpl calls TaskSetManager.resourceOffer, the maxLocality (changed to preferredLocality in this PR) doesn't always begin with PROCESS_LOCAL; rather, it only iterates through the valid levels in the TaskSetManager. That being said, we still need something like what this PR provides to delay tasks in the noPref list, because the valid levels are not re-computed when a task finishes. So they may contain, say, PROCESS_LOCAL even when such tasks have actually run out. I agree that we may want to separate tasks without preference from tasks whose preferred location is not available. I think it's possible (though maybe unlikely) that a TaskSetManager contains both kinds of tasks, e.g. with a narrow dependency where some of the parent RDD's partitions are cached.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48860018 @chenghao-intel I am not sure I understand your comment on column pruning. I think for a Hive table, we should use `ColumnProjectionUtils` to set needed columns. So, RCFile and ORC can just read needed columns from HDFS.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862289 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon // Create local references so that the outer object isn't serialized. val tableDesc = _tableDesc + val tableSerDeClass = tableDesc.getDeserializerClass + val broadcastedHiveConf = _broadcastedHiveConf val localDeserializer = partDeserializer val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc) - hivePartitionRDD.mapPartitions { iter = + hivePartitionRDD.mapPartitions { case iter = val hconf = broadcastedHiveConf.value.value val rowWithPartArr = new Array[Object](2) -// Map each tuple to a row object -iter.map { value = - val deserializer = localDeserializer.newInstance() - deserializer.initialize(hconf, partProps) - val deserializedRow = deserializer.deserialize(value) - rowWithPartArr.update(0, deserializedRow) - rowWithPartArr.update(1, partValues) - rowWithPartArr.asInstanceOf[Object] + +val partSerDe = localDeserializer.newInstance() +val tableSerDe = tableSerDeClass.newInstance() +partSerDe.initialize(hconf, partProps) +tableSerDe.initialize(hconf, tableDesc.getProperties) + +val tblConvertedOI = ObjectInspectorConverters.getConvertedOI( + partSerDe.getObjectInspector, tableSerDe.getObjectInspector, true) + .asInstanceOf[StructObjectInspector] +val partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter( + partSerDe.getObjectInspector, tblConvertedOI) + +// This is done per partition, and unnecessary to put it in the iterations (in iter.map). +rowWithPartArr.update(1, partValues) + +// Map each tuple to a row object. 
+if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) { + iter.map { case value => +rowWithPartArr.update(0, partSerDe.deserialize(value)) +rowWithPartArr.asInstanceOf[Object] + } +} else { + iter.map { case value => +val deserializedRow = { + // If partition schema does not match table schema, update the row to match. + val convertedRow = + partTblObjectInspectorConverter.convert(partSerDe.deserialize(value)) + + // If conversion was performed, convertedRow will be a standard Object, but if + // conversion wasn't necessary, it will still be lazy. We can't have both across + // partitions, so we serialize and deserialize again to make it lazy. + if (tableSerDe.isInstanceOf[OrcSerde]) { +convertedRow + } else { +convertedRow match { + case _: LazyStruct => convertedRow + case _: HiveColumnarStruct => convertedRow + case _ => tableSerDe.deserialize( + tableSerDe.asInstanceOf[Serializer].serialize(convertedRow, tblConvertedOI)) --- End diff -- As mentioned by @chenghao-intel, can we avoid it?
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862300 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon // Create local references so that the outer object isn't serialized. val tableDesc = _tableDesc + val tableSerDeClass = tableDesc.getDeserializerClass + val broadcastedHiveConf = _broadcastedHiveConf val localDeserializer = partDeserializer val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc) - hivePartitionRDD.mapPartitions { iter = + hivePartitionRDD.mapPartitions { case iter = val hconf = broadcastedHiveConf.value.value val rowWithPartArr = new Array[Object](2) -// Map each tuple to a row object -iter.map { value = - val deserializer = localDeserializer.newInstance() - deserializer.initialize(hconf, partProps) - val deserializedRow = deserializer.deserialize(value) - rowWithPartArr.update(0, deserializedRow) - rowWithPartArr.update(1, partValues) - rowWithPartArr.asInstanceOf[Object] + +val partSerDe = localDeserializer.newInstance() +val tableSerDe = tableSerDeClass.newInstance() +partSerDe.initialize(hconf, partProps) +tableSerDe.initialize(hconf, tableDesc.getProperties) + +val tblConvertedOI = ObjectInspectorConverters.getConvertedOI( + partSerDe.getObjectInspector, tableSerDe.getObjectInspector, true) + .asInstanceOf[StructObjectInspector] +val partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter( + partSerDe.getObjectInspector, tblConvertedOI) + +// This is done per partition, and unnecessary to put it in the iterations (in iter.map). +rowWithPartArr.update(1, partValues) + +// Map each tuple to a row object. 
+if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) { + iter.map { case value => +rowWithPartArr.update(0, partSerDe.deserialize(value)) +rowWithPartArr.asInstanceOf[Object] + } +} else { + iter.map { case value => +val deserializedRow = { + // If partition schema does not match table schema, update the row to match. + val convertedRow = + partTblObjectInspectorConverter.convert(partSerDe.deserialize(value)) + + // If conversion was performed, convertedRow will be a standard Object, but if + // conversion wasn't necessary, it will still be lazy. We can't have both across + // partitions, so we serialize and deserialize again to make it lazy. + if (tableSerDe.isInstanceOf[OrcSerde]) { +convertedRow --- End diff -- Why do we need to handle ORC separately?
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14862338 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon // Create local references so that the outer object isn't serialized. val tableDesc = _tableDesc + val tableSerDeClass = tableDesc.getDeserializerClass + val broadcastedHiveConf = _broadcastedHiveConf val localDeserializer = partDeserializer val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc) - hivePartitionRDD.mapPartitions { iter = + hivePartitionRDD.mapPartitions { case iter = val hconf = broadcastedHiveConf.value.value val rowWithPartArr = new Array[Object](2) -// Map each tuple to a row object -iter.map { value = - val deserializer = localDeserializer.newInstance() - deserializer.initialize(hconf, partProps) - val deserializedRow = deserializer.deserialize(value) - rowWithPartArr.update(0, deserializedRow) - rowWithPartArr.update(1, partValues) - rowWithPartArr.asInstanceOf[Object] + +val partSerDe = localDeserializer.newInstance() +val tableSerDe = tableSerDeClass.newInstance() +partSerDe.initialize(hconf, partProps) +tableSerDe.initialize(hconf, tableDesc.getProperties) + +val tblConvertedOI = ObjectInspectorConverters.getConvertedOI( + partSerDe.getObjectInspector, tableSerDe.getObjectInspector, true) + .asInstanceOf[StructObjectInspector] +val partTblObjectInspectorConverter = ObjectInspectorConverters.getConverter( + partSerDe.getObjectInspector, tblConvertedOI) + +// This is done per partition, and unnecessary to put it in the iterations (in iter.map). +rowWithPartArr.update(1, partValues) + +// Map each tuple to a row object. 
+if (partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) { + iter.map { case value => +rowWithPartArr.update(0, partSerDe.deserialize(value)) +rowWithPartArr.asInstanceOf[Object] + } +} else { + iter.map { case value => +val deserializedRow = { + // If partition schema does not match table schema, update the row to match. + val convertedRow = + partTblObjectInspectorConverter.convert(partSerDe.deserialize(value)) + + // If conversion was performed, convertedRow will be a standard Object, but if + // conversion wasn't necessary, it will still be lazy. We can't have both across + // partitions, so we serialize and deserialize again to make it lazy. + if (tableSerDe.isInstanceOf[OrcSerde]) { +convertedRow + } else { +convertedRow match { --- End diff -- I think we need to comment why we need to do this pattern matching. Also, why do we handle `LazyStruct` and `ColumnarStruct` specially? There are similar classes, e.g. `LazyBinaryStruct` and `LazyBinaryColumnarStruct`.
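The diff quoted in the review comments above moves the deserializer's `newInstance()` and `initialize(...)` calls out of `iter.map` so that they run once per partition instead of once per record. The plain-Java sketch below shows that general pattern only; `PerPartitionSetup`, `mapPartition`, and `newDecoder` are hypothetical names, and a `List` stands in for the partition iterator. It is not the Hive SerDe API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical illustration of per-partition setup, not Spark or Hive code.
public class PerPartitionSetup {
    // Builds the decoder once for the whole partition and reuses it for every record.
    public static <T, R> List<R> mapPartition(List<T> records,
                                              Supplier<Function<T, R>> newDecoder) {
        Function<T, R> decoder = newDecoder.get(); // one-time, potentially costly setup
        List<R> out = new ArrayList<>(records.size());
        for (T record : records) {
            out.add(decoder.apply(record)); // per-record work only
        }
        return out;
    }

    public static void main(String[] args) {
        AtomicInteger setups = new AtomicInteger();
        // The supplier stands in for creating and initializing a deserializer.
        Supplier<Function<String, Integer>> newDecoder = () -> {
            setups.incrementAndGet();
            return s -> s.length();
        };
        List<Integer> lengths = mapPartition(Arrays.asList("a", "bb", "ccc"), newDecoder);
        System.out.println(lengths + " decoded with " + setups.get() + " setup call(s)");
    }
}
```

The counter makes the saving visible: the setup cost is paid once per partition, however many records the partition holds.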
[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1394#issuecomment-48860407 @srowen This looks good to me and thank you for updating the docs as well!
[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1394
[GitHub] spark pull request: [SQL][CORE] SPARK-2102
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1377#issuecomment-48860618 QA results for PR 1377: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16610/consoleFull
[GitHub] spark pull request: remove not used test in src/main
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1397#issuecomment-48861087 QA tests have started for PR 1397. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16612/consoleFull
[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/1385#issuecomment-48861711 @rxin and @aarondav, yeah, the master branch deadlocks. It seems the locks from #1273 and Hadoop-10456 lead to the problem. When running a Hive SQL self join, hql(SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)), the program gets stuck. I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; that way, we do not need the lock in #1273.
[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1311#discussion_r14862906 --- Diff: docs/mllib-clustering.md --- @@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE) All of MLlib's methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by -calling `.rdd()` on your `JavaRDD` object. +calling `.rdd()` on your `JavaRDD` object. A standalone application example +that is equivalent to the provided example in Scala is given bellow: + +{% highlight java %} +import org.apache.spark.api.java.*; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.function.Function; +import org.apache.spark.mllib.clustering.KMeans; +import org.apache.spark.mllib.clustering.KMeansModel; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.mllib.linalg.Vector; + +public class KMeansExample { + public static void main(String[] args) { +SparkConf conf = new SparkConf().setAppName("K-means Example"); +JavaSparkContext sc = new JavaSparkContext(conf); + +// Load and parse data +String path = "{SPARK_HOME}/data/kmeans_data.txt"; --- End diff -- We moved all data files to `data/mllib` in #1394. Could you merge the current master and update the path here and in other places? Thanks!
[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1311#discussion_r14862907 --- Diff: docs/mllib-clustering.md --- @@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE) All of MLlib's methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by -calling `.rdd()` on your `JavaRDD` object. +calling `.rdd()` on your `JavaRDD` object. A standalone application example +that is equivalent to the provided example in Scala is given bellow: + +{% highlight java %} +import org.apache.spark.api.java.*; +import org.apache.spark.SparkConf; +import org.apache.spark.api.java.function.Function; +import org.apache.spark.mllib.clustering.KMeans; +import org.apache.spark.mllib.clustering.KMeansModel; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.mllib.linalg.Vector; + +public class KMeansExample { + public static void main(String[] args) { +SparkConf conf = new SparkConf().setAppName("K-means Example"); +JavaSparkContext sc = new JavaSparkContext(conf); + +// Load and parse data +String path = "{SPARK_HOME}/data/kmeans_data.txt"; +JavaRDD<String> data = sc.textFile(path); +JavaRDD<Vector> parsedData = data.map( + new Function<String, Vector>() { +public Vector call(String s) { + String[] sarray = s.split(" "); + double[] values = new double[sarray.length]; + for (int i = 0; i < sarray.length; i++) +values[i] = Double.parseDouble(sarray[i]); + return Vectors.dense(values); +} + } +); + +// Cluster the data into two classes using KMeans +int numClusters = 2; +int numIterations = 20; +KMeansModel clusters = KMeans.train(JavaRDD.toRDD(parsedData), numClusters, numIterations); --- End diff -- `JavaRDD.toRDD(parsedData)` -> `parsedData.rdd()` (simpler)
[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1311#discussion_r14862911

    --- Diff: docs/mllib-collaborative-filtering.md ---
    @@ -99,7 +99,88 @@ val model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
     All of MLlib's methods use Java-friendly types, so you can import and call them there the same
     way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
     Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
    -calling `.rdd()` on your `JavaRDD` object.
    +calling `.rdd()` on your `JavaRDD` object. A standalone application example
    +that is equivalent to the provided example in Scala is given bellow:
    +
    +{% highlight java %}
    +import scala.Product;
    +import scala.Tuple2;
    +
    +import org.apache.spark.api.java.*;
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.mllib.recommendation.ALS;
    +import org.apache.spark.mllib.recommendation.Rating;
    +import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    +
    +public class CollaborativeFiltering {
    +  public static void main(String[] args) {
    +    SparkConf conf = new SparkConf().setAppName("Collaborative Filtering Example");
    +    JavaSparkContext sc = new JavaSparkContext(conf);
    +
    +    // Load and parse the data
    +    String path = "/home/michael/workspace/spark/mllib/data/als/test.data";
    --- End diff --

    Please use a relative path.
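One possible way to act on the relative-path request, sketched with the JDK only: keep a repository-relative path in the example and resolve it against a base directory at run time. The class `DataPathExample`, the helper `resolveDataPath`, and the choice of base directory (first program argument, else the working directory) are hypothetical names and choices made for illustration. The relative path `mllib/data/als/test.data` assumes the Spark checkout root is `/home/michael/workspace/spark`, which the diff does not state.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DataPathExample {

    // Hypothetical helper: resolve a repository-relative data path against a base directory.
    // Rejects absolute inputs so a machine-specific path cannot slip back in.
    public static Path resolveDataPath(String baseDir, String relativePath) {
        Path relative = Paths.get(relativePath);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException("Expected a relative path: " + relativePath);
        }
        return Paths.get(baseDir).resolve(relative).normalize();
    }

    public static void main(String[] args) {
        // Assumption: the data file lives at this location relative to the Spark checkout root.
        String relativePath = "mllib/data/als/test.data";

        // Assumption: the base directory is the first argument, else the current working directory.
        String baseDir = args.length > 0 ? args[0] : System.getProperty("user.dir");

        System.out.println(resolveDataPath(baseDir, relativePath));
    }
}
```

In the documentation example itself, the simpler option is probably to pass the relative string straight to `sc.textFile(...)` and note that the program should be run from the Spark root directory.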