[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48832990
  
@yhuai can you take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14856885
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

is the pattern matching here necessary?
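For context, a self-contained Scala sketch (plain standard-library code, not the TableReader diff itself) showing that a bare-variable `case` pattern behaves the same as a plain lambda:

    object PatternMatchLambdaSketch {
      def main(args: Array[String]): Unit = {
        val xs = List(1, 2, 3)

        // Plain lambda: binds each element directly.
        val plain = xs.map { x => x + 1 }

        // Pattern-matching anonymous function: `case x` is an irrefutable variable
        // pattern, so it does the same thing -- the `case` is redundant here.
        val matched = xs.map { case x => x + 1 }

        assert(plain == matched) // List(2, 3, 4) either way
      }
    }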


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14857013
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 
 import scala.collection.JavaConverters._
 
+private object SQLConf {
+  @transient protected[spark] val confSettings = 
java.util.Collections.synchronizedMap(
--- End diff --

Why singleton? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14857017
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 
 import scala.collection.JavaConverters._
 
+private object SQLConf {
+  @transient protected[spark] val confSettings = 
java.util.Collections.synchronizedMap(
--- End diff --

In one of the earlier commits, I tried the approach of making an operator 
mix in SQLConf -- hence the singleton object to hold the settings. Now I just 
have that operator take a SQLContext in order to get access to the conf, so if 
you want I could remove this singleton. 
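For context, a schematic Scala contrast of the two designs being discussed; names such as GlobalConf, SqlConfLike, and MyOperator are placeholders, not the actual PR code:

    // (1) Singleton: settings live in one shared, process-wide map.
    object GlobalConf {
      val settings: java.util.Map[String, String] =
        java.util.Collections.synchronizedMap(new java.util.HashMap[String, String]())
    }

    // (2) Instance-scoped: the operator is handed the object that owns the conf,
    // so different contexts can carry different settings.
    class SqlConfLike {
      private val settings =
        java.util.Collections.synchronizedMap(new java.util.HashMap[String, String]())
      def set(key: String, value: String): Unit = { settings.put(key, value) }
      def get(key: String, default: String): String =
        Option(settings.get(key)).getOrElse(default)
    }

    class MyOperator(conf: SqlConfLike) {
      def sizeThreshold: Long = conf.get("size.threshold", "10485760").toLong
    }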


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Delete the useless import

2014-07-13 Thread XuTingjun
Github user XuTingjun closed the pull request at:

https://github.com/apache/spark/pull/1284


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14857026
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 
 import scala.collection.JavaConverters._
 
+private object SQLConf {
+  @transient protected[spark] val confSettings = 
java.util.Collections.synchronizedMap(
--- End diff --

Let's remove singletons. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14857044
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -47,6 +47,13 @@ private[sql] abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   /**
* Uses the ExtractEquiJoinKeys pattern to find joins where at least 
some of the predicates can be
* evaluated by matching hash keys.
+   *
+   * This strategy applies a simple optimization based on the estimates of 
the physical sizes of
+   * the two join sides.  When planning a [[execution.BroadcastHashJoin]], 
if one side has an
+   * estimated physical size smaller than the user-settable threshold
+   * `spark.sql.auto.convert.join.size`, the planner would mark it as the 
''build'' relation and
--- End diff --

Document more explicitly that the table will get broadcast 
(instead of just saying "build"; you can still have a build side in a shuffle join)
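For context, a schematic sketch of the decision that doc comment describes -- illustrative only, not the actual SparkStrategies code:

    // If one side's estimated physical size is below the user-settable threshold,
    // that side is broadcast to every executor and used as the hash join's build
    // side; otherwise a shuffle-based join is planned.
    case class RelationEstimate(name: String, estimatedBytes: Long)

    def chooseJoin(left: RelationEstimate,
                   right: RelationEstimate,
                   autoConvertJoinSizeBytes: Long): String = {
      if (right.estimatedBytes <= autoConvertJoinSizeBytes)
        s"BroadcastHashJoin: broadcast ${right.name} (build side) to all executors"
      else if (left.estimatedBytes <= autoConvertJoinSizeBytes)
        s"BroadcastHashJoin: broadcast ${left.name} (build side) to all executors"
      else
        "shuffle-based join: neither side is small enough to broadcast"
    }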


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48834061
  
I brought this up to date. @andrewor14 can you take a look at this? I'd 
want to merge this quickly so I can submit my other scheduler patches too. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48834107
  
QA tests have started for PR 1259. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16600/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-13 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1385#issuecomment-48834426
  
Is this related to the other conf-related concurrency issue that was fixed 
recently? https://github.com/apache/spark/pull/1273


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835312
  
We have gone over this in the past ... it is suboptimal to make it a linear
function of executor/driver memory.
Overhead is a function of the number of executors, number of opened files,
shuffle vm pressure, etc.
It is NOT a function of executor memory, which is why it is separately
configured.
 On 13-Jul-2014 11:16 am, UCB AMPLab notificati...@github.com wrote:

 Can one of the admins verify this patch?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48832590.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835447
  
That makes sense, but then it doesn't explain why a constant amount works 
for a given job when executor memory is low, and then doesn't work when it is 
high. This has also been my experience and I don't have a great grasp on why it 
would be. More threads and open files in a busy executor? It goes indirectly 
with how big you need your executor to be, but not directly.

Nishkam, do you have a sense of how much extra memory you had to configure 
to get it to work when executor memory increased? Is it pretty marginal, or 
quite substantial?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835560
  
Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB 
executor. Less than 400MB for a 2GB executor. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835566
  
The default constant is actually a lower bound to account for other
overheads (since yarn will aggressively kill tasks)... Unfortunately we
have not sized this properly, and don't have a good recommendation on how to
set it.

This is compounded by magic constants in spark for various IO ops,
non-deterministic network behaviour (we should be able to estimate an upper bound
here = 2x number of workers), vm memory use (shuffle output is mmap'ed
whole ... going foul with yarn virtual mem limits) and so on.

Hence sizing this is, unfortunately, app specific.
 On 13-Jul-2014 2:34 pm, Sean Owen notificati...@github.com wrote:

 That makes sense, but then it doesn't explain why a constant amount works
 for a given job when executor memory is low, and then doesn't work when it
 is high. This has also been my experience and I don't have a great grasp 
on
 why it would be. More threads and open files in a busy executor? It goes
 indirectly with how big you need your executor to be, but not directly.

 Nishkam do you have a sense of how much extra memory you had to configure
 to get it to work when executor memory increased? is it pretty marginal, 
or
 quite substantial?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48835447.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48835596
  
QA results for PR 1259:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class TaskRunner(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16600/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835618
  
That would be a function of your jobs.
Other apps would have drastically different characteristics ... which is
why we can't generalize to a simple fraction of executor memory.
It actually buys us nothing in the general case ... jobs will continue to fail
when it is incorrect, while wasting a lot of memory.
On 13-Jul-2014 2:38 pm, nishkamravi2 notificati...@github.com wrote:

 Yes, I'm aware of the discussion on this issue in the past. Experiments
 confirm that overhead is a function of executor memory. Why and how can be
 figured out with due diligence and analysis. It may be a function of other
 parameters and the function may be fairly complex. However, the
 proportionality is undeniable. Besides, we are only adjusting the default
 value and making it a bit more resilient. The memory_overhead parameter 
can
 still be configured by the developer separately. The constant additive
 factor makes little sense (empirically).

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48835500.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835656
  
The basic issue is you are trying to model overhead using the wrong
variable... It has no actual correlation with executor memory (other than
vm overheads as the heap increases).
On 13-Jul-2014 2:44 pm, Mridul Muralidharan mri...@gmail.com wrote:

 That would be a function of your jobs.
 Other apps would have a drastically different characteristics ... Which is
 why we can't generalize to a simple fraction of executor memory.
 It actually buys us nothing in general case ... Jobs will continue to fail
 when it is incorrect : while wasting a lot of memory
 On 13-Jul-2014 2:38 pm, nishkamravi2 notificati...@github.com wrote:

 Yes, I'm aware of the discussion on this issue in the past. Experiments
 confirm that overhead is a function of executor memory. Why and how can 
be
 figured out with due diligence and analysis. It may be a function of 
other
 parameters and the function may be fairly complex. However, the
 proportionality is undeniable. Besides, we are only adjusting the default
 value and making it a bit more resilient. The memory_overhead parameter 
can
 still be configured by the developer separately. The constant additive
 factor makes little sense (empirically).

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48835500.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread Mridul Muralidharan
You are lucky :-) for some of our jobs, in an 8GB container, the overhead is
1.8GB!
On 13-Jul-2014 2:40 pm, nishkamravi2 g...@git.apache.org wrote:

 Github user nishkamravi2 commented on the pull request:

 https://github.com/apache/spark/pull/1391#issuecomment-48835560

 Sean, the memory_overhead is fairly substantial. More than 2GB for a
 30GB executor. Less than 400MB for a 2GB executor.


 ---
 If your project is set up for it, you can reply to this email and have your
 reply appear on GitHub as well. If your project does not have this feature
 enabled and wishes so, or if the feature is enabled but not working, please
 contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
 with INFRA.
 ---



[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835727
  
Yes of course, lots of settings' best or even usable values are ultimately 
app-specific. Ideally, defaults work for lots of cases. A flat value is the 
simplest of models, and anecdotally, the current default value does not work in 
medium- to large-memory YARN jobs. You can increase the default, but then the 
overhead gets silly for small jobs -- 1GB? And all of these are not-uncommon 
use cases.

None of that implies the overhead logically scales with container memory. 
Empirically, it may do, and that's useful. Until the magic explanatory variable 
is found, which one is less problematic for end users -- a flat constant that 
frequently has to be tuned, or an imperfect model that could get it right in 
more cases? 

That said, it is kind of a developer API change and feels like something to 
not keep reimagining.

Nishkam, can you share any anecdotal evidence about how the overhead 
changes? If executor memory is the only variable changing, that seems to be 
evidence against it being driven by other factors, but I don't know if that's 
what we know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835769
  
You are lucky :-) for some of our jobs, in an 8GB container, the overhead is
1.8GB!
On 13-Jul-2014 2:41 pm, nishkamravi2 notificati...@github.com wrote:

 Sean, the memory_overhead is fairly substantial. More than 2GB for a 30GB
 executor. Less than 400MB for a 2GB executor.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48835560.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835852
  
Experimented with three different workloads and noticed common patterns of 
proportionality. 
Other parameters were left unchanged and only executor size was increased. 
The memory_overhead ranges between 0.05 and 0.08 times executor memory. 
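To make the alternatives concrete, a small illustrative sketch comparing a flat default with a fraction-of-executor-memory default; 384 MB was roughly the flat YARN overhead default at the time, and 0.07 is just a sample value inside the 0.05-0.08 range reported above, not the PR's final choice:

    def flatOverheadMb: Int = 384

    def fractionOverheadMb(executorMemoryMb: Int, fraction: Double = 0.07): Int =
      math.max(384, (fraction * executorMemoryMb).toInt)

    // 2 GB executor:  flat = 384 MB, fraction = max(384, 143)  = 384 MB
    // 30 GB executor: flat = 384 MB, fraction = max(384, 2150) = 2150 MB,
    // in line with the "more than 2GB for a 30GB executor" observation above.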


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48835881
  
That's why the parameter is configurable. If you have jobs that cause 
20-25% memory_overhead, default values will not help. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48836123
  
You are missing my point, I think ... To give an unscientific, anecdotal example:
our GBDT experiments, which run on about 22 nodes, need no tuning,
while our collaborative filtering experiments, running on 300 nodes, require
much higher overhead.
But QR factorization on the same 300 nodes needs much lower overhead.
The values are all over the place and very app specific.

In an effort to ensure jobs always run to completion, setting overhead to a
high fraction of executor memory might ensure successful completion, but at a
high performance loss and with substandard scaling.

I would like a good default estimate of overhead ... but that is not a
fraction of executor memory.
Instead of trying to model the overhead using executor memory, it would be better
to look at the actual parameters which influence it (as in, look at the code and
figure it out; followed by validation and tuning of course) and use that as the
estimate.
 On 13-Jul-2014 2:58 pm, nishkamravi2 notificati...@github.com wrote:

 That's why the parameter is configurable. If you have jobs that cause
 20-25% memory_overhead, default values will not help.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1391#issuecomment-48835881.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread nishkamravi2
Github user nishkamravi2 commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48836220
  
Mridul, I think you are missing the point. We understand that this 
parameter will in a lot of cases have to be specified by the developer, since 
there is no easy way to model it (that's why we are retaining it as a 
configurable parameter). However, the question is what a good default value 
would be. 

I would like a good default estimate of overhead ... But that is not
fraction of executor memory. 

You are mistaken. It may not be a directly correlated variable, but it is 
most certainly indirectly correlated. And it is probably correlated to other 
app-specific parameters as well. 

Until the magic explanatory variable is found, which one is less 
problematic for end users -- a flat constant that frequently has to be tuned, 
or an imperfect model that could get it right in more cases?

This is the right point of view.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48836408
  
On Jul 13, 2014 3:16 PM, nishkamravi2 notificati...@github.com wrote:

 Mridul, I think you are missing the point. We understand that this
parameter will in a lot of cases have to be specified by the developer,
since there is no easy way to model it (that's why we are retaining it as a
configurable parameter). However, the question is what would be a good
default value be.


It does not help to estimate using the wrong variable.
Any correlation which exists is incidental and app specific, as I
elaborated before.

The only actual correlation between executor memory and overhead is java vm
overheads in managing very large heaps (and that is very high as a
fraction). Other factors in spark have far higher impact than this.

 I would like a good default estimate of overhead ... But that is not
 fraction of executor memory. 

 You are mistaken. It may not be a directly correlated variable, but it is
most certainly indirectly correlated. And it is probably correlated to
other app-specific parameters as well.

Please see above.


 Until the magic explanatory variable is found, which one is less
problematic for end users -- a flat constant that frequently has to be
tuned, or an imperfect model that could get it right in more cases?

 This is the right point of view.

Which has been our view even in previous discussions :-)
It is unfortunate that we did not approximate this better from the start
and went with the constant from the prototype impl.

Note that this estimation would be very volatile to spark internals


 —
 Reply to this email directly or view it on GitHub.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48836619
  
Hmm, looks like some of my responses to Sean via mail reply have not shown 
up here ... Maybe mail gateway delays ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Modify default YARN memory_overhead-- from an ...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1391#issuecomment-48836879
  
Since this is a recurring nightmare for our users, let me try to list down
the factors which influence overhead, given the current spark codebase state,
in the jira when I am back at my desk ... and we can add to that and model
from there (I won't be able to lead the effort, unfortunately, so it
would be great if you or Sean can).

If it so happens that at the end of the exercise it is a linear function of memory,
I am fine with it, as long as we decide based on actual data :-)
On 13-Jul-2014 3:26 pm, Mridul Muralidharan mri...@gmail.com wrote:


 On Jul 13, 2014 3:16 PM, nishkamravi2 notificati...@github.com wrote:
 
  Mridul, I think you are missing the point. We understand that this
 parameter will in a lot of cases have to be specified by the developer,
 since there is no easy way to model it (that's why we are retaining it as 
a
 configurable parameter). However, the question is what would be a good
 default value be.
 

 It does not help to estimate using the wrong variable.
 Any correlation which exists are incidental and app specific, as I
 elaborated before.

 The only actual correlation between executor memory and overhead is java
 vm overheads in managing very large heaps (and that is very high as a
 fraction). Other factors in spark have far higher impact than this.

  I would like a good default estimate of overhead ... But that is not
  fraction of executor memory. 
 
  You are mistaken. It may not be a directly correlated variable, but it
 is most certainly indirectly correlated. And it is probably correlated to
 other app-specific parameters as well.

 Please see above.

 
  Until the magic explanatory variable is found, which one is less
 problematic for end users -- a flat constant that frequently has to be
 tuned, or an imperfect model that could get it right in more cases?
 
  This is the right point of view.

 Which has been our view even in previous discussions :-)
 It is unfortunate that we did not approximate this better from the start
 and went with the constant from the prototype.l impl.

 Note that this estimation would be very volatile to spark internals

 
  —
  Reply to this email directly or view it on GitHub.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1354#issuecomment-48837932
  
@rxin done !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1354#issuecomment-48837923
  
QA tests have started for PR 1354. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16601/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1392

[SPARK-2290] Worker should directly use its own sparkHome instead of 
appDesc.sparkHome when LaunchExecutor

Worker should directly use its own sparkHome instead of appDesc.sparkHome 
when LaunchExecutor

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-2290

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1392.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1392


commit d3072fc05c7c20ec9d90732db2b9b26a4d27e290
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:50:14Z

Update ApplicationDescription.scala

commit 78ec6bc8c5d1af64ca21e1a231b47911df6d4f90
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:52:34Z

Update JsonProtocol.scala

commit 95e6ccc354167117430ce4cb7b2f5063a454ff1d
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:54:55Z

Update TestClient.scala

commit 508dcb65d04e3f12f99e03572a1cc277e7f1aeca
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T11:58:01Z

Update SparkDeploySchedulerBackend.scala

commit 6d6700aaad941779485eee2c35c4ab0cd278529e
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:01:40Z

Update Worker.scala

commit c360154ae5b03e7854d63573494fc6113295a7ec
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:04:16Z

Update JsonProtocolSuite.scala

commit 6febb215fb73735760fae957a4e71e2a61c17c77
Author: YanTangZhai tyz0...@163.com
Date:   2014-07-13T12:07:35Z

Update ExecutorRunnerTest.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48839494
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48839557
  
#1244


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48839668
  
fix #1244 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1354#issuecomment-48839833
  
QA results for PR 1354:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16601/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1244#issuecomment-48839912
  
I've fixed the compile problem. Please review and test again. Thanks very 
much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2325] Utils.getLocalDir had better chec...

2014-07-13 Thread YanTangZhai
Github user YanTangZhai commented on the pull request:

https://github.com/apache/spark/pull/1281#issuecomment-48840373
  
Hi @ash211, I think this change is needed. The method 
Utils.getLocalDir is used by functions such as HttpBroadcast, which is 
different from DiskBlockManager, so the two problems are different. Even though 
#1274 has been merged, the problem still exists. Please review again. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-48840541
  
QA tests have started for PR 1313. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16602/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1387#issuecomment-48841019
  
Right now `SparkContext.cleaner` does not take executor memory usage into account. 
This can cause Spark to fail when memory runs short.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1387#issuecomment-48841151
  
@srowen 
[Executor.scala#L253](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L253)
 handles exceptions, but a memory overflow (OutOfMemoryError) does not seem to be 
handled correctly.
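For context, a plain-Scala sketch (not Spark's Executor code) of why a typical handler misses this case: `scala.util.control.NonFatal` deliberately excludes `VirtualMachineError`, so an `OutOfMemoryError` is not matched unless caught explicitly:

    import scala.util.control.NonFatal

    def runTask(body: => Unit): Unit =
      try body
      catch {
        case NonFatal(t) =>
          // Ordinary task failures are matched here...
          println(s"handled non-fatal failure: $t")
        case oom: OutOfMemoryError =>
          // ...but OutOfMemoryError is fatal and must be matched explicitly if it
          // should be reported (e.g. marking the task failed) before rethrowing.
          println(s"out of memory: $oom")
          throw oom
      }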


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1393

SPARK-2465. Use long as user / item ID for ALS

I'd like to float this for consideration: use longs instead of ints for 
user and product IDs in the ALS implementation.

The main reason for this is that identifiers are not generally numeric at all, 
and will be hashed to an integer. (This is a separate issue.) Hashing to 32 
bits means collisions are likely after hundreds of thousands of users and 
items, which is not unrealistic. Hashing to 64 bits pushes this back to 
billions.
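A rough birthday-bound calculation makes the collision claim concrete (an illustrative aside, not part of the patch): with b-bit hashing, a collision reaches about 50% probability around 1.1774 * sqrt(2^b) distinct IDs.

    def fiftyPercentCollisionPoint(bits: Int): Double =
      1.1774 * math.sqrt(math.pow(2.0, bits))

    // 32-bit hashing: ~7.7e4 IDs, so at hundreds of thousands of users/items a
    //                 collision is essentially certain.
    // 64-bit hashing: ~5.1e9 IDs, i.e. billions.
    println(f"32-bit: ${fiftyPercentCollisionPoint(32)}%.3g")
    println(f"64-bit: ${fiftyPercentCollisionPoint(64)}%.3g")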

It would also mean numeric IDs that happen to be larger than the largest 
int can be used directly as identifiers.

On the downside of course: 8 bytes instead of 4 bytes of memory used per 
Rating.

Thoughts?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-2465

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1393


commit d4082ad7aa7468b63605d141f6d68e278983678a
Author: Sean Owen so...@cloudera.com
Date:   2014-07-13T14:57:15Z

Use long instead of int for user/product IDs




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48842883
  
QA tests have started for PR 1393. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-48843020
  
QA results for PR 1313:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16602/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48843123
  
How much does this increase memory overall? Do you have a detailed comparison?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---



[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48843270
  
I think the most significant change is the Rating object. It goes from 8 
(ref) + 8 (object) + 4 (int) + 4 (int) + 8 (double) = 32 bytes to 8 (ref) + 8 
(object) + 8 (long) + 8 (long) + 8 (double) = 40 bytes.
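The before/after definitions implied by that accounting (the Long version matches the case class listed in the QA output further down; sizes are approximate and depend on JVM padding):

    // Before: ints for user and product IDs
    // (~32 bytes per Rating per the rough accounting above: ref + object header + 4 + 4 + 8).
    case class RatingInt(user: Int, product: Int, rating: Double)

    // After: longs for user and product IDs
    // (~40 bytes: ref + object header + 8 + 8 + 8).
    case class RatingLong(user: Long, product: Long, rating: Double)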


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2253] Aggregator: Disable partial aggre...

2014-07-13 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1191#issuecomment-48845073
  
A few comments on this:
- We probably can't break the existing combineByKey through a config 
setting. If people want to use this directly, they'll need to use another 
interface. Otherwise we can have combineByKey do this on the map side but not 
on the reduce side.
- Since spilling is on by default now, we should probably add an 
implementation for that too before merging.
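For reference, `combineByKey` already exposes a `mapSideCombine` flag (assuming the 1.x overload that takes a Partitioner); the sketch below only shows existing usage for turning off map-side partial aggregation, not the new interface this PR would add:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object CombineByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("cbk"))
        val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

        // mapSideCombine = false: skip partial aggregation before the shuffle and
        // merge values only on the reduce side.
        val sums = pairs.combineByKey[Int](
          (v: Int) => v,                  // createCombiner
          (c: Int, v: Int) => c + v,      // mergeValue
          (c1: Int, c2: Int) => c1 + c2,  // mergeCombiners
          new HashPartitioner(2),
          mapSideCombine = false)

        println(sums.collect().toSeq)
        sc.stop()
      }
    }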


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48845188
  
I am reviewing it. Will comment on it later today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48845420
  
QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class Rating(user: Long, product: Long, rating: Double)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16603/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48846073
  
QA tests have started for PR 1393. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-07-13 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/906


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1949. Servlet 2.5 vs 3.0 conflict in SBT...

2014-07-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/906#issuecomment-48846227
  
Obsoleted by SBT build changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-48846392
  
@mridulm I updated the patch; the order is now

PROCESS_LOCAL -> NODE_LOCAL -> noPref / Speculative -> RACK_LOCAL -> NON_LOCAL
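For context, a simplified stand-in for the locality levels being ordered (not Spark's actual TaskLocality object); noPref sits between NODE_LOCAL and RACK_LOCAL, and NON_LOCAL corresponds to ANY:

    object LocalityLevel extends Enumeration {
      // Declaration order is the preference order described above.
      val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
    }

    // Lower id = more preferred when offering a task to an executor.
    val preferenceOrder = LocalityLevel.values.toSeq.sortBy(_.id)
    // => PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY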


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/1394

SPARK-2363. Clean MLlib's sample data files

(Just made a PR for this, @mengxr was the reporter of:)

MLlib has sample data under several folders:
1) data/mllib
2) data/
3) mllib/data/*
Per previous discussion with Matei Zaharia, we want to put them under 
`data/mllib` and clean outdated files.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-2363

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1394.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1394


commit 54313ddec6daa02c17ffd8d2b6425ab94e0a778a
Author: Sean Owen so...@cloudera.com
Date:   2014-07-13T17:19:22Z

Move ML example data from /mllib/data/ and /data/ into /data/mllib/




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1394#issuecomment-48846547
  
QA tests have started for PR 1394. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16605/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859597
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -490,19 +488,19 @@ private[spark] class TaskSetManager(
 info.markSuccessful()
 removeRunningTask(tid)
 sched.dagScheduler.taskEnded(
-  tasks(index), Success, result.value, result.accumUpdates, info, 
result.metrics)
+  tasks(index), Success, result.value(), result.accumUpdates, info, 
result.metrics)
 if (!successful(index)) {
   tasksSuccessful += 1
-  logInfo("Finished TID %s in %d ms on %s (progress: %d/%d)".format(
-tid, info.duration, info.host, tasksSuccessful, numTasks))
+  logInfo("Finished %s:%s (TID %d) in %d ms on %s (%d/%d)".format(
+taskSet.id, info.id, info.taskId, info.duration, info.host, tasksSuccessful, numTasks))
   // Mark successful and stop if all the tasks have succeeded.
   successful(index) = true
   if (tasksSuccessful == numTasks) {
 isZombie = true
   }
 } else {
-  logInfo("Ignorning task-finished event for TID " + tid + " because task " +
-index + " has already completed successfully")
+  logInfo("Ignorning task-finished event for " + taskSet.id + ":" + info.id +
--- End diff --

"Ignorning" -- typo, should be "Ignoring".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859634
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -429,9 +425,11 @@ private[spark] class TaskSetManager(
   }
   val timeTaken = clock.getTime() - startTime
   addRunningTask(taskId)
-  logInfo("Serialized task %s:%d as %d bytes in %d ms".format(
-taskSet.id, index, serializedTask.limit, timeTaken))
-  val taskName = "task %s:%d".format(taskSet.id, index)
+
+  val taskName = "task %s:%d.%d".format(taskSet.id, index, attemptNum)
--- End diff --

You can just use `"task %s:%s".format(taskSet.id, info.id)` to be 
consistent with other places like L494. This could become a little hard to keep 
track of if we decide to change the notation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859639
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -521,14 +519,13 @@ private[spark] class TaskSetManager(
 info.markFailed()
 val index = info.index
 copiesRunning(index) -= 1
-if (!isZombie) {
-  logWarning("Lost TID %s (task %s:%d)".format(tid, taskSet.id, index))
-}
 var taskMetrics : TaskMetrics = null
-var failureReason: String = null
+
+val failureReason = s"Lost task ${taskSet.id}:$index (TID $tid) on executor ${info.host}: " +
--- End diff --

I think instead of `$index` you meant `${info.id}`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859645
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -429,9 +425,11 @@ private[spark] class TaskSetManager(
   }
   val timeTaken = clock.getTime() - startTime
   addRunningTask(taskId)
-  logInfo("Serialized task %s:%d as %d bytes in %d ms".format(
-taskSet.id, index, serializedTask.limit, timeTaken))
-  val taskName = "task %s:%d".format(taskSet.id, index)
+
+  val taskName = "task %s:%d.%d".format(taskSet.id, index, attemptNum)
--- End diff --

Another option is to just add a field `taskName` in either `TaskInfo` or 
`Task` itself, and just use that everywhere.
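
For illustration only, a minimal Scala sketch of that option, assuming a `TaskInfo`-like class that already carries the task ID, partition index, and attempt number; the field and method names below are made up and are not the PR's actual code.

```scala
// Hypothetical TaskInfo-style holder; names are illustrative, not Spark's actual fields.
class TaskInfo(val taskId: Long, val index: Int, val attempt: Int, val host: String) {
  // "index.attempt" notation used by the surrounding log messages.
  def id: String = s"$index.$attempt"
  // Single place that decides how a task is rendered in logs.
  def taskName(stageId: String): String = s"task $stageId:$id (TID $taskId)"
}

// Hypothetical usage in a TaskSetManager-style log line:
// logInfo(s"Serialized ${info.taskName(taskSet.id)} as $bytes bytes in $timeTaken ms")
```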


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859653
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -570,19 +561,17 @@ private[spark] class TaskSetManager(
   }
 }
 if (printFull) {
-  val locs = ef.stackTrace.map(loc => "\tat %s".format(loc.toString))
-  logWarning("Loss was due to %s\n%s\n%s".format(
-ef.className, ef.description, locs.mkString("\n")))
+  logWarning(failureReason)
 } else {
-  logInfo("Loss was due to %s [duplicate %d]".format(ef.description, dupCount))
+  logInfo(s"Lost task ${taskSet.id}:$index (TID $tid) on executor ${info.host}: " +
--- End diff --

same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1259#discussion_r14859650
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -536,23 +533,17 @@ private[spark] class TaskSetManager(
 // Not adding to failed executors for FetchFailed.
 isZombie = true
 
-  case TaskKilled =>
-// Not adding to failed executors for TaskKilled.
-logWarning("Task %d was killed.".format(tid))
-
   case ef: ExceptionFailure =>
-taskMetrics = ef.metrics.getOrElse(null)
+taskMetrics = ef.metrics.orNull
 if (ef.className == classOf[NotSerializableException].getName()) {
   // If the task result wasn't serializable, there's no point in trying to re-execute it.
-  logError("Task %s:%s had a not serializable result: %s; not retrying".format(
-taskSet.id, index, ef.description))
-  abort("Task %s:%s had a not serializable result: %s".format(
-taskSet.id, index, ef.description))
+  logError("Task %s:%s (TID %d) had a not serializable result: %s; not retrying".format(
+taskSet.id, index, tid, ef.description))
+  abort("Task %s:%s (TID %d) had a not serializable result: %s".format(
+taskSet.id, index, tid, ef.description))
--- End diff --

same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-48848235
  
QA tests have started for PR 1360. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48848292
  
Hi @rxin, I took a pass over the patch and the changes mostly look good. On 
a higher-level point, I notice that we log the pattern `0.0:4.0 (TID 4 ...)` 
quite often, where the task ID is effectively logged twice. If users have no idea what 
the colon notation means, they might try to (as I did) interpret `4.0` as 
something other than the task ID, because `TID 4` is already there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48848503
  
QA results for PR 1393:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16604/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-48849034
  
Hi @CodingCat, looks good to me.
My only doubt, which we discussed earlier, is whether we want to 
differentiate between tasks that have no locations at all and tasks that have 
preferred locations but none of them available.
Currently both of these are held in the same data structure.

@lirui-intel do you have any thoughts on this PR? Since you changed and 
probably tested some of this recently!
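
To make the distinction concrete, here is a rough, hypothetical Scala sketch of keeping the two kinds of tasks in separate structures; the names below do not correspond to the actual TaskSetManager fields.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical bookkeeping, for illustration only.
class PendingTaskBuckets {
  // Tasks whose RDD reports no preferred locations at all
  // (safe to schedule anywhere immediately).
  val noLocationPrefs = new ArrayBuffer[Int]()

  // Tasks that do have preferred locations, but none of those hosts/executors are
  // currently alive (delay scheduling may still pay off if one registers later).
  val unavailableLocationPrefs = new ArrayBuffer[Int]()
}
```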


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1394#issuecomment-48849070
  
QA results for PR 1394:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16605/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2425 Don't kill a still-running Applicat...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1360#issuecomment-48850708
  
QA results for PR 1360:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16606/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48850704
  
QA tests have started for PR 1393. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-13 Thread falaki
Github user falaki commented on the pull request:

https://github.com/apache/spark/pull/1351#issuecomment-48850882
  
This is not a bad idea, especially considering that a file can be split 
across partitions. @marmbrus you suggested this feature. What do you think 
about Reynold's suggestion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-13 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/1395

[SPARK-546] Add full outer join to RDD and DStream.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-546

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1395


commit 0575e2f45ec26eddf7470cc9fc67f52fc53120ff
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

Fix left outer join documentation comments.

commit 217614dbe4540bb8155e3fbb6eae56f70795b28d
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

In JavaPairDStream, make class tag specification in rightOuterJoin 
consistent with other functions.

commit 084b2d5edda993975ae2c1794ef8bf4dea2ae11b
Author: Aaron Staple aaron.sta...@gmail.com
Date:   2014-07-13T18:19:03Z

[SPARK-546] Add full outer join to RDD and DStream.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-546] Add full outer join to RDD and DSt...

2014-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1395#issuecomment-48851025
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-13 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1351#issuecomment-48851059
  
Note that there are multiple problems. We can solve the out-of-memory 
problem by simply limiting the length of a record. Ideally, `csvRDD(RDD[String])` 
should treat each element as one complete record, while `csvFile` should properly parse the 
file itself.
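
As a sketch of what that split could look like (the signatures below are hypothetical, not the PR's API):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Hypothetical entry points, for illustration only.
class CsvImport(sqlContext: SQLContext) {
  // Each input String is assumed to already be one complete record.
  def csvRDD(records: RDD[String], delimiter: Char = ','): SchemaRDD = ???

  // Owns parsing (quoting, records spanning lines) and caps record length to bound memory.
  def csvFile(path: String, delimiter: Char = ',', maxRecordLength: Int = 1 << 20): SchemaRDD = ???
}
```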


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/1396

[SQL] Whitelist more Hive tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark moreTests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1396.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1396


commit ccd8f972e3b7b3e37e11f9bc0cf33edce4267a0d
Author: Michael Armbrust mich...@databricks.com
Date:   2014-07-11T03:16:46Z

Whitelist more tests.

commit 8b6001cedd936eaab707ca3d65ba2afa9ccf2edc
Author: Michael Armbrust mich...@databricks.com
Date:   2014-07-13T20:37:16Z

Add golden files.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1396#issuecomment-48851429
  
QA tests have started for PR 1396. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16608/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-48852700
  
QA results for PR 1393:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Rating(user: Long, product: Long, rating: Double)

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16607/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL] Whitelist more Hive tests.

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1396#issuecomment-48853489
  
QA results for PR 1396:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16608/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48853888
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48854002
  
QA tests have started for PR 1392. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16609/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1392#issuecomment-48855862
  
QA results for PR 1392:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16609/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread miccagiann
Github user miccagiann commented on the pull request:

https://github.com/apache/spark/pull/1311#issuecomment-48855958
  
Hello guys,

I have provided Java examples for the following documentation files:
mllib-clustering.md
mllib-collaborative-filtering.md
mllib-dimensionality-reduction.md
mllib-linear-methods.md
mllib-optimization.md

Enjoy, and do not hesitate to contact me with any remarks or corrections.

Thanks,
Michael


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1377#issuecomment-48857036
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1377#issuecomment-48857093
  
QA tests have started for PR 1377. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16610/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2125] Add sort flag and move sort into ...

2014-07-13 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/1210#issuecomment-48859519
  
Hi Matei, thanks a lot for your review. I will change the code according to 
your comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48859675
  
The code looks good to me. However, I think we can avoid the workaround 
(deserializing with the partition SerDe and then serializing again with the table 
SerDe) that was needed to adapt the higher-level table scan (`TableScanOperator` in 
Shark), which has to provide a single `ObjectInspector` for the downstream 
operators.

Unlike `TableScanOperator`, `HiveTableScan` in Spark's Hive support doesn't rely 
on an `ObjectInspector`, 
and its output type is `GenericMutableRow`, so I think we could do the 
object conversion (from the raw type to a `Row` object) directly.
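
A rough sketch of that "convert directly" idea (not the PR's code; unwrapping each field value to a Catalyst-friendly type via its field ObjectInspector is elided here):

```scala
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow

// Walk the partition-side struct with its ObjectInspector and fill a reusable
// mutable row, so no table-SerDe re-serialization round trip is needed.
def fillRow(rawRow: AnyRef, oi: StructObjectInspector, mutableRow: GenericMutableRow): GenericMutableRow = {
  val fields = oi.getAllStructFieldRefs
  var i = 0
  while (i < fields.size) {
    // In real code each field value would be unwrapped with its field's ObjectInspector.
    mutableRow(i) = oi.getStructFieldData(rawRow, fields.get(i))
    i += 1
  }
  mutableRow
}
```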




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-13 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48859674
  
If we actually want people to get information out of all those numbers, can 
we consider using a human readable format such as `Task(stageId = 1, taskId = 
5, attempt = 0)` or something along those lines?

(Do we actually need both task id and task index in general, by the way?)
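
Purely as an illustration of that suggestion (the names below are invented, not Spark's):

```scala
// Hypothetical identifier bundle with a self-describing toString.
case class TaskLogId(stageId: Int, partition: Int, attempt: Int, taskId: Long) {
  override def toString: String =
    s"Task(stage=$stageId, partition=$partition, attempt=$attempt, TID=$taskId)"
}

// e.g. logInfo(s"Finished $taskLogId in $duration ms on $host") reads unambiguously.
```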


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48859842
  
Also, since the Hive SerDe provides lazy parsing, we need to support 
column pruning when converting the raw object to a `Row`.

Sorry, these are just some high-level comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP]When the executor is thrown OutOfMemoryEr...

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1387#issuecomment-48859861
  
QA tests have started for PR 1387. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16611/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-13 Thread lirui-intel
Github user lirui-intel commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-48859854
  
This looks good to me :)
Just a reminder that when TaskSchedulerImpl calls 
TaskSetManager.resourceOffer, the maxLocality (renamed preferredLocality in 
this PR) doesn't always start at PROCESS_LOCAL; it only iterates 
through the valid levels in the TaskSetManager. That said, we still need 
what this PR provides to delay tasks in the noPref list, because the valid 
levels are not re-computed when a task finishes, so they may still contain, say, 
PROCESS_LOCAL even after all such tasks have run out.

I agree that we may want to separate tasks with no preference at all from tasks 
whose preferred locations are simply not available. I think it's possible (though maybe 
unlikely) for a TaskSetManager to contain both kinds of tasks, e.g. a 
narrow dependency where only some of the parent RDD's partitions are cached.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48860018
  
@chenghao-intel I am not sure I understand your comment on column pruning. 
I think for a Hive table, we should use `ColumnProjectionUtils` to set the needed 
columns. That way, RCFile and ORC can read just the needed columns from HDFS.
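
For reference, a minimal sketch of that pruning hint. The helper names differ across Hive versions (e.g. `appendReadColumnIDs` vs. `appendReadColumns`), so treat this as illustrative rather than the PR's code:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

// Record which column IDs/names downstream readers actually need, so columnar
// formats like RCFile and ORC can skip the rest when scanning HDFS.
def pushDownNeededColumns(conf: Configuration, ids: Seq[Int], names: Seq[String]): Unit = {
  ColumnProjectionUtils.appendReadColumnIDs(conf, ids.map(Int.box).asJava)
  ColumnProjectionUtils.appendReadColumnNames(conf, names.asJava)
}
```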


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14862289
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
 val hconf = broadcastedHiveConf.value.value
 val rowWithPartArr = new Array[Object](2)
-// Map each tuple to a row object
-iter.map { value =>
-  val deserializer = localDeserializer.newInstance()
-  deserializer.initialize(hconf, partProps)
-  val deserializedRow = deserializer.deserialize(value)
-  rowWithPartArr.update(0, deserializedRow)
-  rowWithPartArr.update(1, partValues)
-  rowWithPartArr.asInstanceOf[Object]
+
+val partSerDe = localDeserializer.newInstance()
+val tableSerDe = tableSerDeClass.newInstance()
+partSerDe.initialize(hconf, partProps)
+tableSerDe.initialize(hconf,  tableDesc.getProperties)
+
+val tblConvertedOI = ObjectInspectorConverters.getConvertedOI(
+  partSerDe.getObjectInspector, tableSerDe.getObjectInspector, 
true)
+  .asInstanceOf[StructObjectInspector]
+val partTblObjectInspectorConverter = 
ObjectInspectorConverters.getConverter(
+  partSerDe.getObjectInspector, tblConvertedOI)
+
+// This is done per partition, and unnecessary to put it in the 
iterations (in iter.map).
+rowWithPartArr.update(1, partValues)
+
+// Map each tuple to a row object.
+if 
(partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
+  iter.map { case value =>
+rowWithPartArr.update(0, partSerDe.deserialize(value))
+rowWithPartArr.asInstanceOf[Object]
+  }
+} else {
+  iter.map { case value =>
+val deserializedRow = {
+  // If partition schema does not match table schema, update 
the row to match.
+  val convertedRow =
+
partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
+
+  // If conversion was performed, convertedRow will be a 
standard Object, but if
+  // conversion wasn't necessary, it will still be lazy. We 
can't have both across
+  // partitions, so we serialize and deserialize again to make 
it lazy.
+  if (tableSerDe.isInstanceOf[OrcSerde]) {
+convertedRow
+  } else {
+convertedRow match {
+  case _: LazyStruct => convertedRow
+  case _: HiveColumnarStruct => convertedRow
+  case _ => tableSerDe.deserialize(
+
tableSerDe.asInstanceOf[Serializer].serialize(convertedRow, tblConvertedOI))
--- End diff --

As mentioned by @chenghao-intel, can we avoid it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14862300
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
 val hconf = broadcastedHiveConf.value.value
 val rowWithPartArr = new Array[Object](2)
-// Map each tuple to a row object
-iter.map { value =>
-  val deserializer = localDeserializer.newInstance()
-  deserializer.initialize(hconf, partProps)
-  val deserializedRow = deserializer.deserialize(value)
-  rowWithPartArr.update(0, deserializedRow)
-  rowWithPartArr.update(1, partValues)
-  rowWithPartArr.asInstanceOf[Object]
+
+val partSerDe = localDeserializer.newInstance()
+val tableSerDe = tableSerDeClass.newInstance()
+partSerDe.initialize(hconf, partProps)
+tableSerDe.initialize(hconf,  tableDesc.getProperties)
+
+val tblConvertedOI = ObjectInspectorConverters.getConvertedOI(
+  partSerDe.getObjectInspector, tableSerDe.getObjectInspector, 
true)
+  .asInstanceOf[StructObjectInspector]
+val partTblObjectInspectorConverter = 
ObjectInspectorConverters.getConverter(
+  partSerDe.getObjectInspector, tblConvertedOI)
+
+// This is done per partition, and unnecessary to put it in the 
iterations (in iter.map).
+rowWithPartArr.update(1, partValues)
+
+// Map each tuple to a row object.
+if 
(partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
+  iter.map { case value =>
+rowWithPartArr.update(0, partSerDe.deserialize(value))
+rowWithPartArr.asInstanceOf[Object]
+  }
+} else {
+  iter.map { case value =>
+val deserializedRow = {
+  // If partition schema does not match table schema, update 
the row to match.
+  val convertedRow =
+
partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
+
+  // If conversion was performed, convertedRow will be a 
standard Object, but if
+  // conversion wasn't necessary, it will still be lazy. We 
can't have both across
+  // partitions, so we serialize and deserialize again to make 
it lazy.
+  if (tableSerDe.isInstanceOf[OrcSerde]) {
+convertedRow
--- End diff --

Why do we need to handle ORC separately?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14862338
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
 val hconf = broadcastedHiveConf.value.value
 val rowWithPartArr = new Array[Object](2)
-// Map each tuple to a row object
-iter.map { value =>
-  val deserializer = localDeserializer.newInstance()
-  deserializer.initialize(hconf, partProps)
-  val deserializedRow = deserializer.deserialize(value)
-  rowWithPartArr.update(0, deserializedRow)
-  rowWithPartArr.update(1, partValues)
-  rowWithPartArr.asInstanceOf[Object]
+
+val partSerDe = localDeserializer.newInstance()
+val tableSerDe = tableSerDeClass.newInstance()
+partSerDe.initialize(hconf, partProps)
+tableSerDe.initialize(hconf,  tableDesc.getProperties)
+
+val tblConvertedOI = ObjectInspectorConverters.getConvertedOI(
+  partSerDe.getObjectInspector, tableSerDe.getObjectInspector, 
true)
+  .asInstanceOf[StructObjectInspector]
+val partTblObjectInspectorConverter = 
ObjectInspectorConverters.getConverter(
+  partSerDe.getObjectInspector, tblConvertedOI)
+
+// This is done per partition, and unnecessary to put it in the 
iterations (in iter.map).
+rowWithPartArr.update(1, partValues)
+
+// Map each tuple to a row object.
+if 
(partTblObjectInspectorConverter.isInstanceOf[IdentityConverter]) {
+  iter.map { case value =>
+rowWithPartArr.update(0, partSerDe.deserialize(value))
+rowWithPartArr.asInstanceOf[Object]
+  }
+} else {
+  iter.map { case value =>
+val deserializedRow = {
+  // If partition schema does not match table schema, update 
the row to match.
+  val convertedRow =
+
partTblObjectInspectorConverter.convert(partSerDe.deserialize(value))
+
+  // If conversion was performed, convertedRow will be a 
standard Object, but if
+  // conversion wasn't necessary, it will still be lazy. We 
can't have both across
+  // partitions, so we serialize and deserialize again to make 
it lazy.
+  if (tableSerDe.isInstanceOf[OrcSerde]) {
+convertedRow
+  } else {
+convertedRow match {
--- End diff --

I think we should add a comment explaining why this pattern matching is needed. Also, 
why do we handle `LazyStruct` and `ColumnarStruct` specially? There are similar 
classes, e.g. `LazyBinaryStruct` and `LazyBinaryColumnarStruct`.
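
For illustration, one way the match could cover the binary variants as well. The class names come from Hive's serde2 package and their exact availability depends on the Hive version, so treat this as a sketch, not the PR's code:

```scala
import org.apache.hadoop.hive.serde2.{Deserializer, Serializer}
import org.apache.hadoop.hive.serde2.columnar.{ColumnarStruct, LazyBinaryColumnarStruct}
import org.apache.hadoop.hive.serde2.lazy.LazyStruct
import org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector

// Keep the row as-is when it is already a lazy/columnar struct; otherwise round-trip
// it through the table SerDe so downstream consumers see a consistent representation.
def keepLazyOrReserialize(
    convertedRow: AnyRef,
    tableSerDe: Deserializer with Serializer,
    tblConvertedOI: StructObjectInspector): AnyRef = convertedRow match {
  case _: LazyStruct | _: LazyBinaryStruct | _: ColumnarStruct | _: LazyBinaryColumnarStruct =>
    convertedRow
  case other =>
    tableSerDe.deserialize(tableSerDe.serialize(other, tblConvertedOI))
}
```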


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1394#issuecomment-48860407
  
@srowen This looks good to me and thank you for updating the docs as well!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2363. Clean MLlib's sample data files

2014-07-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1394


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1377#issuecomment-48860618
  
QA results for PR 1377:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16610/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: remove not used test in src/main

2014-07-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1397#issuecomment-48861087
  
QA tests have started for PR 1397. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16612/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/1385#issuecomment-48861711
  
@rxin and @aarondav, yeah, the master branch deadlocks; it seems the locks from 
#1273 and HADOOP-10456 lead to the problem. When running a Hive self-join query --- 
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = 
t2.a)") --- the program gets stuck.

I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; 
that way, we do not need the lock in 
#1273 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1311#discussion_r14862906
  
--- Diff: docs/mllib-clustering.md ---
@@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE)
 All of MLlib's methods use Java-friendly types, so you can import and call 
them there the same
 way you do in Scala. The only caveat is that the methods take Scala RDD 
objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD 
to a Scala one by
-calling `.rdd()` on your `JavaRDD` object.
+calling `.rdd()` on your `JavaRDD` object. A standalone application example
+that is equivalent to the provided example in Scala is given bellow:
+
+{% highlight java %}
+import org.apache.spark.api.java.*;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.KMeans;
+import org.apache.spark.mllib.clustering.KMeansModel;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.linalg.Vector;
+
+public class KMeansExample {
+  public static void main(String[] args) {
+SparkConf conf = new SparkConf().setAppName("K-means Example");
+JavaSparkContext sc = new JavaSparkContext(conf);
+
+// Load and parse data
+String path = "{SPARK_HOME}/data/kmeans_data.txt";
--- End diff --

We moved all data files to `data/mllib` in #1394 . Could you merge the 
current master and update the path here and in other places? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1311#discussion_r14862907
  
--- Diff: docs/mllib-clustering.md ---
@@ -69,7 +69,54 @@ println("Within Set Sum of Squared Errors = " + WSSSE)
 All of MLlib's methods use Java-friendly types, so you can import and call 
them there the same
 way you do in Scala. The only caveat is that the methods take Scala RDD 
objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD 
to a Scala one by
-calling `.rdd()` on your `JavaRDD` object.
+calling `.rdd()` on your `JavaRDD` object. A standalone application example
+that is equivalent to the provided example in Scala is given bellow:
+
+{% highlight java %}
+import org.apache.spark.api.java.*;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.KMeans;
+import org.apache.spark.mllib.clustering.KMeansModel;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.linalg.Vector;
+
+public class KMeansExample {
+  public static void main(String[] args) {
+SparkConf conf = new SparkConf().setAppName("K-means Example");
+JavaSparkContext sc = new JavaSparkContext(conf);
+
+// Load and parse data
+String path = "{SPARK_HOME}/data/kmeans_data.txt";
+JavaRDD<String> data = sc.textFile(path);
+JavaRDD<Vector> parsedData = data.map(
+  new Function<String, Vector>() {
+public Vector call(String s) {
+  String[] sarray = s.split(" ");
+  double[] values = new double[sarray.length];
+  for (int i = 0; i < sarray.length; i++)
+values[i] = Double.parseDouble(sarray[i]);
+  return Vectors.dense(values);
+}
+  }
+);
+
+// Cluster the data into two classes using KMeans
+int numClusters = 2;
+int numIterations = 20;
+KMeansModel clusters = KMeans.train(JavaRDD.toRDD(parsedData), 
numClusters, numIterations);
--- End diff --

`JavaRDD.toRDD(parsedData)` -> `parsedData.rdd()` (simpler)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1945][MLLIB] Documentation Improvements...

2014-07-13 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1311#discussion_r14862911
  
--- Diff: docs/mllib-collaborative-filtering.md ---
@@ -99,7 +99,88 @@ val model = ALS.trainImplicit(ratings, rank, 
numIterations, alpha)
 All of MLlib's methods use Java-friendly types, so you can import and call 
them there the same
 way you do in Scala. The only caveat is that the methods take Scala RDD 
objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD 
to a Scala one by
-calling `.rdd()` on your `JavaRDD` object.
+calling `.rdd()` on your `JavaRDD` object. A standalone application example
+that is equivalent to the provided example in Scala is given bellow:
+
+{% highlight java %}
+import scala.Product;
+import scala.Tuple2;
+
+import org.apache.spark.api.java.*;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.recommendation.ALS;
+import org.apache.spark.mllib.recommendation.Rating;
+import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
+
+public class CollaborativeFiltering {
+  public static void main(String[] args) {
+SparkConf conf = new SparkConf().setAppName("Collaborative Filtering Example");
+JavaSparkContext sc = new JavaSparkContext(conf);
+
+// Load and parse the data
+String path = "/home/michael/workspace/spark/mllib/data/als/test.data";
--- End diff --

Please use a relative path.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

