[GitHub] spark pull request: [yarn]The method has a never used parameter

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1761#issuecomment-51020215
  
In general, I'm wary of merging this type of very-small-scale code cleanup; 
I don't think this makes the code any easier to understand and it may cause 
merge conflicts down the road, creating maintenance hassles for us (see my 
comments on #1728).  Other committers may disagree with me, so feel free to 
chime in.





[GitHub] spark pull request: SPARK-2638 MapOutputTracker concurrency improv...

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1542#issuecomment-51020347
  
Since this same locking pattern occurs at several places in the code, I 
think it might make sense to abstract it behind a function or macro, which 
would give us a centralized place to experiment with different synchronization 
/ locking strategies.
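For illustration, a minimal sketch of what such a helper could look like (the `withLock` name and shape are hypothetical, not code from the PR):

```scala
object Locking {
  // Hypothetical helper: every call site goes through here, so the
  // synchronization strategy can be swapped (e.g. for a ReentrantLock)
  // in one place instead of at each caller.
  def withLock[T](monitor: AnyRef)(body: => T): T = monitor.synchronized {
    body
  }
}

// Example call site:
// Locking.withLock(mapStatuses) { mapStatuses.get(shuffleId) }
```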





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51020440
  
QA results for PR 1719:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class Word2Vec(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17842/consoleFull





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51020596
  
This looks okay, but I still wonder whether there's a simpler approach.  
Have you looked at how [dill](https://github.com/uqfoundation/dill) handles 
namedtuples?





[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51020658
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51020851
  
It's easy to extend pickle to support namedtuple; cloudpickle and dill have done it this way, but they are slow. We want to use cPickle for datasets, so it should be fast by default. I haven't found a way to extend cPickle. Do you have any ideas?





[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51020991
  
QA tests have started for PR 1758. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17845/consoleFull





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51021254
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2817] add show create table support

2014-08-04 Thread tianyi
Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/1760#issuecomment-51021277
  
@chenghao-intel are these files all right?





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51021399
  
Here's another (contrived) example that breaks:

```python
from collections import namedtuple as nt
from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

sc = SparkContext("local")
p = PickleSerializer()

Person = nt("Person", 'id firstName lastName')
jon = Person(1, "Jon", "Doe")
sc.textFile("/usr/share/dict/words").map(lambda x: jon).first()
```

It looks like the problem here is that line 306 assumes that old references 
will be named `namedtuple`, which isn't true if I import it under a different 
name.





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51021526
  
QA tests have started for PR 1719. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17846/consoleFull





[GitHub] spark pull request: [SPARK-1812] Enable cross build for scala 2.11...

2014-08-04 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/996#discussion_r15741984
  
--- Diff: assembly/pom.xml ---
@@ -26,7 +26,7 @@
   </parent>

   <groupId>org.apache.spark</groupId>
-  <artifactId>spark-assembly_2.10</artifactId>
+  <artifactId>spark-assembly_${scala.binary.version}</artifactId>
--- End diff --

Maintaining a separate branch is difficult and sometimes requires non-trivial merges to keep it in sync.
I don't really know what an archetype build that generates pom(s) for different Scala versions would mean.
I feel we can have some release scripts that take care of releasing the Maven POMs with the expression in the artifactId replaced by its corresponding constant. sbt does this sort of thing natively. I think this is much better than complicating the build further, because once we have an archetype we will have to adjust sbt a great deal to accommodate it.





[GitHub] spark pull request: SPARK-2638 MapOutputTracker concurrency improv...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1542#issuecomment-51022369
  
Thanks for commenting, Josh. I will see about putting together something on this, including solid test cases. ETA: later in the coming week.





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51022848
  
Yes, it's easy to break it.

Having a solution that works in 99% of cases is better than no solution, or a much slower solution that works in 100% of cases.





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51023016
  
QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class Word2Vec(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17843/consoleFull





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51023029
  
This feature is not a blocker, because we prefer to use Row() instead of namedtuple for inferSchema().

If users really want to use namedtuple or customized classes in __main__, they can use cloudpickle.





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

2014-08-04 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1484#issuecomment-51023586
  
@mengxr Could you review or comment on this? Thanks!





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51023600
  
LGTM. Merged into both master and branch-1.1. @Ishiihara Thanks a lot for 
implementing word2vec! Please help improve its performance during the QA 
period. One task left is Java support. If you want to spend some time on it, 
there are some examples in `HashingTF.scala` and `JavaTfIdfSuite.java`.





[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-08-04 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r15742427
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -233,4 +235,60 @@ object MLUtils {
     }
     sqDist
   }
+
+  /**
+   * Load corpus from a given path. Terms and documents will be translated into integers,
+   * with a term-integer map and a document-integer map.
+   *
+   * @param dir The path of corpus.
+   * @param dirStopWords The path of stop words.
+   * @return (RDD[Document], Term-integer map, doc-integer map)
+   */
+  def loadCorpus(
+      sc: SparkContext,
+      dir: String,
+      minSplits: Int,
+      dirStopWords: String = ""):
+      (RDD[Document], Index[String], Index[String]) = {
+
+    // Containers and indexers for terms and documents
+    val termMap = Index[String]()
+    val docMap = Index[String]()
+
+    val stopWords =
+      if (dirStopWords == "") {
+        Set.empty[String]
+      } else {
+        sc.textFile(dirStopWords, minSplits).
+          map(x => x.replaceAll("(?m)\\s+$", "")).distinct().collect().toSet
+      }
+    val broadcastStopWord = sc.broadcast(stopWords)
+
+    // Tokenize and filter terms
+    val almostData = sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
+      val tokens = JavaWordTokenizer(content)
--- End diff --

We should allow users to customize here. We can add a parameter `tokenizer: (String) => Iterable[String]` to `loadCorpus`, so that `dirStopWords` is no longer required.
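For illustration, a sketch of that suggested signature change (the `CorpusLoader` wrapper and the simplified return type are hypothetical, not code from the PR):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CorpusLoader {
  // The caller supplies the tokenizer, so stop-word filtering can live
  // inside it and the dirStopWords parameter is no longer needed.
  def loadCorpus(
      sc: SparkContext,
      dir: String,
      minSplits: Int,
      tokenizer: String => Iterable[String]): RDD[(String, Iterable[String])] = {
    sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
      (fileName, tokenizer(content))
    }
  }
}
```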





[GitHub] spark pull request: Remove support for waiting for executors in st...

2014-08-04 Thread kayousterhout
GitHub user kayousterhout opened a pull request:

https://github.com/apache/spark/pull/1762

Remove support for waiting for executors in standalone mode.

Current code waits until some minimum fraction of expected executors
have registered before beginning scheduling.  The current code in
standalone mode suffers from a race condition (SPARK-2635). This
race condition could be fixed, but this functionality is easily
achieved by the user (they can use the storage status to determine
how many executors are up, as described by @pwendell in #1462)
so adding the extra complexity to the scheduler code may not be worthwhile.

This commit removes the functionality in standalone mode but not for
YARN -- where it is more necessary and the number of expected executors
is well-defined.

This PR is a POC; if the powers-that-be determine that this is what we 
should
do, I will file a JIRA.

This should be backported into 1.1 if committed.
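For reference, the user-side check mentioned above might look roughly like this (a sketch assuming the 1.x `SparkContext.getExecutorStorageStatus` developer API, whose result includes the driver; `waitForExecutors` and the polling interval are hypothetical):

```scala
import org.apache.spark.SparkContext

object ExecutorWait {
  // Poll the executor storage status until enough executors have
  // registered block managers; subtract one for the driver's own entry.
  def waitForExecutors(sc: SparkContext, minExecutors: Int): Unit = {
    while (sc.getExecutorStorageStatus.length - 1 < minExecutors) {
      Thread.sleep(100)
    }
  }
}
```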

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kayousterhout/spark-1 remove_executor_wait

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1762.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1762


commit fa746ed65a8ac685edd33c79159521398d99aa69
Author: Kay Ousterhout kayousterh...@gmail.com
Date:   2014-08-04T06:59:05Z

Remove support for waiting for executors in standalone mode.

Current code waits until some minimum fraction of expected executors
have registered before beginning scheduling.  The current code in
standalone mode suffers from a race condition (SPARK-2635). This
race condition could be fixed, but this functionality is easily
achieved by the user (they can use the storage status to determine
how many executors are up, as described by @pwendell in #1462)
so adding the extra complexity to the scheduler code is not worthwhile.







[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...

2014-08-04 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1462#issuecomment-51024346
  
@pwendell I created https://github.com/apache/spark/pull/1762 for your 
judgment of what the right thing to do here is!





[GitHub] spark pull request: Remove support for waiting for executors in st...

2014-08-04 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1762#discussion_r15742607
  
--- Diff: 
yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala
 ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster
+
+import akka.actor.ActorSystem
+
+import org.apache.spark.SparkContext
+import org.apache.spark.scheduler.TaskSchedulerImpl
+
+/**
+ * Subclass of CoarseGrainedSchedulerBackend that handles waiting until 
sufficient resources
+ * are registered before beginning to schedule tasks (needed by both Yarn 
scheduler backends).
+  */
+private[spark] class YarnSchedulerBackend(
+scheduler: TaskSchedulerImpl,
+actorSystem: ActorSystem)
+  extends CoarseGrainedSchedulerBackend(scheduler, actorSystem) {
+
+  // Submit tasks only after (registered executors / total expected executors)
+  // is equal to at least this value (expected to be a double between 0 and 1, inclusive).
+  var minRegisteredRatio =
+    conf.getDouble("spark.scheduler.minRegisteredExecutorsRatio", 0.8)
+  if (minRegisteredRatio > 1) minRegisteredRatio = 1
+  // Regardless of whether the required number of executors have registered, return true from
+  // isReady() after this amount of time has elapsed.
+  val maxRegisteredWaitingTime =
+    conf.getInt("spark.scheduler.maxRegisteredExecutorsWaitingTime", 3)
+  private val createTime = System.currentTimeMillis()
+
+  var totalExpectedExecutors: Int = _
+
+  override def isReady(): Boolean = {
--- End diff --

Realized I should fix this to only print the log message once -- will do 
this if we decide to use this PR





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1719





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

2014-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1484#issuecomment-51024568
  
Sure. We had some transformers implemented under `mllib.feature`, similar 
to sk-learn's approach. For feature selection, we can follow the same approach 
if we view feature selection as transformation: 1) fit a dataset and select a 
subset of features, 2) transform a dataset by picking out selected features. So 
for the API, I suggest the following

~~~
class ChiSquaredFeatureSelector(numFeatures: Int) extends Serializable {
  def fit(dataset: RDD[LabeledPoint]): this.type
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
} 
~~~

and we can hide the implementation from public interfaces. Please let me 
know whether this sounds good to you.
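For illustration, usage of the proposed API could look like this (a hypothetical sketch, assuming a `trainingData: RDD[LabeledPoint]`):

~~~
val selector = new ChiSquaredFeatureSelector(numFeatures = 50)
// fit returns this.type, so fit and transform can be chained:
val reduced = selector.fit(trainingData).transform(trainingData)
~~~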





[GitHub] spark pull request: fix GraphX EdgeRDD zipPartitions

2014-08-04 Thread luluorta
GitHub user luluorta opened a pull request:

https://github.com/apache/spark/pull/1763

fix GraphX EdgeRDD zipPartitions

If users set `spark.default.parallelism` and the value is different from the EdgeRDD partition number, GraphX jobs will throw:
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
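For context, the same failure can be reproduced with plain RDDs (a minimal sketch, not code from this PR):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "zip-demo")
val a = sc.parallelize(1 to 10, numSlices = 2)
val b = sc.parallelize(1 to 10, numSlices = 3)
// Throws java.lang.IllegalArgumentException:
// Can't zip RDDs with unequal numbers of partitions
a.zipPartitions(b) { (i1, i2) => i1.zip(i2) }.collect()
```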

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/luluorta/spark fix-graph-zip

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1763


commit 83389614959fb2c84b947362af1e0babbfe767d5
Author: luluorta luluo...@gmail.com
Date:   2014-08-04T07:03:17Z

fix GraphX EdgeRDD zipPartitions







[GitHub] spark pull request: [SPARK-2823]fix GraphX EdgeRDD zipPartitions

2014-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1763#issuecomment-51024735
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51024718
  
@liancheng why not just have the jars listed on the classpath in the order they are given to us? This is also how classpaths work in general: when I run a java command, I don't give a special flag for the first element in the classpath, I just put it first.





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51024798
  
@andrewor14 I still don't understand how this is different. Basically, the JVM works such that you put a set of jars in order (indicating precedence), and then you can refer to a class defined in any of the jars. Why should we differ from the normal JVM semantics and have a special name for the first jar in the classpath?





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-04 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1481#discussion_r15742765
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -29,7 +29,7 @@ import akka.actor.{ActorSystem, Cancellable, Props}
 import sun.nio.ch.DirectBuffer
 
 import org.apache.spark._
-import org.apache.spark.executor.{DataReadMethod, InputMetrics}
+import org.apache.spark.executor.{ShuffleWriteMetrics, DataReadMethod, 
InputMetrics}
--- End diff --

alphabetization (here and elsewhere as well)





[GitHub] spark pull request: [SPARK-1687] [PySpark] pickable namedtuple

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1623#issuecomment-51024991
  
I found another technique that may be more robust to `namedtuple` being 
accessible under different names.  We can replace `namedtuple`'s code object at 
runtime in order to interpose on calls to it:

```python
import types
def copy_func(f, name=None):  # See http://stackoverflow.com/a/6528148/590203
    return types.FunctionType(f.func_code, f.func_globals, name or f.func_name,
                              f.func_defaults, f.func_closure)

from collections import namedtuple
namedtuple._old_namedtuple = copy_func(namedtuple)
def wrapped(*args, **kwargs):
    print "Called the wrapped function!"
    return namedtuple._old_namedtuple(*args, **kwargs)
namedtuple.func_code = wrapped.func_code

print namedtuple("Person", "name age")
```

This prints

```
Called the wrapped function!
<class 'collections.Person'>
```





[GitHub] spark pull request: [MLlib] [SPARK-2510]Word2Vec: Distributed Repr...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1719#issuecomment-51025157
  
QA results for PR 1719:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class Word2Vec(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17846/consoleFull





[GitHub] spark pull request: [SPARK-2817] add show create table support

2014-08-04 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1760#issuecomment-51025307
  
Jenkins, test this please.





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-04 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1481#issuecomment-51025499
  
Updated patch addresses @pwendell's and @kayousterhout's comments and adds tests.





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1481#issuecomment-51025769
  
QA tests have started for PR 1481. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17848/consoleFull





[GitHub] spark pull request: [SPARK-1812] Enable cross build for scala 2.11...

2014-08-04 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/996#discussion_r15743162
  
--- Diff: assembly/pom.xml ---
@@ -26,7 +26,7 @@
   </parent>

   <groupId>org.apache.spark</groupId>
-  <artifactId>spark-assembly_2.10</artifactId>
+  <artifactId>spark-assembly_${scala.binary.version}</artifactId>
--- End diff --

I won't claim to be a maven-archetype-plugin expert, or to have figured out 
everything that we would need to do to start using archetyped builds (e.g. I 
haven't got sub-project builds with archetypes figured out yet), but the basics 
of archetypes are pretty simple to use and do get us a significant part of the 
way to cross-building Spark.

To see how the rudiments work, you should be able to quickly and easily do 
the following:

1) Clone the spark repo
2) cd to spark/core
3) mvn archetype:create-from-project
4) cd into the new target/generated-sources/archetype
5) mvn install
6) make a tmp dir someplace and cd into it
7) mvn archetype:generate -DarchetypeCatalog=local 
-DgroupId=org.apache.spark -DartifactId=spark-core_2.11 -Dversion=2.0.0-SNAPSHOT
8) select '1', the only archetype that you now have installed locally
9) observe that you now have a spark-core tree ready to build 
spark-core_2.11-2.0.0-SNAPSHOT (except for the spark-parent reference -- I told 
you, I haven't got sub-projects figured out yet.)





[GitHub] spark pull request: [SPARK-2555] Support configuration spark.sched...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1462#issuecomment-51026034
  
Okay let me run it by some more people tomorrow and figure it out.





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1764#discussion_r15743214
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala ---
@@ -266,7 +266,8 @@ package object dsl {
 
   object plans {  // scalastyle:ignore
 implicit class DslLogicalPlan(val logicalPlan: LogicalPlan) extends 
LogicalPlanFunctions {
-  def writeToFile(path: String) = WriteToFile(path, logicalPlan)
+// Writing to files doesn't make sense without Hadoop in scope (unless 
we assume local fs)
+//  def writeToFile(path: String) = WriteToFile(path, classOf[??], 
logicalPlan)
--- End diff --

Not sure what to do with this; it doesn't seem to make a lot of sense in the catalyst package unless catalyst wants to depend on Hadoop.





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/1764

[SPARK-2824/2825][SQL] Work towards separating data location from format

Currently, there is a fundamental assumption in SparkSQL that a Parquet
table is stored at a certain Hadoop path and that a Metastore table is
stored within the Hive warehouse. However, the fact that a table is
Parquet or serialized as an object file is independent of where the data
is actually located.

This patch attempts to work towards creating a cleaner separation between
where the data is located and the format the data is in by introducing
two concepts: a TableFormat and a TableLocation. This abstraction
enables code like the following:

```scala
val myTable = // ...
myTable.saveAsTable("myTable", classOf[ParquetFormat])
hql("SELECT * FROM myTable").collect // reads from Parquet!

// Also allows expansion of file-writing later:
myTable.saveAsFile("/my/file", classOf[ParquetFormat])
```

Additionally, this allows us to trivially support external tables with 
arbitrary formats.

However, this PR doesn't attempt to make any radical changes. Parquet files 
still only
support being written to a single Hadoop directory, but this can be part of 
a Hive table or
a normal directory. The MetastoreRelation still requires living within the 
Metastore because
it relies heavily on the metadata there. The hope of this patch is that it 
enables the two linked
features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824) 
and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while 
adding a useful abstraction for the future.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark hive

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1764


commit b18a4ae3c9bbe8f917ccc7e9aeb1ece25d54bc46
Author: Aaron Davidson aa...@databricks.com
Date:   2014-08-02T19:55:41Z

[SPARK-2824/2825][SQL] Work towards separating data location from format

Currently, there is a fundamental assumption in SparkSQL that a Parquet
table is stored at a certain Hadoop path and that a Metastore table is
stored within the Hive warehouse. However, the fact that a table is
Parquet or serialized as an object file is independent of where the data
is actually located.

This patch attempts to work towards creating a cleaner separation between
where the data is located and the format the data is in by introducing
two concepts: a TableFormat and a TableLocation. This abstraction
enables code like the following:

```scala
val myTable = // ...
myTable.saveAsTable("myTable", classOf[ParquetFormat])
hql("SELECT * FROM myTable").collect // reads from Parquet!

// Also allows expansion of file-writing later:
myTable.saveAsFile("/my/file", classOf[ParquetFormat])
```

Additionally, this allows us to trivially support external tables
with arbitrary formats.

However, this PR doesn't attempt to make any radical changes. Parquet files 
still only
support being written to a single Hadoop directory, but this can be part of 
a Hive table or
a normal directory. The MetastoreRelation still requires living within the 
Metastore because
it relies heavily on the metadata there. The hope of this patch is that it 
enables the two linked
features ([SPARK-2824](https://issues.apache.org/jira/browse/SPARK-2824) 
and [SPARK-2825](https://issues.apache.org/jira/browse/SPARK-2825)) while 
adding a useful abstraction for the
future.







[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1764#discussion_r15743237
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/TableFormat.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.physical
--- End diff --

Not certain what the right place for this class is...





[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1481#issuecomment-51026321
  
Cool - thanks Sandy. Let's see if tests pass. Can likely merge this 
tomorrow and fix any remaining issues (if they exist).





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1764#discussion_r15743336
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -353,15 +356,14 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
* in the parent directory. If so, this is used. Else we read the actual 
footer at the given
* location.
* @param origPath The path at which we expect one (or more) Parquet 
files.
-   * @param configuration The Hadoop configuration to use.
+   * @param conf The Hadoop configuration to use.
* @return The `ParquetMetadata` containing among other things the 
schema.
*/
-  def readMetaData(origPath: Path, configuration: Option[Configuration]): 
ParquetMetadata = {
+  def readMetaData(origPath: Path, conf: Configuration): ParquetMetadata = 
{
 if (origPath == null) {
   throw new IllegalArgumentException("Unable to read Parquet metadata: path is null")
 }
 val job = new Job()
-val conf = configuration.getOrElse(ContextUtil.getConfiguration(job))
--- End diff --

I wanted to get rid of the optional configuration, but perhaps I should put 
this back. Making a new Job and then asking for a Configuration doesn't seem 
like it'd be more useful than just constructing a new Configuration, though, at 
least in terms of the properties set.





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1764#issuecomment-51026425
  
QA tests have started for PR 1764. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17849/consoleFull





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1764#issuecomment-51026471
  
QA results for PR 1764:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class ParquetFormat(sqlContext: SQLContext, conf: Configuration) extends TableFormat {
  class MetastoreFormat(hiveContext: HiveContext) extends TableFormat {
  case class HiveTableLocation(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17849/consoleFull





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1764#issuecomment-51026733
  
QA tests have started for PR 1764. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17850/consoleFull





[GitHub] spark pull request: [SPARK-2824/2825][SQL] Work towards separating...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1764#issuecomment-51026778
  
QA results for PR 1764:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class ParquetFormat(sqlContext: SQLContext, conf: Configuration) extends TableFormat {
  class MetastoreFormat(hiveContext: HiveContext) extends TableFormat {
  case class HiveTableLocation(
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17850/consoleFull





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1754#issuecomment-51027051
  
I think this is an intermediate YARN version that is different from both the yarn-alpha and yarn-stable APIs. @witgo what if you apply the patch here - does it work?

https://github.com/apache/spark/pull/151/files

In terms of merging this - I'm not sure we want to support 3 different YARN APIs in the build, and I think CDH itself said that this YARN version was not stable/supported, so I'm not sure we want to merge that patch. I am curious, though, whether it fixes it for you.





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1754#issuecomment-51028168
  
This PR makes sbt's behavior consistent with the [building-with-maven.md](https://github.com/apache/spark/blob/master/docs/building-with-maven.md) description:

| YARN version    | Profile required |
|-----------------|------------------|
| 0.23.x to 2.1.x | yarn-alpha       |
| 2.2.x and later | yarn             |





[GitHub] spark pull request: [SPARK-1779] add warning when memoryFraction i...

2014-08-04 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/714#issuecomment-51029448
  
hey @andrewor14, I cannot see the FAILED unit tests info, so I do not know how to resolve it. Can you help me?





[GitHub] spark pull request: Optionally parallelize the Spark build.

2014-08-04 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1752#issuecomment-51030127
  
@pwendell Ah, so that's what's causing it. Yes, fix forward by all means, but can this be disabled until then? It looks like about half or more of all test runs are failing spuriously, which just means they have to be run 2-3 times. It's now slower to get to a passing test suite when the tests really do pass. In a way, parallelizing single builds is less prone to this conflict.





[GitHub] spark pull request: [SQL] [SPARK-2826] Reduce the memory copy whil...

2014-08-04 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/1765

[SQL] [SPARK-2826] Reduce the memory copy while building the hashmap for 
HashOuterJoin

This is a follow-up to #1147; this PR improves performance by about 10% - 15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([300] records)
LeftOuterJoin: took 15179 ms ([300] records)
RightOuterJoin: took 15515 ms ([300] records)
RightOuterJoin: took 15276 ms ([300] records)
FullOuterJoin: took 19150 ms ([600] records)
FullOuterJoin: took 18935 ms ([600] records)

After:
LeftOuterJoin: took 15218 ms ([300] records)
LeftOuterJoin: took 13503 ms ([300] records)
RightOuterJoin: took 13663 ms ([300] records)
RightOuterJoin: took 14025 ms ([300] records)
FullOuterJoin: took 16624 ms ([600] records)
FullOuterJoin: took 16578 ms ([600] records)
```

Besides the performance improvement, I also did some cleanup as suggested in #1147.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark hash_outer_join_fixing

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1765.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1765


commit ab1f9e0cce75b875f2bacefe3972c7fbe6fc898c
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-08-04T07:51:04Z

Reduce the memory copy while building the hashmap







[GitHub] spark pull request: [SQL] [SPARK-2826] Reduce the memory copy whil...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1765#issuecomment-51030910
  
QA tests have started for PR 1765. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17851/consoleFull





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1754#discussion_r15745620
  
--- Diff: project/SparkBuild.scala ---
@@ -71,7 +71,7 @@ object SparkBuild extends PomBuild {
 }
     Properties.envOrNone("SPARK_HADOOP_VERSION") match {
       case Some(v) =>
-        if (v.matches("0.23.*")) isAlphaYarn = true
+        if ("^2\\.[2-9]+".r.findFirstIn(v) == None) isAlphaYarn = true
--- End diff --

If there is ever a Hadoop 2.10, this pattern would consider it a YARN alpha version. Better to match against `2\\.[01]\\.[0-9]+` or something. This won't address the actual issue in the OP, though. It looks like a bit of cleanup, but in a deprecated code path: it would let Hadoop 2.0 / 2.1 be selected, but they aren't actually supported.
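To make the Hadoop 2.10 concern concrete, a quick illustrative check (not code from the PR):

```
// Illustrative only: the proposed stable-YARN pattern misclassifies a
// hypothetical Hadoop 2.10, because [2-9]+ cannot match the '1' in "10".
val stable = "^2\\.[2-9]+".r
stable.findFirstIn("2.2.0")   // Some("2.2")  -> treated as yarn-stable
stable.findFirstIn("2.10.0")  // None         -> wrongly treated as yarn-alpha

// Matching the alpha range explicitly avoids the problem:
val alpha = "^(0\\.23|2\\.[01])\\.".r
alpha.findFirstIn("2.0.5")    // Some("2.0.") -> yarn-alpha
alpha.findFirstIn("2.10.0")   // None         -> correctly treated as stable
```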





[GitHub] spark pull request: [SPARK-1986][GraphX]move lib.Analytics to org....

2014-08-04 Thread larryxiao
GitHub user larryxiao opened a pull request:

https://github.com/apache/spark/pull/1766

[SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples

to support ~/spark/bin/run-example GraphXAnalytics triangles
/soc-LiveJournal1.txt --numEPart=256

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/larryxiao/spark 1986

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1766.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1766


commit c6ac09211b886908fbb665e5763fc01f8fa140f3
Author: Larry Xiao xia...@sjtu.edu.cn
Date:   2014-08-04T08:57:34Z

[SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples

to support ~/spark/bin/run-example GraphXAnalytics triangles
/soc-LiveJournal1.txt --numEPart=256







[GitHub] spark pull request: [SPARK-1986][GraphX]move lib.Analytics to org....

2014-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1766#issuecomment-51032872
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

2014-08-04 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1484#issuecomment-51033011
  
@mengxr
1.  Do I understand correctly that you propose that `fit(dataset: RDD[LabeledPoint])` should compute feature scores according to the feature selection algorithm, and `transform(dataset: RDD[LabeledPoint])` should return the filtered dataset?
2.  It seems that such an interface allows misuse when someone calls `transform` before `fit`. In some sense it is similar to calling `predict` before actually learning the model. This is avoided in MLlib classification models by means of the `ClassificationModel` interface, which has `predict` only; each individual classifier has a companion object that returns its instance (and does the training as well). I like this approach more because it is less error-prone from the user's perspective, but it is a little implicit from the developer's perspective (you need to know that you need to implement a factory). Long story short, why not seal `fit` inside the constructor or inside the companion object?
```
trait FeatureSelector extends Serializable {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}

// EITHER: fit inside the constructor
class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int)
    extends FeatureSelector {
  // perform chi-squared computations here...
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}

// OR (like in classification models): fit inside a companion object
class ChiSquaredFeatureSelector extends FeatureSelector {
  private def fit(dataset: RDD[LabeledPoint], numFeatures: Int): Unit = ???
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}

object ChiSquaredFeatureSelector {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector = {
    val chi = new ChiSquaredFeatureSelector
    chi.fit(dataset, numFeatures)
    chi
  }
}
```
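A hypothetical call site for the factory variant, just to make the intended usage concrete (`trainingData` is assumed to be an `RDD[LabeledPoint]`):

```
// Hypothetical usage of the factory variant sketched above.
val selector = ChiSquaredFeatureSelector.fit(trainingData, numFeatures = 100)
val filtered = selector.transform(trainingData) // keeps only the selected features
```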





[GitHub] spark pull request: [SPARK-2827][GraphX]Add degree distribution op...

2014-08-04 Thread luluorta
GitHub user luluorta opened a pull request:

https://github.com/apache/spark/pull/1767

[SPARK-2827][GraphX]Add degree distribution operators in GraphOps for GraphX



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/luluorta/spark graphx-degree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1767.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1767


commit d1e053f3c7a95272676edfad485a31f69290effd
Author: luluorta luluo...@gmail.com
Date:   2014-08-04T08:42:28Z

add max/min degree

commit 1c35298bfd3bea5b8eeba6bb4804b3fe74ff7fd9
Author: luluorta luluo...@gmail.com
Date:   2014-08-04T08:56:29Z

add  degree distribution







[GitHub] spark pull request: [SPARK-2827][GraphX]Add degree distribution op...

2014-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1767#issuecomment-51033669
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1754#issuecomment-51033818
  
@pwendell #151 fails to compile. There seems to be an infinite loop:

`SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly`
```
java.lang.StackOverflowError
	at scala.reflect.internal.HasFlags$class.isSynthetic(HasFlags.scala:115)
	at scala.reflect.internal.Symbols$Symbol.isSynthetic(Symbols.scala:112)
	at xsbt.ExtractUsedNames.eligibleAsUsedName(ExtractUsedNames.scala:121)
	at xsbt.ExtractUsedNames.handleClassicTreeNode$1(ExtractUsedNames.scala:87)
	at xsbt.ExtractUsedNames.xsbt$ExtractUsedNames$$handleTreeNode$1(ExtractUsedNames.scala:97)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1489)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1487)
	at scala.reflect.internal.Trees$class.itraverse(Trees.scala:1174)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.api.Trees$Traverser.traverse(Trees.scala:2825)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1490)
	at scala.reflect.internal.Trees$TreeContextApiImpl.foreach(Trees.scala:80)
	at xsbt.ExtractUsedNames.handleMacroExpansion$1(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames.xsbt$ExtractUsedNames$$handleTreeNode$1(ExtractUsedNames.scala:95)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1489)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1487)
	at scala.reflect.api.Trees$Traverser.traverseTrees(Trees.scala:2829)
	at scala.reflect.internal.Trees$class.itraverse(Trees.scala:1174)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.api.Trees$Traverser.traverse(Trees.scala:2825)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1490)
	at scala.reflect.internal.Trees$TreeContextApiImpl.foreach(Trees.scala:80)
	at xsbt.ExtractUsedNames.handleMacroExpansion$1(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames.xsbt$ExtractUsedNames$$handleTreeNode$1(ExtractUsedNames.scala:95)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1489)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1487)
	at scala.reflect.api.Trees$Traverser.traverseTrees(Trees.scala:2829)
	at scala.reflect.internal.Trees$class.itraverse(Trees.scala:1174)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.internal.SymbolTable.itraverse(SymbolTable.scala:13)
	at scala.reflect.api.Trees$Traverser.traverse(Trees.scala:2825)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1490)
	at scala.reflect.internal.Trees$TreeContextApiImpl.foreach(Trees.scala:80)
	at xsbt.ExtractUsedNames.handleMacroExpansion$1(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames.xsbt$ExtractUsedNames$$handleTreeNode$1(ExtractUsedNames.scala:95)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at xsbt.ExtractUsedNames$$anonfun$handleMacroExpansion$1$2.apply(ExtractUsedNames.scala:64)
	at scala.reflect.internal.Trees$ForeachTreeTraverser.traverse(Trees.scala:1489)

```





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1754#issuecomment-51034387
  
Do we need to explicitly point out that Spark does not support YARN versions `2.0.x` and `2.1.x`?





[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-51037911
  
QA tests have started for PR 1616. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17852/consoleFull





[GitHub] spark pull request: [SQL] [SPARK-2826] Reduce the memory copy whil...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1765#issuecomment-51039744
  
QA results for PR 1765:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17851/consoleFull





[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-51044768
  
QA results for PR 1616:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17852/consoleFull





[GitHub] spark pull request: SPARK-2792. Fix reading too much or too little...

2014-08-04 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/1722#discussion_r15750212
  
--- Diff: core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala ---
@@ -35,16 +35,15 @@ private[spark] class JavaSerializationStream(out: OutputStream, counterReset: Int)
   /**
    * Calling reset to avoid memory leak:
    * http://stackoverflow.com/questions/1281549/memory-leak-traps-in-the-java-standard-api
-   * But only call it every 10,000th time to avoid bloated serialization streams (when
+   * But only call it every 100th time to avoid bloated serialization streams (when
    * the stream 'resets' object class descriptions have to be re-written)
    */
   def writeObject[T: ClassTag](t: T): SerializationStream = {
     objOut.writeObject(t)
+    counter += 1
     if (counterReset > 0 && counter >= counterReset) {
--- End diff --

This is the right behavior, but it is a slight change ... I don't think anyone is expecting the earlier behavior, though!
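For readers following along, a minimal standalone sketch of the reset-every-N pattern as it reads after this patch (the values and the `writeOne` wrapper are illustrative, not Spark's actual class):

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Reset the ObjectOutputStream every `counterReset` writes so its internal
// back-reference table cannot grow without bound, while avoiding a reset on
// every write (each reset forces class descriptors to be re-written).
val counterReset = 100
var counter = 0
val objOut = new ObjectOutputStream(new ByteArrayOutputStream())

def writeOne(t: AnyRef): Unit = {
  objOut.writeObject(t)
  counter += 1
  if (counterReset > 0 && counter >= counterReset) {
    objOut.reset() // drops the accumulated object references
    counter = 0
  }
}
```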





[GitHub] spark pull request: SPARK-2792. Fix reading too much or too little...

2014-08-04 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1722#issuecomment-51047651
  
LGTM !





[GitHub] spark pull request: [SPARK-2815]: Compilation failed upon the hado...

2014-08-04 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1754#issuecomment-51048750
  
@witgo I don't think #151 is to be committed, if I understand correctly. 
It's not 100% clear which versions of YARN 2.0.x actually work with 
`yarn-alpha`, and which if any work with `yarn`. If anything it's worth a note 
that the pre-stable YARN versions are not guaranteed to work, but might.





[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-08-04 Thread li-zhihui
Github user li-zhihui commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-51051752
  
@JoshRosen added comment.





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51055752
  
@pwendell OK, the `java -firstCpElement` example really convinced me :) I used to think asking users to care about the order of the jars was a little too much, but every sane programmer on the JVM should care about this anyway.

I'll try to deliver a version as you described soon.





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-51064528
  
QA tests have started for PR 1750. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17853/consoleFull





[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51065193
  
Hi,
For some reason the CORE module testing has ballooned in overall testing time: it took over 7.5 hours to run. There was one timeout error out of 736 tests, and it is quite unlikely to have anything to do with the code added in this PR.

Here is the test that failed, and then the overall results:

    DriverSuite:
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    - driver should exit after finishing *** FAILED ***
      TestFailedDueToTimeoutException was thrown during property evaluation. (DriverSuite.scala:40)
      Message: The code passed to failAfter did not complete within 60 seconds.
      Location: (DriverSuite.scala:41)
      Occurred at table row 0 (zero based, not counting headings), which had values (
        master = local
      )

    Tests: succeeded 723, failed 1, canceled 0, ignored 7, pending 0
    *** 1 TEST FAILED ***
    [INFO]
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Spark Project Parent POM ... SUCCESS [  1.180 s]
    [INFO] Spark Project Core ......... FAILURE [  07:35 h]

So I am not presently in a position to run regression tests, given the overall runtime will be double-digit hours. Would someone please run Jenkins on this code?







[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-51068459
  
QA tests have started for PR 1323. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17854/consoleFull





[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-08-04 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-51069490
  
Related work: #1330





[GitHub] spark pull request: SPARK-2482: Resolve sbt warnings during build

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1330#issuecomment-51070559
  
QA tests have started for PR 1330. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17855/consoleFull





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-08-04 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/1269#discussion_r15759585
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/topicmodeling/utils/serialization/TObjectIntHashMapSerializer.scala ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.topicmodeling.utils.serialization
+
+import com.esotericsoftware.kryo.io.{Input, Output}
+import com.esotericsoftware.kryo.{Kryo, Serializer}
+import gnu.trove.map.hash.TObjectIntHashMap
--- End diff --

This can be replaced with `breeze.util.Index`.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-08-04 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/1269#discussion_r15759818
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/topicmodeling/topicmodels/RobustPLSASuite.scala ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.topicmodeling.topicmodels
+
+import java.util.Random
+
+import org.apache.spark.mllib.clustering.topicmodeling.topicmodels.regulaizers.{SymmetricDirichletDocumentOverTopicDistributionRegularizer, SymmetricDirichletTopicRegularizer}
+
+class RobustPLSASuite extends AbstractTopicModelSuite[RobustDocumentParameters,
+  RobustGlobalParameters] {
+  test("feasibility") {
+val numberOfTopics = 2
+val numberOfIterations = 10
+
+val plsa = new RobustPLSA(sc,
+  numberOfTopics,
+  numberOfIterations,
+  new Random(),
+  new SymmetricDirichletDocumentOverTopicDistributionRegularizer(0.2f),
+  new SymmetricDirichletTopicRegularizer(0.2f))
+
+testPLSA(plsa)
+  }
+
+}
--- End diff --

It seems that git needs a newline at the end of the file.





[GitHub] spark pull request: [SPARK-2817] add show create table support

2014-08-04 Thread tianyi
Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/1760#issuecomment-51072215
  
What's wrong with Jenkins?





[GitHub] spark pull request: [SPARK-2583] ConnectionManager cannot distingu...

2014-08-04 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1490#issuecomment-51072948
  
Thanks for backing me up, @JoshRosen.





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-51075115
  
QA results for PR 1750:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class Sqrt(child: Expression) extends UnaryExpression {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17853/consoleFull





[GitHub] spark pull request: [SPARK-2817] add show create table support

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1760#issuecomment-51075309
  
Jenkins, test this please.





[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-51076281
  
QA results for PR 1323:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17854/consoleFull





[GitHub] spark pull request: SPARK-2482: Resolve sbt warnings during build

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1330#issuecomment-51078602
  
QA results for PR 1330:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17855/consoleFull





[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-51081881
  
Thanks for commenting.  I now realize that my concern about advisory 
locking was a little misguided, since only cooperating Spark processes will be 
coordinating through the lock file.
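For anyone unfamiliar with the term, here is a minimal sketch of advisory file locking between cooperating JVM processes; the helper name `withFileLock` is illustrative, not Spark's API:

```
import java.io.{File, RandomAccessFile}

// An advisory lock only excludes processes that also ask for the lock; a
// process that ignores the lock file can still touch the protected files.
// That is fine here, since only Spark processes use the download directory.
def withFileLock[T](lockFile: File)(body: => T): T = {
  val raf = new RandomAccessFile(lockFile, "rw")
  val lock = raf.getChannel.lock() // blocks until this process holds the lock
  try body
  finally {
    lock.release()
    raf.close()
  }
}
```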





[GitHub] spark pull request: [SPARK-2817] add show create table support

2014-08-04 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1760#issuecomment-51083724
  
LGTM :)





[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1616#issuecomment-51085554
  
This seems like an alright fix and I'd like to get it into a release, but 
I'm concerned that this doesn't correctly handle every possible feature of 
`fetchFile`.

For example, there's [some 
code](https://github.com/li-zhihui/spark/blob/cachefiles/core/src/main/scala/org/apache/spark/util/Utils.scala#L444)
 in `fetchFile` to automatically decompress `.tar.gz` files.  I don't remember 
why this code was added (or whether it's actually correct, since it seems to 
assume that files are downloaded into the current working directory), but I'm 
not sure that `fetchCachedFile` will properly handle that case; it seems like 
it would only copy the `.tar.gz` file without decompressing it in the 
executor's directory.

We could try to special-case fix this by moving the decompression logic 
into `fetchCachedFile`, but I'm worried that it will make `fetchFile` even 
harder to understand.  I think that `fetchFile` might be due for a refactoring.

Also, do you think we should just replace `fetchFile` with 
`fetchCachedFile` and keep the uncached version private?





[GitHub] spark pull request: [SPARK-2806] core - upgrade to json4s-jackson ...

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-51085913
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2806] core - upgrade to json4s-jackson ...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-51086273
  
QA tests have started for PR 1702. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17856/consoleFull





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-1473: Feature selection fo...

2014-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1484#issuecomment-51086836
  
@avulanov I have the same concern about calling `transform` before `fit`. There are two options: 1) throw an error, or 2) fit on the same dataset and then transform (`fit_transform` in scikit-learn). But I don't have a strong preference for either one.

I want to add another candidate to what you proposed:

~~~
class ChiSquaredFeatureSelection {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}
~~~

We can discuss the class hierarchy later since they are not user-facing.

A problem with all the candidates here is we cannot apply the same 
transformation on `RDD[Vector]`, which is required for prediction. I'm thinking 
about something like the following:

~~~
class ChiSquaredFeatureSelection {
  def fit[T <: Vectorized with Labeled](dataset: RDD[T], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform[T <: Vectorized](dataset: RDD[T]): RDD[T]
}
~~~
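A hypothetical call site for the first variant above, just to show the intended flow (`trainingData` and the 50 are made up):

```
// Hypothetical usage of the fit/transform split sketched above.
val selector = new ChiSquaredFeatureSelection().fit(trainingData, numFeatures = 50)
val reduced  = selector.transform(trainingData) // same labels, fewer features
```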





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-04 Thread chutium
Github user chutium commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15766362
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

Hi @yhuai, why do we need to define the schema as a `StructType` rather than directly as a `Seq[StructField]`? I tried to build a `Seq[StructField]` from JDBC metadata in #1612 (https://github.com/apache/spark/pull/1612/files#diff-3); it follows the code of your JsonRDD :)

It seems we do not need this `StructType` anywhere.





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-04 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51087364
  
@rxin @pwendell This PR is ready for review.





[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51087933
  
Jenkins, retest this please. @JoshRosen, it appears something timed out or failed during the tests.





[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51088288
  
QA tests have started for PR 1758. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17857/consoleFull





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51089001
  
@pwendell How about for Python files? What if I have one.py and two.py that reference each other, and I want spark-submit to run the main method of one.py but not two.py? Since we don't specify a class, I'm not sure how we can distinguish between the two main methods unless we impose the requirement that the primary Python file must be the first one.





[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-08-04 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r15767384
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -89,6 +88,44 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, 
SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd))(self))
 
   /**
+   * :: DeveloperApi ::
+   * Creates a [[SchemaRDD]] from an [[RDD]] containing [[Row]]s by 
applying a schema to this RDD.
+   * It is important to make sure that the structure of every [[Row]] of 
the provided RDD matches
+   * the provided schema. Otherwise, there will be runtime exception.
+   * Example:
+   * {{{
+   *  import org.apache.spark.sql._
+   *  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+   *
+   *  val schema =
+   *    StructType(
+   *      StructField("name", StringType, false) ::
+   *      StructField("age", IntegerType, true) :: Nil)
+   *
--- End diff --

For the completeness of our data types, we need `StructType` (`Seq[StructField]` is not a data type). For example, if the type of a field is a struct, we need a way to describe that the type of this field is a struct. Also, because a row is basically a struct value, it is natural to use `StructType` to represent a schema.
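To illustrate the nesting point, a small sketch in the style of the docs quoted above (field names made up):

```
// A struct-typed field is only expressible because StructType is itself a
// DataType; a bare Seq[StructField] could not appear as a field's dataType.
val addressType =
  StructType(
    StructField("city", StringType, true) ::
    StructField("zip", StringType, true) :: Nil)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("address", addressType, true) :: Nil) // nested struct field
```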





[GitHub] spark pull request: SPARK-2380: Support displaying accumulator val...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1309#issuecomment-51089309
  
QA tests have started for PR 1309. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17858/consoleFull





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51089761
  
@andrewor14 I believe Patrick means:

1. For Scala/Java applications, the primary jar should appear as the 1st entry of `--jars`
2. For Python applications, the primary Python file should appear as the 1st entry of `--py-files`





[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1758#discussion_r15768094
  
--- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
@@ -41,16 +42,26 @@ import org.apache.spark.util.{SystemClock, Utils}
 private[spark] class ConnectionManager(port: Int, conf: SparkConf,
 securityManager: SecurityManager) extends Logging {
 
+  /**
+   * Used by sendMessageReliably to track messages being sent.
+   * @param message the message that was sent
+   * @param connectionManagerId the connection manager that sent this 
message
+   * @param completionHandler callback that's invoked when the send has 
completed or failed
+   */
   class MessageStatus(
   val message: Message,
   val connectionManagerId: ConnectionManagerId,
    completionHandler: MessageStatus => Unit) {
 
+/** This is non-None if message has been ack'd */
 var ackMessage: Option[Message] = None
-var attempted = false
--- End diff --

You'll notice that I removed a bunch of fields here.  `attempted` was never 
read anywhere, and `acked` implied `ackMessage != None`.





[GitHub] spark pull request: [SPARK-2678][Core] Added -- to prevent spark...

2014-08-04 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1715#issuecomment-51090902
  
Hm, I see. Even then we still need some kind of separator right? I thought 
the whole point of handling primary resources differently here (either under 
`--primary` or `--jars` or `--py-files`) is to provide backwards compatibility 
in case the user application uses `--`. If we pick a Spark specific enough 
separator, doesn't this issue go away?





[GitHub] spark pull request: [SPARK-2806] core - upgrade to json4s-jackson ...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-51091836
  
QA results for PR 1702:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17856/consoleFull





[GitHub] spark pull request: [SPARK-2806] core - upgrade to json4s-jackson ...

2014-08-04 Thread avati
Github user avati commented on the pull request:

https://github.com/apache/spark/pull/1702#issuecomment-51092475
  
It is not clear how the failure is related to this patch...





[GitHub] spark pull request: SPARK-2380: Support displaying accumulator val...

2014-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1309#issuecomment-51092478
  
QA tests have started for PR 1309. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17859/consoleFull





[GitHub] spark pull request: SPARK-2380: Support displaying accumulator val...

2014-08-04 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1309#discussion_r15769145
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala ---
@@ -42,6 +44,13 @@ class TaskInfo(
   var gettingResultTime: Long = 0
 
   /**
+   * Intermediate updates to accumulables during this task. Note that it 
is valid for the same
+   * accumulable to be updated multiple times in a single task or for two 
accumulables with the
+   * same name but different ID's to exist in a task.
--- End diff --

super nit: no apostrophe in IDs




