[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49126648
  
QA results for PR 1313:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16707/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-16 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1244#issuecomment-49126716
  
We only want to do this if the driver shares the same directory structure 
as the executors. This is an assumption that is incorrect in many deployment 
settings. Really, we should have something like `spark.executor.home` that is 
not the same as `SPARK_HOME`.

I am not 100% sure if we can just rip this functionality out actually. I am 
under the impression that Mesos still depends on something like this, so we 
should double check before we remove it.


---


[GitHub] spark pull request: SPARK-1707. Remove unnecessary 3 second sleep ...

2014-07-16 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/634#issuecomment-49126877
  
That's a good point.  Changing it for YARN seems like the right thing, and 
80% sounds reasonable to me.

Another thing is the wait time.  Previously it was 6 seconds, but now 
spark.scheduler.maxRegisteredExecutorsWaitingTime defaults to 5 times that.  30 
seconds seems a little excessive to me in general - at least for jobs without 
caching, after a couple seconds the wait outweighs scheduling some non-local 
tasks.  What do you think about decreasing this to 6 seconds in general?  Or at 
least for YARN.
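
The two settings under discussion could be tuned together in `spark-defaults.conf`. A hypothetical sketch using the config names from this 2014-era thread (the ratio key name is an assumption from the same era, and both keys were later renamed, so verify against your Spark version before relying on them):

```
# Wait at most 6 seconds (value in ms), or until 80% of requested executors register.
spark.scheduler.minRegisteredExecutorsRatio        0.8
spark.scheduler.maxRegisteredExecutorsWaitingTime  6000
```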


---


[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1431

[SPARK-2517] Removed some compiler type erasure warnings.

Also took the chance to rename some variables to avoid unintentional 
shadowing. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark deprecate-warning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1431.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1431


commit 44abdcce6f3409a0fa528252e88aaba5cd615559
Author: Reynold Xin r...@apache.org
Date:   2014-07-16T06:03:10Z

[SPARK-2517] Removed some compiler type erasure warnings.




---


[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1431#issuecomment-49127255
  
QA tests have started for PR 1431. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16716/consoleFull


---


[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1430#issuecomment-49127540
  
The code looks good to me.
I am wondering if we could merge ConstantFolding & NullPropagation and make the combined rule use `transformExpressionsUp`, since they somewhat rely on each other, and people may get confused about where to place a Null or Constant optimization.


---


[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...

2014-07-16 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/1432

[SPARK-2518][SQL] Fix foldability of Substring expression.

This is a follow-up of #1428.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-2518

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1432.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1432


commit 37d1ace86d909bc4204bbb9f55c56a76aae8c106
Author: Takuya UESHIN ues...@happy-camper.st
Date:   2014-07-16T06:22:17Z

Fix foldability of Substring expression.




---


[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1430#issuecomment-49128331
  
See the discussion at #1428.


---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-49128371
  
QA results for PR 1259:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class TaskRunner(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16710/consoleFull


---


[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1432#issuecomment-49128436
  
LGTM


---


[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-49130099
  
QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull


---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49131064
  
Besides breaking the API, I'm also worried about two things:

1. The increase in storage. We had some discussion before v1.0 about whether we should switch to long ids or not. ALS is not computation-heavy for small k, but it is communication-heavy. I posted some screenshots on the JIRA page, where ALS shuffles ~200GB of data in each iteration. With Long ids, this number may become ~300GB, and hence ALS may slow down by 50%. Instead of upgrading the id type to Long, I'm actually thinking about downgrading the rating type to Float.

2. Is collision really bad? ALS needs a somewhat dense matrix to compute good recommendations. If there are 3 billion users but each user only gives 1 or 2 ratings, ALS is very likely to overfit. In this case, making a random projection on the user side would certainly help, and hashing is one of the commonly used techniques for random projection. There will be bad recommendations whether or not hash collisions exist. So I'm really interested in some measurements of the downside of hash collisions.
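
To put a number on the collision question, here is a rough Python sketch (illustrative only, not Spark or MLlib code; all names are hypothetical) of hashing 64-bit ids down to 32-bit ints and the birthday-bound estimate of how many id pairs would collide:

```python
import hashlib

def hash_id_to_int32(user_id: int) -> int:
    """Hash a 64-bit id into a signed 32-bit int (a crude random projection)."""
    digest = hashlib.md5(user_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") - 2**31

def expected_colliding_pairs(n_ids: int, buckets: float = 2.0**32) -> float:
    """Birthday approximation: about n^2 / (2 * buckets) colliding pairs."""
    return n_ids * n_ids / (2.0 * buckets)

# With 3 billion users hashed into 2^32 buckets, collisions are unavoidable:
print(f"{expected_colliding_pairs(3_000_000_000):.3g}")  # ~1.05e9 pairs
```

At 3 billion users the estimate is about a billion colliding pairs, which is why measuring the actual impact on recommendation quality, as suggested above, matters more than whether collisions occur at all.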


---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-16 Thread cfregly
GitHub user cfregly opened a pull request:

https://github.com/apache/spark/pull/1434

[SPARK-1981] Add AWS Kinesis streaming support



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cfregly/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1434.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1434


commit b3b0ff118cac3c0a5a10f9912b383bb0665c9a1b
Author: Chris Fregly ch...@fregly.com
Date:   2014-07-16T07:03:04Z

[SPARK-1981] Add AWS Kinesis streaming support




---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1434#issuecomment-49132892
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/1435

SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical...

... aggregation code

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-2519

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1435


commit 640706a19f96fd242e8619188c82e39cb6386fd3
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-07-16T07:12:46Z

SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical 
aggregation code




---


[GitHub] spark pull request: SPARK-1127 Add spark-hbase.

2014-07-16 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/194#issuecomment-49133737
  
Hi, the referenced PR SPARK-1416 includes the following comment by @MLnick:

"But looking at the HBase PR you referenced, I don't see the value of having that live in Spark. And why is it not simply using an OutputFormat instead of custom config and writing code? (I might be missing something here, but it seems to add complexity and maintenance burden unnecessarily)"

Patrick: would you mind telling us whether that comment is going to affect this PR? We are going to be providing a significant chunk of HBase functionality and would like to know whether to build off of this PR or not. Thanks.


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49133777
  
QA tests have started for PR 1435. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16719/consoleFull


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49133829
  
Nice. LGTM.


---


[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...

2014-07-16 Thread rahulsinghaliitd
Github user rahulsinghaliitd commented on the pull request:

https://github.com/apache/spark/pull/1094#issuecomment-49133859
  
@tgravescs updated according to your comments and rebased to current HEAD 
of master branch. Thanks for following up on this PR.


---


[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1431#issuecomment-49134307
  
QA results for PR 1431:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16716/consoleFull


---


[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1432#issuecomment-49134887
  
LGTM


---


[GitHub] spark pull request: Add HiveDecimal & HiveVarchar support in unwra...

2014-07-16 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/1436

Add HiveDecimal & HiveVarchar support in unwrapping data



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark unwrapdata

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1436.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1436


commit 39d6475d8cf357488fd1ec736b4d910f8237fc5b
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-16T07:59:50Z

Add HiveDecimal & HiveVarchar support in unwrap data

commit afc39da00f53f15edb466768c24cdd73ec5bc119
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-16T08:21:25Z

Polish the code




---


[GitHub] spark pull request: Add HiveDecimal & HiveVarchar support in unwra...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1436#issuecomment-49135679
  
QA tests have started for PR 1436. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16720/consoleFull


---


[GitHub] spark pull request: Tightening visibility for various Broadcast re...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1438#issuecomment-49137498
  
QA tests have started for PR 1438. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16723/consoleFull


---


[GitHub] spark pull request: [SQL] Add HiveDecimal & HiveVarchar support in...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1436#issuecomment-49137154
  
QA tests have started for PR 1436. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16722/consoleFull


---


[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan

2014-07-16 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/1439

[SPARK-2523] [SQL] Hadoop table scan

In HiveTableScan.scala, an ObjectInspector was created for all of the partition-based records, which probably causes a ClassCastException if the object inspector is not identical among the table & its partitions.

This is the follow-up to:
https://github.com/apache/spark/pull/1408
https://github.com/apache/spark/pull/1390

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark hadoop_table_scan

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1439.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1439


commit 39d6475d8cf357488fd1ec736b4d910f8237fc5b
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-16T07:59:50Z

Add HiveDecimal & HiveVarchar support in unwrap data

commit afc39da00f53f15edb466768c24cdd73ec5bc119
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-16T08:21:25Z

Polish the code

commit d66835b420e22f98e10556783102e7dc356e6e6a
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-16T08:24:30Z

Fix Bug in TableScan while Paritition SerDe is not compatiable with each 
other




---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49137261
  
@dbtsai The assertions with `===` were all tested to work, but I agree it is more robust to allow numerical errors. One downside of this change is that `===` reports the compared values when something is wrong, but `almostEquals` only returns true/false. It would be great if we could make the implementation similar to `===`.

Btw, ScalaTest 2.x has this tolerance feature, where you can use `+-` to indicate a range. We are not using ScalaTest 2.x, but it is a useful feature.
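
For illustration, the value-reporting behavior can be combined with a tolerance; a minimal sketch in Python (a hypothetical helper, not the actual MLlib `TestingUtils` code):

```python
def assert_almost_equal(actual: float, expected: float, eps: float = 1e-6) -> None:
    """Approximate-equality check that, like ScalaTest's ===, reports the
    compared values on failure instead of just returning False."""
    if abs(actual - expected) > eps:
        raise AssertionError(f"{actual} did not equal {expected} +- {eps}")

assert_almost_equal(0.1 + 0.2, 0.3)  # passes: the rounding error is ~5.6e-17
```

Raising with a message keeps the diagnostic value of `===` while still tolerating floating-point noise.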


---


[GitHub] spark pull request: [SPARK-2523] [SQL] Hadoop table scan bug fixin...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49138252
  
QA tests have started for PR 1439. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16724/consoleFull


---


[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-49138529
  
@yhuai @concretevitamin @rxin I've created another PR for this follow-up; we can discuss it more at:
https://github.com/apache/spark/pull/1439


---


[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49138992
  
`ObjectInspector` is not required by `Row` in Catalyst any more (unlike in Shark), and it is tightly coupled with the Deserializer & the raw data, so I moved the `ObjectInspector` into `TableReader`.


---


[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-16 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1399#discussion_r14987086
  
--- Diff: sbin/start-thriftserver.sh ---
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Figure out where Spark is installed
+FWDIR="$(cd `dirname $0`/..; pwd)"
+
+CLASS=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
+"$FWDIR"/bin/spark-class $CLASS "$@"
--- End diff --

Thanks, I've noticed the discussion. Marked this PR as WIP, will update 
soon.


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49141155
  
QA results for PR 1435:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16719/consoleFull


---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r14988046
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala
 ---
@@ -20,8 +20,20 @@ package org.apache.spark.mllib.evaluation
 import org.scalatest.FunSuite
 
 import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
 
 class BinaryClassificationMetricsSuite extends FunSuite with 
LocalSparkContext {
+
+  implicit class SeqDoubleWithAlmostEquals(val x: Seq[Double]) {
+def almostEquals(y: Seq[Double], eps: Double = 1E-6): Boolean =
--- End diff --

1.0e-6 is way bigger than an ulp for a double; 1.0e-12 is more like it. I understand a complex calculation might legitimately vary by significantly more than an ulp depending on the implementation. As @mengxr says, where you mean to allow significantly more than machine precision's worth of noise, that's probably best done with an explicitly larger epsilon. But this is certainly a good step forward already.
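
As a quick sanity check on those magnitudes, Python's `math.ulp` (3.9+) reports the spacing of adjacent doubles near a value:

```python
import math

# Near 1.0 the gap between adjacent doubles is about 2.2e-16 (2^-52), so a
# tolerance of 1e-12 already permits thousands of ulps of drift, while 1e-6
# permits billions -- far more than machine precision.
ulp = math.ulp(1.0)
print(ulp)                  # 2.220446049250313e-16
print(round(1e-12 / ulp))   # 4504
print(round(1e-6 / ulp))    # 4503599627
```

The same reasoning applies to Scala doubles, since both are IEEE 754 binary64.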


---


[GitHub] spark pull request: [SQL] Add HiveDecimal & HiveVarchar support in...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1436#issuecomment-49143482
  
QA results for PR 1436:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16720/consoleFull


---


[GitHub] spark pull request: fix compile error of streaming project

2014-07-16 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/153#issuecomment-49144416
  
Seems harmless as it only makes the return type of the method explicit. I 
can't see why it would be specific to building with one version of Hadoop 
though. Maybe it isn't?




[GitHub] spark pull request: SPARK-2452, create a new valid for each instea...

2014-07-16 Thread ScrapCodes
GitHub user ScrapCodes opened a pull request:

https://github.com/apache/spark/pull/1441

SPARK-2452, create a new valid for each  instead of using lineId.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ScrapCodes/spark-1 SPARK-2452/multi-statement

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1441.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1441


commit fb8b5e7d9fc22db0103d578c964dd1c7b1503ee0
Author: Prashant Sharma prash...@apache.org
Date:   2014-07-16T10:15:57Z

SPARK-2452, create a new valid for each  instead of using lineId, because 
line ids can be the same sometimes.






[GitHub] spark pull request: SPARK-2452, create a new valid for each instea...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1441#issuecomment-49146776
  
QA tests have started for PR 1441. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16726/consoleFull




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49147325
  
QA results for PR 1439:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class HadoopTableReader(@transient attributes: Seq[Attribute],

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16724/consoleFull




[GitHub] spark pull request: SPARK-2452, create a new valid for each instea...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1441#issuecomment-49148347
  
QA tests have started for PR 1441. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16727/consoleFull




[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49149396
  
Let me close this PR for now. I will fork or wrap as necessary. Keep it in 
mind, and maybe in a 2.x release this can be revisited. (Matei, I ran into more 
problems with the `Rating` class retrofit anyway.)

Yes storage is the downside. Your comments on JIRA about effects of 
serialization in compressing away the difference are promising. I completely 
agree with using `Float` for ratings and even feature vectors.

Yes I understand why random projections are helpful. It doesn't help 
accuracy, but may only trivially hurt it in return for some performance gain. 
If I have just 1 rating, it doesn't make my recs better to arbitrarily add your 
ratings to mine. Sure that's denser, and maybe you're getting less overfitting, 
but it's fitting the wrong input for both of us.

A collision here and there is probably acceptable. One in a million 
customers? OK. 1%? maybe a problem. I agree, you'd have to quantify this to 
decide. If I'm an end user of MLlib bringing even millions of things to my 
model, I have to decide. And if it's a problem, have to maintain a lookup table 
to use it.

It seemed simplest to moot the problem with a much bigger key space and 
engineer around the storage issue. A bit more memory is cheap; accuracy and 
engineer time are expensive. 
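The collision rates mentioned above can be quantified with a standard birthday-problem estimate (a back-of-envelope sketch, not MLlib code): with n random IDs drawn from a key space of size m, the expected number of colliding pairs is roughly n(n-1)/(2m).

```scala
// Expected number of colliding ID pairs when hashing n items into a key
// space of size m (birthday-problem approximation).
def expectedCollisions(n: Double, m: Double): Double = n * (n - 1) / (2 * m)

// A million IDs hashed into 32 bits already yield on the order of a hundred
// colliding pairs; a 64-bit (Long) key space makes collisions negligible.
val into32Bits = expectedCollisions(1e6, math.pow(2, 32))
val into64Bits = expectedCollisions(1e6, math.pow(2, 64))
```

This is the sense in which a much bigger key space "moots the problem": the expected collision count drops by roughly nine orders of magnitude.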




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49153027
  
@mridulm @lirui-intel I separated the noPref tasks and those with the 
unavailable preference; this will treat the tasks with the unavailable 
preference as non-local ones before their preferences become available.






[GitHub] spark pull request: [SPARK-2290] Worker should directly use its ow...

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1244#issuecomment-49153988
  
@andrewor14 yeah, I agree with you. I just thought that somewhere (a document 
in the earlier versions? I cannot find it now) the user has to set this env 
variable, so I suggested prioritizing the worker-side SPARK_HOME; if this is not 
set, Spark will try to read the application's SPARK_HOME setting (which may 
generate errors if the directory structure is not the same).


I also noticed this JIRA https://issues.apache.org/jira/browse/SPARK-2454 
(left some comments there)





[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49154451
  
QA results for PR 1440:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16725/consoleFull




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread chutium
Github user chutium commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49154812
  
Hi, this is really very useful for us. I tried this implementation from 
@willb; in spark-shell I still got a java.lang.UnsupportedOperationException from 
the query plan, so I made some changes in SqlParser: 
https://github.com/chutium/spark/commit/1de83a7560f85cd347bca6dde256d551da63a144





[GitHub] spark pull request: SPARK-2407: Added Parse of SQL SUBSTR()

2014-07-16 Thread chutium
GitHub user chutium opened a pull request:

https://github.com/apache/spark/pull/1442

SPARK-2407: Added Parse of SQL SUBSTR()

follow-up of #1359

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chutium/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1442


commit 1de83a7560f85cd347bca6dde256d551da63a144
Author: chutium teng@gmail.com
Date:   2014-07-16T11:44:09Z

SPARK-2407: Added Parse of SQL SUBSTR()






[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49155058
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49155240
  
Awesome! Please submit a pull request with that addition.





[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49155576
  
Hmm, just realized `Timestamp.toString` normalizes date and time according 
to the current timezone, which makes almost all timestamp-related tests timezone 
sensitive. (I wouldn't have noticed this if I were in the US...) Guess we have to 
blacklist them for now, and this will revert part of #1396.




[GitHub] spark pull request: SPARK-2452, create a new valid for each instea...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1441#issuecomment-49155771
  
QA results for PR 1441:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16726/consoleFull




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r14993202
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -249,3 +263,7 @@ case class Cast(child: Expression, dataType: DataType) 
extends UnaryExpression {
 if (evaluated == null) null else cast(evaluated)
   }
 }
+
+object Cast {
+  private[sql] val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
--- End diff --

Hi, `SimpleDateFormat` is not thread-safe, so `def` should be used instead 
of `val`.
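The thread-safety point can be sketched as follows (an illustrative alternative, not the patch itself): since `SimpleDateFormat` keeps mutable parse state, either allocate a fresh instance per call via `def`, or keep one instance per thread so nothing is re-allocated on the hot path:

```scala
import java.text.SimpleDateFormat

// One SimpleDateFormat per thread: safe under concurrency without the cost
// of constructing a new formatter on every call. (This mirrors the
// thread-local approach mentioned below for Timestamp.java.)
val tlFormat: ThreadLocal[SimpleDateFormat] = new ThreadLocal[SimpleDateFormat] {
  override def initialValue(): SimpleDateFormat =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
}

def parseTs(s: String): java.util.Date = tlFormat.get.parse(s)
def formatTs(d: java.util.Date): String = tlFormat.get.format(d)
```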




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49156553
  
Confirmed that the following test cases are timezone sensitive and 
blacklisted them (by first removing all timestamp-related golden answers, running 
them in my local timezone to generate new golden answers, then manually changing 
my timezone settings and rerunning these tests):

- `timestamp_1`
- `timestamp_2`
- `timestamp_3` *
- `timestamp_udf` *

[*] Reverted from #1396.




[GitHub] spark pull request: SPARK-2452, create a new valid for each instea...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1441#issuecomment-49156902
  
QA results for PR 1441:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16727/consoleFull




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r14993659
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -249,3 +263,7 @@ case class Cast(child: Expression, dataType: DataType) 
extends UnaryExpression {
 if (evaluated == null) null else cast(evaluated)
   }
 }
+
+object Cast {
+  private[sql] val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
--- End diff --

Just checked `Timestamp.java`, it's indeed handled with a thread local 
variable. Thanks for pointing this out!




[GitHub] spark pull request: discarded exceeded completedDrivers

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1114#issuecomment-49157266
  
A document for the newly introduced `spark.deploy.retainedDrivers` is missing?




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49157381
  
QA tests have started for PR 1440. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16728/consoleFull




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49158865
  
QA tests have started for PR 1440. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16729/consoleFull




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread chutium
Github user chutium commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49159677
  
PR submitted #1442 





[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49159849
  
That was my thought as well, @egraldlo.  Thanks for submitting this, 
@chutium!




[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread egraldlo
Github user egraldlo commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49160379
  
thx @willb, maybe `protected val SUBSTRING = Keyword("SUBSTRING")` as well, 
but this will cause code redundancy.




[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49160982
  
@egraldlo, couldn't it be `(SUBSTR | SUBSTRING) ~ // ... ` in that case?
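The alternation @willb suggests can be sketched with a toy hand-rolled parser (illustrative only, not Spark's SqlParser, which uses Scala parser combinators):

```scala
// Consume a keyword (case-insensitive) from the front of the input,
// returning the rest, or None if it doesn't match.
def keyword(k: String)(in: String): Option[String] =
  if (in.toUpperCase.startsWith(k)) Some(in.drop(k.length)) else None

// (SUBSTRING | SUBSTR): try the longer spelling first so that "SUBSTRING"
// is never half-consumed as "SUBSTR" followed by a stray "ING".
def substrKeyword(in: String): Option[String] =
  keyword("SUBSTRING")(in).orElse(keyword("SUBSTR")(in))
```

Ordering the alternatives longest-first is the one design point worth noting; parser-combinator alternation has the same prefix pitfall.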




[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread egraldlo
Github user egraldlo commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49161743
  
fine, that's great!




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49168910
  
QA results for PR 1440:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16728/consoleFull




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1440#issuecomment-49171137
  
QA results for PR 1440:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16729/consoleFull




[GitHub] spark pull request: discarded exceeded completedDrivers

2014-07-16 Thread lianhuiwang
Github user lianhuiwang commented on the pull request:

https://github.com/apache/spark/pull/1114#issuecomment-49173609
  

@CodingCat thanks, I have created a JIRA issue:
https://issues.apache.org/jira/browse/SPARK-2524




[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49174657
  
Hmmm... not sure that I would go so far as to call it "nice". This does 
make the code slightly more difficult to read and understand, so can we hope 
that you've got some relative performance numbers that justify this compromise, 
@sryza?




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1439#issuecomment-49175346
  
Could you elaborate on when we will see an exception?




[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...

2014-07-16 Thread lianhuiwang
GitHub user lianhuiwang opened a pull request:

https://github.com/apache/spark/pull/1443

[SPARK-2524] missing document about spark.deploy.retainedDrivers

https://issues.apache.org/jira/browse/SPARK-2524
The configuration `spark.deploy.retainedDrivers` is undocumented but actually 
used:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60
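A documentation entry might look like the following (a sketch only; the default of 200 is what Master.scala appears to fall back to, and should be verified against the release being documented):

```
spark.deploy.retainedDrivers    200    Maximum number of completed drivers the
                                       standalone Master retains in memory for
                                       display in the web UI; older entries are
                                       discarded.
```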

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lianhuiwang/spark SPARK-2524

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1443


commit f2b597022b4fc4023c238e5b5a9824946f84f84e
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2014-05-23T14:02:57Z

bugfix worker DriverStateChanged state should match DriverState.FAILED

commit 480ce949a83c0d854078b38f5665f3369cf759eb
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2014-05-24T15:24:37Z

address aarondav comments

commit 8bbfe76dd8c8af815fa8404eb9a7922e58f938f7
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2014-06-10T16:01:36Z

Merge remote-tracking branch 'upstream/master'

commit eacf9339a8c062cf3f28343a4f8157d214d25b00
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2014-07-13T14:13:03Z

Merge remote-tracking branch 'upstream/master'

commit 44a3f50c689849228c42d072bdd355781dbacec6
Author: unknown administra...@taguswang-pc1.tencent.com
Date:   2014-07-16T14:22:18Z

Merge remote-tracking branch 'upstream/master'

commit 5f6bbb7119ecd188af4967ac15f3ff1986ad400d
Author: Wang Lianhui lianhuiwan...@gmail.com
Date:   2014-07-16T14:40:03Z

missing document about spark.deploy.retainedDrivers

The configuration on spark.deploy.retainedDrivers is undocumented but 
actually used

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60






[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1443#issuecomment-49176249
  
QA tests have started for PR 1443. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16730/consoleFull




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r15001961
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends 
BasicColumnStats(STRING) {
   }
 
   override def contains(row: Row, ordinal: Int) = {
-!(upperBound eq null) && {
+(upperBound ne null) && {
--- End diff --

Nit: Spark style would probably prefer != here.
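The nit rests on the difference between reference and value equality in Scala (a minimal illustration, separate from the patch):

```scala
// `eq`/`ne` compare references; `==`/`!=` go through equals and are
// null-safe in Scala. For a pure null check they behave identically,
// which is why `upperBound != null` is the more readable spelling.
val s: String = null
val a = new String("x")
val b = new String("x")
val refEqual  = a eq b     // false: two distinct objects
val valEqual  = a == b     // true: value equality via equals
val nullCheck = s eq null  // true: same result as `s == null` here
```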




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r15002400
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -93,6 +93,10 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
 partitions_json,
 
 // Timezone specific test answers.
+timestamp_1,
--- End diff --

Is there someway we could fix the timezone in the test harness instead of 
turning all of these off?
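One way to do what is being suggested, sketched below (the zone choice is arbitrary and the harness integration is assumed): pin the JVM default timezone before running timestamp tests and restore it afterwards, so `Timestamp.toString` renders the same on every machine.

```scala
import java.util.TimeZone

// Pin the default timezone for the duration of a test, then restore it.
val saved = TimeZone.getDefault
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
// valueOf parses and toString renders in the (now pinned) default zone,
// so the round trip is deterministic regardless of the host's settings.
val rendered = java.sql.Timestamp.valueOf("2014-07-16 10:15:57").toString
TimeZone.setDefault(saved)
```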




[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

2014-07-16 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/1418#issuecomment-49184593
  
Another example of this problem is the PageRank example bundled with Spark. 
Since the problem with the Java serializer still exists, you need to call 
`checkpoint()` on the RDD to avoid causing a StackOverflowError after too many 
iterations.
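The failure mode described here can be reproduced outside Spark (an illustrative model, not Spark code): Java serialization walks an object graph recursively, so a reference chain shaped like a long RDD lineage consumes one stack frame per link and eventually overflows. Checkpointing truncates the lineage, which is exactly cutting this chain short.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A linked chain stands in for a long RDD lineage: each node references
// its predecessor. Scala case classes are Serializable by default.
case class Node(value: Int, prev: Node)

// Default Java serialization recurses through `prev` once per link.
def serialize(n: Node): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(n)
  out.close()
  bytes.size
}

// Build the chain iteratively (no recursion here) so only serialization
// is stressed.
def chain(depth: Int): Node =
  (1 to depth).foldLeft(Node(0, null)) { (prev, i) => Node(i, prev) }
```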




[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49188774
  
You are right that this rule does more than null propagation now.  I'm not 
sure what a better name would be.  `DegenerateExpressionSimplification`?

Regarding moving null propagation into the expressions, you could do it... 
but what would it look like?  You specify which of the children make the entire 
expression null if they are null?  Seems like a lot of refactoring for little 
benefit...




[GitHub] spark pull request: [SPARK-2524] missing document about spark.depl...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1443#issuecomment-49191235
  
QA results for PR 1443:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16730/consoleFull




[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-49192571
  
QA tests have started for PR 855. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16731/consoleFull




[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1431#issuecomment-49193038
  
How about we close this one and merge #1444?




[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1444#issuecomment-49193205
  
QA tests have started for PR 1444. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16732/consoleFull




[GitHub] spark pull request: Improve ALS algorithm resource usage

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-49193225
  
QA tests have started for PR 929. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16733/consoleFull




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1370#issuecomment-49193554
  
Thanks!  I've merged this into master.




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1370




[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49194701
  
I'm going off of @mateiz's report on SPARK-2048 that we found [this] to 
be much slower than accessing fields directly.




[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49195389
  
Got it.  Thanks.  That also helps to put some bound (for now) on where we 
will make such performance optimizations.




[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/1022




[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread witgo
GitHub user witgo reopened a pull request:

https://github.com/apache/spark/pull/1022

SPARK-1719: spark.*.extraLibraryPath isn't applied on yarn

Fix: spark.executor.extraLibraryPath isn't applied on yarn

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark SPARK-1719

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1022


commit b23e9c3e4085c0a7faf2c51fd350ad1233aa7a40
Author: Prashant Sharma prashan...@imaginea.com
Date:   2014-07-11T18:52:35Z

[SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add 
SBT_MAVEN_PROPERTIES

NOTE: It is not possible to use both env variable  `SBT_MAVEN_PROFILES`  
and `-P` flag at same time. `-P` if specified takes precedence.

Author: Prashant Sharma prashan...@imaginea.com

Closes #1374 from ScrapCodes/SPARK-2437/rename-MAVEN_PROFILES and squashes 
the following commits:

8694bde [Prashant Sharma] [SPARK-2437] Rename MAVEN_PROFILES to 
SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES

commit cbff18774b0a2f346901ddf2f566be50561a57c7
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date:   2014-07-12T04:10:26Z

[SPARK-2457] Inconsistent description in README about build option

Now, we should use -Pyarn instead of SPARK_YARN when building but README 
says as follows.

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and 
other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

  # Apache Hadoop 2.0.5-alpha
  $ sbt/sbt -Dhadoop.version=2.0.5-alpha -Pyarn assembly

  # Cloudera CDH 4.2.0 with MapReduce v2
  $ sbt/sbt -Dhadoop.version=2.0.0-cdh4.2.0 -Pyarn assembly

  # Apache Hadoop 2.2.X and newer
  $ sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

Author: Kousuke Saruta saru...@oss.nttdata.co.jp

Closes #1382 from sarutak/SPARK-2457 and squashes the following commits:

e7b2d64 [Kousuke Saruta] Replaced SPARK_YARN=true with -Pyarn in README

commit 55960869358d4f8aa5b2e3b17d87b0b02ba9acdd
Author: DB Tsai dbt...@dbtsai.com
Date:   2014-07-12T06:04:43Z

[SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max

It basically moved the private ColumnStatisticsAggregator class from 
RowMatrix to a publicly available DeveloperApi with documentation and unit tests.

Changes:
1) Moved the private implementation from 
org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to 
org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
2) When creating OnlineSummarizer object, the number of columns is not 
needed in the constructor. It's determined when users add the first sample.
3) Added the APIs documentation for MultivariateOnlineSummarizer.
4) Added the unit tests for MultivariateOnlineSummarizer.

Author: DB Tsai dbt...@dbtsai.com

Closes #955 from dbtsai/dbtsai-summarizer and squashes the following 
commits:

b13ac90 [DB Tsai] dbtsai-summarizer

commit d38887b8a0d00a11d7cf9393e7cb0918c3ec7a22
Author: Li Pu l...@twitter.com
Date:   2014-07-12T06:26:47Z

use specialized axpy in RowMatrix for SVD

After running some more tests on large matrix, found that the BV axpy 
(breeze/linalg/Vector.scala, axpy) is slower than the BSV axpy 
(breeze/linalg/operators/SparseVectorOps.scala, sv_dv_axpy), 8s v.s. 2s for 
each multiplication. The BV axpy operates on an iterator while BSV axpy 
directly operates on the underlying array. I think the overhead comes from 
creating the iterator (with a zip) and advancing the pointers.
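The performance difference described above can be sketched in plain Scala. This is an illustrative toy, not Breeze's actual code: `axpyViaIterator` and `sparseAxpy` are invented names, and the sketch assumes the sparse vector exposes parallel `indices`/`values` arrays, mirroring why the array-based path avoids per-element iterator and tuple overhead.

```scala
object AxpySketch {
  // Iterator-based axpy (y += a * x), analogous to the slower path: each
  // element pays for tuple allocation and iterator advancement.
  def axpyViaIterator(a: Double, x: Iterator[(Int, Double)], y: Array[Double]): Unit =
    x.foreach { case (i, v) => y(i) += a * v }

  // Array-based sparse axpy, analogous to the faster path: a tight while
  // loop over the underlying index/value arrays, no allocation per element.
  def sparseAxpy(a: Double, indices: Array[Int], values: Array[Double], y: Array[Double]): Unit = {
    var k = 0
    while (k < indices.length) {
      y(indices(k)) += a * values(k)
      k += 1
    }
  }
}
```

Both produce the same result; the array-based version simply does less work per nonzero, which matters when the multiply is repeated many times as in SVD iterations.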

Author: Li Pu l...@twitter.com
Author: Xiangrui Meng m...@databricks.com
Author: Li Pu li...@outlook.com

Closes #1378 from vrilleup/master and squashes the following commits:

6fb01a3 [Li Pu] use specialized axpy in RowMatrix
5255f2a [Li Pu] Merge remote-tracking branch 'upstream/master'
7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the 
computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation 

[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-49196995
  
QA tests have started for PR 1022. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16735/consoleFull




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15013544
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -81,9 +82,8 @@ class LogisticRegressionSuite extends FunSuite with 
LocalSparkContext with Match
 val model = lr.run(testRDD)
 
 // Test the weights
-val weight0 = model.weights(0)
-assert(weight0 >= -1.60 && weight0 <= -1.40, weight0 + " not in [-1.6, -1.4]")
-assert(model.intercept >= 1.9 && model.intercept <= 2.1, model.intercept + " not in [1.9, 2.1]")
+assert(model.weights(0).almostEquals(-1.5244128696247), "weight0 should be -1.5244128696247")
--- End diff --

We can have a higher relative error here instead. If the implementation is 
changed, it's also nice to have a test which can catch the slightly different 
behavior. Also, updating those numbers will not take too much time compared 
with the implementation work.
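A relative-error comparison of the kind discussed above could look roughly like this. It is a sketch only: the name `almostEquals`, the default epsilon, and the absolute-error fallback near zero are assumptions for illustration, and the PR's actual `TestingUtils` may differ.

```scala
object FloatCompareSketch {
  // Pimp Double with an approximate-equality check: relative error in
  // general, falling back to an absolute test when both values are near zero
  // (where relative error is meaningless).
  implicit class DoubleWithAlmostEquals(val x: Double) {
    def almostEquals(y: Double, eps: Double = 1e-6): Boolean = {
      val diff = math.abs(x - y)
      if (x == y) true
      else if (math.abs(x) + math.abs(y) < eps) diff < eps               // both ~0: absolute
      else diff / math.max(math.abs(x), math.abs(y)) < eps               // otherwise: relative
    }
  }
}
```

With this, a test can write `model.weights(0).almostEquals(expected)` and remain robust to harmless last-digit differences while still catching a genuinely changed implementation.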




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15013611
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -241,4 +252,37 @@ private[hive] object HadoopTableReader {
 val bufferSize = System.getProperty("spark.buffer.size", "65536")
 jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable objects) into Row objects for an 
iterable input
+   * @param iter Iterable input represented as Writable objects
+   * @param deserializer Deserializer associated with the input writable 
object
+   * @param attrs Represents the row attribute names and their zero-based 
positions in the MutableRow
+   * @param row reusable MutableRow object
+   * 
+   * @return Iterable of Row objects transformed from the given iterable 
input.
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer, 
+  attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] 
= {
+val soi = 
deserializer.getObjectInspector().asInstanceOf[StructObjectInspector]
+// get the field references according to the attributes(output of the 
reader) required
+val fieldRefs = attrs.map { case (attr, idx) => 
(soi.getStructFieldRef(attr.name), idx) }
+  
+// Map each tuple to a row object
+iter.map { value =>
+  val raw = deserializer.deserialize(value)
+  var idx = 0;
+  while(idx < fieldRefs.length) {
--- End diff --

nit: space after while




[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-49197007
  
QA tests have started for PR 1256. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16734/consoleFull




[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15013786
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala
 ---
@@ -20,8 +20,20 @@ package org.apache.spark.mllib.evaluation
 import org.scalatest.FunSuite
 
 import org.apache.spark.mllib.util.LocalSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
 
 class BinaryClassificationMetricsSuite extends FunSuite with 
LocalSparkContext {
+
+  implicit class SeqDoubleWithAlmostEquals(val x: Seq[Double]) {
+def almostEquals(y: Seq[Double], eps: Double = 1E-6): Boolean =
--- End diff --

Yeah, for one ulp, it might be 10e-15. A lot of the time, I manually type the 
numbers or just copy the first couple of digits to save line space, so that's 
why I chose 1.0e-6. Thus, I can just type around 7 digits. 

I agree with you that in this case, we may want to explicitly specify a 
larger epsilon.




[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1212




[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-16 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-49198901
  
Great - I will review in more detail after that. Would be great to get this
merged before 1.1 freeze so PySpark I/O for inputformat and outputformat is
in for the next release!


On Tue, Jul 15, 2014 at 1:07 AM, kanzhang notificati...@github.com wrote:

 @MLnick https://github.com/MLnick I'll see if I can add couple output
 converter examples as well. Thx.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1338#issuecomment-48971710.





[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-49199676
  
QA tests have started for PR 1022. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16737/consoleFull




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49200080
  
I just noticed that pendingTasksWithNotReadyPrefs is not being used now? 
It is getting updated but never actually queried from ...
Do we need to maintain it ?

The way I initially thought about this problem was: 
1) When a task has no preferred location by definition: schedule it on any 
node when there are no NODE_LOCAL tasks available for that executor.
2) When a task has a preferred location defined, but none available right 
now, treat it as an ANY task, so that other PROCESS/NODE/RACK local tasks have 
precedence over it. If/when a node/rack local host pops in, it becomes eligible 
for better schedule preference.

@CodingCat, @kayousterhout @lirui-intel any thoughts ? I might be missing 
something here !
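The fallback ordering proposed above can be sketched as a simple search over locality levels. Everything here is invented for illustration (`LocalitySketch`, `pickTask`, the bucketed `Map`) and does not mirror `TaskSetManager`'s real fields; the point is only the ordering: genuinely local tasks first, then no-pref tasks once no NODE_LOCAL work remains, then rack-local, and finally tasks demoted to ANY.

```scala
object LocalitySketch {
  sealed trait Locality
  case object ProcessLocal extends Locality
  case object NodeLocal extends Locality
  case object NoPref extends Locality
  case object RackLocal extends Locality
  case object AnyHost extends Locality

  // Search order implementing the proposal: no-pref tasks are considered
  // only after node-local work, and tasks whose preferred hosts are
  // unavailable sit in the AnyHost bucket, behind everything else.
  val searchOrder: Seq[Locality] = Seq(ProcessLocal, NodeLocal, NoPref, RackLocal, AnyHost)

  // Pick the first pending task at the best locality level available.
  def pickTask(pending: Map[Locality, Seq[Int]]): Option[(Int, Locality)] = {
    for (level <- searchOrder) {
      pending.getOrElse(level, Nil).headOption match {
        case Some(task) => return Some((task, level))
        case None       => // fall through to the next level
      }
    }
    None
  }
}
```

Under this ordering a rack-local task can never jump ahead of a node-local one, which is exactly the inversion the PR is trying to eliminate.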




[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1212#issuecomment-49200501
  
Thanks @lirui-intel merged finally :-)




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49200595
  
@mridulm this is exactly what the PR is doing here? no?

yes, it seems that pendingTasksWithNotReadyPrefs is redundant




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49201615
  
If pendingTasksWithNotReady is never used, why was it added?




[GitHub] spark pull request: Tightening visibility for various Broadcast re...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1438#issuecomment-4920
  
Merging this in master.




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49201718
  
oh, just a mistake, I'm removing it




[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...

2014-07-16 Thread pwendell
GitHub user pwendell opened a pull request:

https://github.com/apache/spark/pull/1445

SPARK-2526: Simplify options in make-distribution.sh

Right now we have a bunch of parallel logic in make-distribution.sh
that's just extra work to maintain. We should just pass through
Maven profiles in this case and keep the script simple. See
the JIRA for more details.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pwendell/spark make-distribution.sh

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1445


commit f1294ea1f1af2479f15d471dcb7bccd29be6169a
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-07-13T20:28:19Z

Simplify options in make-distribution.sh.

Right now we have a bunch of parallel logic in make-distribution.sh
that's just extra work to maintain. We should just pass through
Maven profiles in this case and keep the script simple.






[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49201924
  
Sean, I'd still be okay with adding a LongALS class if you see benefit for 
it in some use cases. Let's just see how it works in comparison.




[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1313#discussion_r15015383
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -351,6 +354,14 @@ private[spark] class TaskSetManager(
   for (index - findTaskFromList(execId, 
getPendingTasksForHost(host))) {
 return Some((index, TaskLocality.NODE_LOCAL, false))
   }
+  // Look for no-pref tasks after rack-local tasks since they can run 
anywhere.
--- End diff --

This comment is no longer correct



