[jira] [Commented] (SPARK-12239) ​SparkR - Not distributing SparkR module in YARN

2016-01-04 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081688#comment-15081688
 ] 

Sen Fang commented on SPARK-12239:
--

A quick note: I tested the specific fix I proposed earlier, and it seems to 
work fine for
{code}
sparkR.init()
sparkR.init("local[2]")
sparkR.init("yarn-client")
{code}
inside RStudio. This change would have no impact on bin/sparkR etc., since 
there the SparkContext is initialized via {{sparkR.init()}}.

Again the code change is very simple:
{code}
submitOps <- getClientModeSparkSubmitOpts(
  paste(
    if (nzchar(master)) paste("--master", master),
    Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell")
  ),
  sparkEnvirMap)
{code}


Given that this works:
{code}
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client sparkr-shell")
sc <- sparkR.init()
{code}
but this doesn't:
{code}
sc <- sparkR.init("yarn-client")
{code}
I feel it is rather clear that the current behavior is inconsistent and unintuitive for users.


As always, I am happy to put together a PR for whichever approach people feel 
is the best long-term solution.

> ​SparkR - Not distributing SparkR module in YARN
> 
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.2, 1.5.3
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered 
> the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the 
> same jobs through R directly it does not work.
> I have managed to track down the cause: when SparkR is launched through R, 
> the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue with the setting 
> "spark.yarn.dist.archives", but it does not work because it deploys the 
> file/extracted folder with the ".zip" extension, while the workers are 
> actually looking for a folder named "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", 
> sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
> '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>   at java.net.ServerSocket.accept(ServerSocket.java:513)
>   at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la 
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
> -rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
> -rwx-- 1 yarn hadoop  736 Dec  9 09:03 
> default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
> .default_container_executor_session.sh.crc
> -rwx-- 1 yarn hadoop  790 Dec  9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop  61K Dec  9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec  9 09:04 kafka-clients-0.8.2.2.jar
> -rwx-- 1 yarn hadoop 6.0K Dec  9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop   56 Dec  9 09:03 .launch_container.sh.crc
> 

[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2015-12-29 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15074462#comment-15074462
 ] 

Sen Fang commented on SPARK-5159:
-

As of Spark 1.5.2, we have a similar issue that might be related to this JIRA. 
I haven't tested this on 1.6.0 just yet and will report back if it is still an 
issue. The symptom is that if the thriftserver is started by a user who doesn't 
have permission to access a table's directory on HDFS, then even when a 
correctly privileged user establishes a SQL connection and executes a query, 
the query fails with an error saying the user who started the thriftserver 
doesn't have permission to list the folder. However, as in Hive, the listing 
should have been performed on behalf of the connecting user instead. I can 
report back more detailed steps to reproduce this problem once we have tested 
under 1.6.0 and confirmed the issue still exists.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.






[jira] [Commented] (SPARK-12239) ​SparkR - Not distributing SparkR module in YARN

2015-12-28 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15073138#comment-15073138
 ] 

Sen Fang commented on SPARK-12239:
--

Thanks [~sunrui] for your suggestions. Both sound very reasonable to me. If we 
decide on one particular approach, I am happy to put together a PR. Given the 
prevalence of RStudio as the IDE for day-to-day R users, I think this is a 
particularly beneficial workflow to streamline.

> ​SparkR - Not distributing SparkR module in YARN
> 
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.2, 1.5.3
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered 
> the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the 
> same jobs through R directly it does not work.
> I have managed to track down the cause: when SparkR is launched through R, 
> the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue with the setting 
> "spark.yarn.dist.archives", but it does not work because it deploys the 
> file/extracted folder with the ".zip" extension, while the workers are 
> actually looking for a folder named "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", 
> sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
> '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>   at java.net.ServerSocket.accept(ServerSocket.java:513)
>   at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la 
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
> -rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
> -rwx-- 1 yarn hadoop  736 Dec  9 09:03 
> default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
> .default_container_executor_session.sh.crc
> -rwx-- 1 yarn hadoop  790 Dec  9 09:03 default_container_executor.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 .default_container_executor.sh.crc
> -rwxr-xr-x 1 yarn hadoop  61K Dec  9 09:04 hadoop-lzo-0.6.0.2.3.2.0-2950.jar
> -rwxr-xr-x 1 yarn hadoop 317K Dec  9 09:04 kafka-clients-0.8.2.2.jar
> -rwx-- 1 yarn hadoop 6.0K Dec  9 09:03 launch_container.sh
> -rw-r--r-- 1 yarn hadoop   56 Dec  9 09:03 .launch_container.sh.crc
> -rwxr-xr-x 1 yarn hadoop 2.2M Dec  9 09:04 
> spark-cassandra-connector_2.10-1.5.0-M3.jar
> -rwxr-xr-x 1 yarn hadoop 7.1M Dec  9 09:04 spark-csv-assembly-1.3.0.jar
> lrwxrwxrwx 1 yarn hadoop  119 Dec  9 09:03 __spark__.jar -> 
> /hadoop/hdfs/disk03/hadoop/yarn/local/usercache/spark/filecache/361/spark-assembly-1.5.3-SNAPSHOT-hadoop2.7.1.jar
> lrwxrwxrwx 1 yarn hadoop   84 Dec  9 09:03 sparkr.zip -> 
> /hadoop/hdfs/disk01/hadoop/yarn/local/usercache/spark/filecache/359/sparkr.zip
> -rwxr-xr-x 1 yarn hadoop 1.8M Dec  9 09:04 
> spark-streaming_2.10-1.5.3-SNAPSHOT.jar
> -rwxr-xr-x 1 yarn hadoop  11M Dec  9 09:04 
> 

[jira] [Commented] (SPARK-12239) ​SparkR - Not distributing SparkR module in YARN

2015-12-26 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072029#comment-15072029
 ] 

Sen Fang commented on SPARK-12239:
--

I recently got hit by this issue as well, and I think we can make some 
improvements to make this more robust. It came up exactly when we attempted to 
use SparkR in RStudio/RStudio Server.

[~sunrui] The workaround you suggested absolutely works, but I don't see it on 
the page you cited. Has that page been changed?

[~shivaram] I believe the problem is that {{sparkR.init}} always launches 
spark-submit in local mode first, e.g.:
{code}
Launching java with spark-submit command spark-submit   sparkr-shell 
/var/folders/yw/mfqkln8172l93g6yfnt2k2zwgp/T//RtmpCwXLoF/backend_port432252948808
 
{code}
It currently kind of works for YARN, as documented for {{sparkR.init}}:
{code}
sc <- sparkR.init("yarn-client", "SparkR", "/home/spark", ...
{code}
only because the actual SparkContext is initialized by this call:
https://github.com/apache/spark/blob/v1.6.0-rc4/R/pkg/R/sparkR.R#L212
However, this happens after the deploy step you cited above, where Spark 
determines the cluster manager from the spark-submit arguments:
https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L228
Therefore you end up with a functional YARN-based SparkContext, but without 
sparkr.zip distributed.

Shouldn't the fix be as easy as appending the actual master to the spark-submit 
command we build inside R? Or am I missing something?



> ​SparkR - Not distributing SparkR module in YARN
> 
>
> Key: SPARK-12239
> URL: https://issues.apache.org/jira/browse/SPARK-12239
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.2, 1.5.3
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Critical
>
> Hello,
> I am trying to use SparkR in a YARN environment and I have encountered 
> the following problem:
> Everything works correctly when using bin/sparkR, but if I try running the 
> same jobs through R directly it does not work.
> I have managed to track down the cause: when SparkR is launched through R, 
> the "SparkR" module is not distributed to the worker nodes.
> I have tried working around this issue with the setting 
> "spark.yarn.dist.archives", but it does not work because it deploys the 
> file/extracted folder with the ".zip" extension, while the workers are 
> actually looking for a folder named "sparkr".
> Is there currently any way to make this work?
> {code}
> # spark-defaults.conf
> spark.yarn.dist.archives /opt/apps/spark/R/lib/sparkr.zip
> # R
> library(SparkR, lib.loc="/opt/apps/spark/R/lib/")
> sc <- sparkR.init(appName="SparkR", master="yarn-client", 
> sparkEnvir=list(spark.executor.instances="1"))
> sqlContext <- sparkRSQL.init(sc)
> df <- createDataFrame(sqlContext, faithful)
> head(df)
> 15/12/09 09:04:24 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 
> fr-s-cour-wrk3.alidaho.com): java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
> {code}
> Container stderr:
> {code}
> 15/12/09 09:04:14 INFO storage.MemoryStore: Block broadcast_1 stored as 
> values in memory (estimated size 8.7 KB, free 530.0 MB)
> 15/12/09 09:04:14 INFO r.BufferedStreamThread: Fatal error: cannot open file 
> '/hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 15/12/09 09:04:24 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>   at java.net.PlainSocketImpl.socketAccept(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>   at java.net.ServerSocket.accept(ServerSocket.java:513)
>   at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
> {code}
> Worker node that ran the container:
> {code}
> # ls -la 
> /hadoop/hdfs/disk02/hadoop/yarn/local/usercache/spark/appcache/application_1445706872927_1168/container_e44_1445706872927_1168_01_02
> total 71M
> drwx--x--- 3 yarn hadoop 4.0K Dec  9 09:04 .
> drwx--x--- 7 yarn hadoop 4.0K Dec  9 09:04 ..
> -rw-r--r-- 1 yarn hadoop  110 Dec  9 09:03 container_tokens
> -rw-r--r-- 1 yarn hadoop   12 Dec  9 09:03 .container_tokens.crc
> -rwx-- 1 yarn hadoop  736 Dec  9 09:03 
> default_container_executor_session.sh
> -rw-r--r-- 1 yarn hadoop   16 Dec  9 09:03 
> 

[jira] [Created] (SPARK-12526) `ifelse`, `when`, `otherwise` unable to take Column as value

2015-12-25 Thread Sen Fang (JIRA)
Sen Fang created SPARK-12526:


 Summary: `ifelse`, `when`, `otherwise` unable to take Column as 
value
 Key: SPARK-12526
 URL: https://issues.apache.org/jira/browse/SPARK-12526
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.2, 1.6.0
Reporter: Sen Fang


When a Column is passed as the value to {{ifelse}}, {{when}}, or {{otherwise}}, 
it errors out with

{code}
attempt to replicate an object of type 'environment'
{code}

The problem lies in the use of the base R {{ifelse}} function, which is the 
vectorized version of the {{if ... else ...}} idiom; it is unable to replicate 
a Column's underlying jobj because it is an environment.

Considering {{callJMethod}} was never designed to be vectorized, the safe 
option is to replace {{ifelse}} with {{if ... else ...}} instead. Technically, 
however, this is inconsistent with base R's {{ifelse}}, which is meant to be 
vectorized.

I can send a PR for review first, and we can discuss further whether there is 
any scenario at all in which `ifelse`, `when`, `otherwise` would be used in a 
vectorized way.

A dummy example is:
{code}
ifelse(lit(1) == lit(1), lit(2), lit(3))
{code}
A concrete example might be:
{code}
ifelse(df$mpg > 0, df$mpg, 0)
{code}







[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-11-22 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021527#comment-15021527
 ] 

Sen Fang commented on SPARK-2336:
-

I finally took a crack at the hybrid spill tree for kNN, and the results so far 
appear promising. For anyone who is still interested, you can find it as a 
Spark package at: https://github.com/saurfang/spark-knn

The implementation is written against the ml API and scales well in terms of 
both the number of observations and the number of vector dimensions. The kNN 
itself is flexible, and the package comes with KNNClassifier and KNNRegression 
for (optionally weighted) classification and regression.

There are a few implementation details I am still trying to iron out. 
Otherwise, I look forward to benchmarking it against other implementations such 
as kNN-join, KD-tree, and LSH.

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.






[jira] [Created] (SPARK-11906) Speculation Tasks Cause ProgressBar UI Overflow

2015-11-21 Thread Sen Fang (JIRA)
Sen Fang created SPARK-11906:


 Summary: Speculation Tasks Cause ProgressBar UI Overflow
 Key: SPARK-11906
 URL: https://issues.apache.org/jira/browse/SPARK-11906
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Sen Fang
Priority: Trivial


When there are speculative tasks in a stage, started tasks + completed tasks 
can exceed the total number of tasks. This causes the started-tasks progress 
block to overflow to the next line; visually, the light blue progress block is 
no longer visible when this happens.

The fix should be as trivial as capping the number of started tasks at 
total - completed tasks.

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L322
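
A minimal sketch of that cap in the linked {{UIUtils.makeProgressBar}} code; the variable names here are illustrative rather than the actual ones in the file:
{code}
// Speculative copies of finished tasks can leave started + completed > total,
// so bound "started" before computing the width of the light blue block.
val boundedStarted = math.min(started, total - completed)
val startWidth = "width: %s%%".format((boundedStarted.toDouble / total) * 100)
{code}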








[jira] [Created] (SPARK-11263) lintr Throws Warnings on Commented Code in Documentation

2015-10-22 Thread Sen Fang (JIRA)
Sen Fang created SPARK-11263:


 Summary: lintr Throws Warnings on Commented Code in Documentation
 Key: SPARK-11263
 URL: https://issues.apache.org/jira/browse/SPARK-11263
 Project: Spark
  Issue Type: Task
  Components: SparkR
Reporter: Sen Fang
Priority: Minor


This comes from a discussion in https://github.com/apache/spark/pull/9205

Currently lintr throws many warnings around "style: Commented code should be 
removed."

For example
{code}
R/RDD.R:260:3: style: Commented code should be removed.
# unpersist(rdd) # rdd@@env$isCached == FALSE
  ^~~
R/RDD.R:283:3: style: Commented code should be removed.
# sc <- sparkR.init()
  ^~~
R/RDD.R:284:3: style: Commented code should be removed.
# setCheckpointDir(sc, "checkpoint")
  ^~
{code}

Some of them are legitimate warnings, but most are simply code examples for 
functions that are not part of the public API. For example:
{code}
# @examples
#\dontrun{
# sc <- sparkR.init()
# rdd <- parallelize(sc, 1:10, 2L)
# cache(rdd)
#}
{code}


One workaround is to convert them back to Roxygen docs but assign {{#' @rdname 
.ignore}}; Roxygen will then skip these functions with the message {{Skipping 
invalid path: .ignore.Rd}}.

That being said, I feel people usually praise/criticize R package documentation 
as being "expert friendly". The convention seems to be to provide as much 
documentation as possible but not to export functions that are unstable or 
developer-only. If users choose to use them anyway, they acknowledge the risk 
by using {{:::}}.






[jira] [Created] (SPARK-11244) sparkR.stop doesn't clean up .sparkRSQLsc in environment

2015-10-21 Thread Sen Fang (JIRA)
Sen Fang created SPARK-11244:


 Summary: sparkR.stop doesn't clean up .sparkRSQLsc in environment
 Key: SPARK-11244
 URL: https://issues.apache.org/jira/browse/SPARK-11244
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Sen Fang


Currently {{sparkR.stop}} removes the relevant variables for the SparkContext 
and the backend from {{.sparkREnv}}. However, it doesn't clean up 
{{.sparkRSQLsc}} and {{.sparkRHivesc}}.

This results in
{code}
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
{code}
producing
{code}
 sqlContext
Error in callJMethod(x, "getClass") : 
  Invalid jobj 1. If SparkR was restarted, Spark operations need to be 
re-executed.
{code}






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-09-29 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936268#comment-14936268
 ] 

Sen Fang commented on SPARK-:
-

Another idea is to do something similar to the F# TypeProvider approach: 
http://fsharp.github.io/FSharp.Data/
I haven't looked into this extensively just yet, but as far as I understand it 
uses compile-time macros to generate classes based on data sources. In that 
sense, it is somewhat similar to protobuf, where you generate Java classes from 
a schema definition. This makes the DataFrame type-safe at the very upstream. 
With a bit of IDE plugin support, you would even be able to get autocomplete 
and type checking as you write code, which would be very nice. I'm not sure 
whether it would scale to propagate this type information downstream (to 
aggregated or transformed DataFrames), though. As I understand it, macros and 
type providers in Scala offer similar capabilities.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> As a Spark user, I want an API that sits somewhere in the middle of the 
> spectrum so I can write most of my applications with that API, and yet it can 
> be optimized well by Spark to achieve performance and stability.






[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException

2015-09-28 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933233#comment-14933233
 ] 

Sen Fang commented on SPARK-8939:
-

FYI, the PR has been merged into branch-1.5 of the spark-ec2 repo: 
https://github.com/amplab/spark-ec2/pull/10

> YARN EC2 default setting fails with IllegalArgumentException
> 
>
> Key: SPARK-8939
> URL: https://issues.apache.org/jira/browse/SPARK-8939
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>
> I just set it up from scratch using the spark-ec2 script. Then I ran
> {code}
> bin/spark-shell --master yarn
> {code}
> which failed with
> {code}
> 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Unknown/unsupported param 
> List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, 
> --executor-cores, 2, --name, Spark shell)
> {code}
> This goes away if I provide `--num-executors`, but we should fix the default.






[jira] [Commented] (SPARK-8939) YARN EC2 default setting fails with IllegalArgumentException

2015-09-14 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744792#comment-14744792
 ] 

Sen Fang commented on SPARK-8939:
-

The issue seems to lie in
https://github.com/amplab/spark-ec2/blob/branch-1.5/templates/root/spark/conf/spark-env.sh#L7
where {{spark_worker_instances}} is set to an empty string here:
https://github.com/apache/spark/blob/branch-1.5/ec2/spark_ec2.py#L810
This results in the following replacement:
{code}
if [ -n  ]; then
  export SPARK_WORKER_INSTANCES=
fi
{code}
in spark-env.sh. Bash evaluates this as true because, with the argument 
missing, {{[ -n ]}} degenerates into a one-argument test that only checks 
whether the literal string "-n" is non-empty. The simple fix is to quote the 
variable in the if statement.

However, I'm a bit confused:
https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/SparkConf.scala#L481
suggests this environment variable has been deprecated in favor of 
`SPARK_EXECUTOR_INSTANCES`, yet `--worker-instances` is referenced in a lot of 
places, so I'm not sure whether we want to remove/deprecate it.

I'm submitting a PR for now, since I have tested my fix for YARN and also 
verified that it doesn't break regular standalone mode.

> YARN EC2 default setting fails with IllegalArgumentException
> 
>
> Key: SPARK-8939
> URL: https://issues.apache.org/jira/browse/SPARK-8939
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Andrew Or
>
> I just set it up from scratch using the spark-ec2 script. Then I ran
> {code}
> bin/spark-shell --master yarn
> {code}
> which failed with
> {code}
> 15/07/09 03:44:29 ERROR SparkContext: Error initializing SparkContext.
> java.lang.IllegalArgumentException: Unknown/unsupported param 
> List(--num-executors, , --executor-memory, 6154m, --executor-memory, 6154m, 
> --executor-cores, 2, --name, Spark shell)
> {code}
> This goes away if I provide `--num-executors`, but we should fix the default.






[jira] [Created] (SPARK-10543) Peak Execution Memory Quantile should be Pre-task Basis

2015-09-10 Thread Sen Fang (JIRA)
Sen Fang created SPARK-10543:


 Summary: Peak Execution Memory Quantile should be Pre-task Basis
 Key: SPARK-10543
 URL: https://issues.apache.org/jira/browse/SPARK-10543
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Sen Fang
Priority: Minor


Currently the Peak Execution Memory quantiles appear to be cumulative rather 
than per-task. For example, I have seen a value of 2 TB for the quantile metric 
in one of my jobs, while each individual task shows less than 1 GB in the 
bottom table.

[~andrewor14] In your PR https://github.com/apache/spark/pull/7770, the 
screenshot shows a max Peak Execution Memory of 792.5 KB, while the bottom 
table shows about 50 KB per task (unless your workload is skewed).

The fix seems straightforward: use the `update` rather than the `value` from 
the accumulable. I'm happy to provide a PR if people agree this is the right 
behavior.
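
A rough sketch of the intended change, assuming the 1.5-era accumulable bookkeeping where {{update}} holds the per-task delta and {{value}} the running total; the surrounding variable names are illustrative:
{code}
// Read each task's peak from the accumulable's update instead of its cumulative
// value before feeding the numbers into the quantile computation.
val peakMemoryPerTask: Seq[Long] = taskInfos.flatMap { info =>
  info.accumulables
    .find(_.name == InternalAccumulator.PEAK_EXECUTION_MEMORY)
    .flatMap(_.update)   // per-task update, not the running total
    .map(_.toLong)
}
{code}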






[jira] [Updated] (SPARK-10543) Peak Execution Memory Quantile should be Per-task Basis

2015-09-10 Thread Sen Fang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sen Fang updated SPARK-10543:
-
Summary: Peak Execution Memory Quantile should be Per-task Basis  (was: 
Peak Execution Memory Quantile should be Pre-task Basis)

> Peak Execution Memory Quantile should be Per-task Basis
> ---
>
> Key: SPARK-10543
> URL: https://issues.apache.org/jira/browse/SPARK-10543
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Sen Fang
>Priority: Minor
>
> Currently the Peak Execution Memory quantiles appear to be cumulative rather 
> than per-task. For example, I have seen a value of 2 TB for the quantile 
> metric in one of my jobs, while each individual task shows less than 1 GB in 
> the bottom table.
> [~andrewor14] In your PR https://github.com/apache/spark/pull/7770, the 
> screenshot shows a max Peak Execution Memory of 792.5 KB, while the bottom 
> table shows about 50 KB per task (unless your workload is skewed).
> The fix seems straightforward: use the `update` rather than the `value` from 
> the accumulable. I'm happy to provide a PR if people agree this is the right 
> behavior.






[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-19 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703626#comment-14703626
 ] 

Sen Fang commented on SPARK-3533:
-

I also have a working implementation based on Hadoop 1 + 
MultipleTextOutputFormat in Scala. [~silasdavis], let me know if you have time 
to put up a PR. If not, I can get one together this weekend.
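
For reference, a minimal sketch of the Hadoop 1 + MultipleTextOutputFormat approach mentioned above; the class name and output path are just illustrative:
{code}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record into a subdirectory named after its key, e.g. /path/prefix/N/part-00000
class KeyBasedTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  // Drop the key from the written lines so only the value is emitted
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// Usage on a pair RDD of (key, line):
// rdd.saveAsHadoopFile("/path/prefix", classOf[Any], classOf[Any], classOf[KeyBasedTextOutputFormat])
{code}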

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  'Frankie']).keyBy(lambda x: x[0])
  a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.






[jira] [Created] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits

2015-06-18 Thread Sen Fang (JIRA)
Sen Fang created SPARK-8443:
---

 Summary: GenerateMutableProjection Exceeds JVM Code Size Limits
 Key: SPARK-8443
 URL: https://issues.apache.org/jira/browse/SPARK-8443
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Sen Fang


GenerateMutableProjection puts all expression columns into a single apply 
function. When there are a lot of columns, the apply function's code size 
exceeds the 64 KB limit, which is a hard JVM limit that cannot be changed.

This came up when we were aggregating about 100 columns with the codegen and 
unsafe features enabled.

I wrote a unit test that reproduces this issue: 
https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

The test currently fails at 2048 expressions. Master seems more tolerant than 
branch-1.4 here because its generated code is more concise.

Although the code on master has changed since branch-1.4, I am able to 
reproduce the problem on master as well. For now I have hacked around it in 
branch-1.4 by wrapping each expression in a separate function and calling those 
functions sequentially from apply. The proper fix is probably to check the 
length of the projection code and break it up as necessary. (This actually 
seems easier on master, since the code is built from strings rather than 
quasiquotes.)
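
A rough sketch of the splitting idea, assuming we already have one generated code snippet per expression (as the string-based codegen on master does); all names here are illustrative:
{code}
// Instead of concatenating every expression's snippet into one giant apply(),
// emit one small helper method per expression and have apply() call them in
// order, keeping each generated method safely under the 64 KB bytecode limit.
val helperMethods = expressionCodes.zipWithIndex.map { case (code, i) =>
  s"""
     |private void apply_$i(InternalRow i) {
     |  $code
     |}
   """.stripMargin
}.mkString("\n")

val applyBody = expressionCodes.indices.map(i => s"apply_$i(i);").mkString("\n")
{code}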

Let me know if anyone has additional thoughts on this, I'm happy to contribute 
a pull request.






[jira] [Updated] (SPARK-8443) GenerateMutableProjection Exceeds JVM Code Size Limits

2015-06-18 Thread Sen Fang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sen Fang updated SPARK-8443:

Description: 
GenerateMutableProjection puts all expression columns into a single apply 
function. When there are a lot of columns, the apply function's code size 
exceeds the 64 KB limit, which is a hard JVM limit that cannot be changed.

This came up when we were aggregating about 100 columns with the codegen and 
unsafe features enabled.

I wrote a unit test that reproduces this issue: 
https://github.com/saurfang/spark/blob/codegen_size_limit/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

The test currently fails at 2048 expressions. Master seems more tolerant than 
branch-1.4 here because its generated code is more concise.

Although the code on master has changed since branch-1.4, I am able to 
reproduce the problem on master as well. For now I have hacked around it in 
branch-1.4 by wrapping each expression in a separate function and calling those 
functions sequentially from apply. The proper fix is probably to check the 
length of the projection code and break it up as necessary. (This actually 
seems easier on master, since the code is built from strings rather than 
quasiquotes.)

Let me know if anyone has additional thoughts on this, I'm happy to contribute 
a pull request.

Attaching the stack trace produced by the unit test:
{code}
[info] - code size limit *** FAILED *** (7 seconds, 103 milliseconds)
[info]   com.google.common.util.concurrent.UncheckedExecutionException: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
(Ljava/lang/Object;)Ljava/lang/Object; of class SC$SpecificProjection grows 
beyond 64 KB
[info]   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263)
[info]   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
[info]   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[info]   at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[info]   at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:285)
[info]   at 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:50)
[info]   at 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2$$anonfun$apply$mcV$sp$2.apply(CodeGenerationSuite.scala:48)
[info]   at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
[info]   at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
[info]   at scala.collection.immutable.Range.foreach(Range.scala:141)
[info]   at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
[info]   at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:105)
[info]   at 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply$mcV$sp(CodeGenerationSuite.scala:47)
[info]   at 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47)
[info]   at 
org.apache.spark.sql.catalyst.expressions.CodeGenerationSuite$$anonfun$2.apply(CodeGenerationSuite.scala:47)
[info]   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)

[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-05-03 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526153#comment-14526153
 ] 

Sen Fang commented on SPARK-2336:
-

Hey Longbao, great to hear from you. To the best of my understanding, in the 
paper I cited above they distribute the input points by pushing them through 
the top tree (figure 4). Of course, for an exact result this means we would 
need to backtrack, which isn't ideal. What they propose instead is to use a 
buffer boundary, as in a spill tree. Unlike a spill tree, though, you push an 
input point to both children whenever it falls within the buffer zone, because 
the top tree is built as a metric tree (they explain that using a spill tree as 
the top tree carries a high memory penalty). So every input point may end up in 
multiple subtrees, and you need a reduceByKey at the end to keep only the top k 
neighbors, as sketched below.
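
A rough sketch of that query routing, with hypothetical helpers ({{topTree.leavesWithin}}, {{subTreesByLeaf}}, {{tree.query}}) standing in for the actual tree code:
{code}
// Send each query point to every leaf whose buffer zone contains it, search that
// leaf's spill tree locally, then merge the per-leaf candidates and keep the k best.
val neighbors = queries
  .flatMap(q => topTree.leavesWithin(q, tau).map(leaf => (leaf, q)))   // route through the top tree
  .join(subTreesByLeaf)                                                // (leafId, (query, spillTree))
  .map { case (_, (q, tree)) => (q.id, tree.query(q, k)) }             // local search per leaf
  .reduceByKey((a, b) => (a ++ b).sortBy(_.distance).take(k))          // keep top k per query
{code}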

Is your implementation available somewhere? I'm having a hard time finding time 
to finish mine this month. It would be great if we could eventually compare our 
implementations, validate them, and benchmark.

 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.






[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2015-04-05 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14396306#comment-14396306
 ] 

Sen Fang commented on SPARK-2336:
-

I'm tentatively going to give the hybrid spill tree implementation described in 
the OP's post a try. Specifically, I'm going to follow the implementation 
described in http://dx.doi.org/10.1109/WACV.2007.18

According to the paper, the algorithm scales well in the number of 
observations, bounded by the available memory across the cluster (billions of 
points in the paper's example). The original hybrid spill tree paper claims the 
algorithm scales much better than alternatives (LSH, metric tree) in the number 
of features (up to hundreds). Furthermore, random projection is often used to 
reduce the dimensionality of the feature space; it is out of scope for my 
implementation and should probably be implemented as a separate dimensionality 
reduction technique. (In the paper, 4000 features are projected down to 100.)

The average runtime for training and prediction is generally O(log n). Training 
runtime suffers if we turn up the buffer size. The search is only approximate 
in overlapping nodes, and the larger the buffer size the more accurate the 
search. Prediction runtime suffers when a backtracking search is needed in a 
non-overlapping node (which becomes more likely as the balance threshold 
increases).

More specifically, a metric tree promises accurate but less performant search, 
while a spill tree creates overlapping nodes and uses the more efficient 
defeatist search, whose accuracy depends on the buffer size *tau* (the larger 
it is, the more accurate the search but the deeper the tree). A hybrid spill 
tree is constructed with both overlapping (spill tree) and non-overlapping 
(metric tree) nodes and uses a balance threshold *rho* to balance tree depth 
against search efficiency.


A high-level overview of the algorithm is as follows (a rough sketch follows 
below):
1. Sample M data points (M = number of partitions)
2. Build the top-level metric tree
3. Repartition the RDD by assigning each point to a leaf node of the above tree
4. Build a hybrid spill tree in each partition
This concludes the training phase of kNN.

Prediction is achieved by identifying the subtree a query point falls into and 
then running the search on that subtree.
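
A rough sketch of the training phase, where {{MetricTree.build}}, {{topTree.leafIndex}}, and {{HybridSpillTree.build}} are hypothetical helpers for the tree construction described above:
{code}
import org.apache.spark.HashPartitioner

// Steps 1-2: sample one point per target partition, build the top-level metric tree on the driver
val sample = points.takeSample(withReplacement = false, num = numPartitions)
val topTree = MetricTree.build(sample)

// Step 3: route every point to its top-tree leaf so points of the same leaf land in the same partition
val byLeaf = points
  .map(p => (topTree.leafIndex(p), p))
  .partitionBy(new HashPartitioner(numPartitions))

// Step 4: build one hybrid spill tree per partition (tau = buffer size, rho = balance threshold)
val subTrees = byLeaf
  .mapPartitions(it => Iterator(HybridSpillTree.build(it.map(_._2).toIndexedSeq, tau, rho)))
  .cache()
{code}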


Let me know if anyone has any thoughts, concerns or corrections on the things I 
said above.


 Approximate k-NN Models for MLLib
 -

 Key: SPARK-2336
 URL: https://issues.apache.org/jira/browse/SPARK-2336
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Brian Gawalt
Priority: Minor
  Labels: clustering, features

 After tackling the general k-Nearest Neighbor model as per 
 https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
 also offer approximate k-Nearest Neighbor. A promising approach would involve 
 building a kd-tree variant within each partition, a la
 http://www.autonlab.org/autonweb/14714.html?branch=1language=2
 This could offer a simple non-linear ML model that can label new data with 
 much lower latency than the plain-vanilla kNN versions.


