[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
Looks like the rebase is making it even worse. I will reopen a PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16965
  
**[Test build #73538 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73538/testReport)**
 for PR 16965 at commit 
[`0b46461`](https://github.com/apache/spark/commit/0b4646199cf061d1f358a78122ef8bdf164ac839).





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16965
  
GitHub isn't handling the merge well, so you might try rebasing.





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16965
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16965
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73512/
Test PASSed.





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16965
  
**[Test build #73512 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)**
 for PR 16965 at commit 
[`83a1556`](https://github.com/apache/spark/commit/83a155699df4b15f1ab1fc427730613b63f7d1d6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16965: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16965
  
**[Test build #73512 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73512/testReport)**
 for PR 16965 at commit 
[`83a1556`](https://github.com/apache/spark/commit/83a155699df4b15f1ab1fc427730613b63f7d1d6).





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-26 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
The number of rows would be O(LN). Memory usage will differ, though, because 
the size of each row changes when the rows are exploded. The Catalyst 
optimizer may also apply projections during the join, which can further change 
the size of each row.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/16965
  
@Yunni Thanks. By L, I mean the number of hash tables. 

In that case, the memory usage would be O(L*N), and the approximate NN search 
cost in one partition would be O(L*N'), where N is the size of the input 
dataset and N' is the number of data points in one partition. Right? 






[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang Not exactly. Each row will explode to L rows, where L is the 
number of hash tables. Like the following:
```
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[[-10.0,-10.0],Wr...|    0|[-2.0,-2.0,3.0,-2.0]|
|[[-10.0,-10.0],Wr...|    1|[0.0,-3.0,-1.0,-2.0]|
|[[-10.0,-9.0],Wra...|    0|[-2.0,-2.0,3.0,-2.0]|
|[[-10.0,-9.0],Wra...|    1|[0.0,-3.0,-1.0,-2.0]|
|[[-10.0,-8.0],Wra...|    0|[-2.0,-2.0,3.0,-1.0]|
|[[-10.0,-8.0],Wra...|    1| [0.0,-3.0,0.0,-2.0]|
```
You can look at the code here: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala#L238
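
To make the explode step concrete, here is a minimal plain-Python sketch (not the Spark code linked above). It assumes each input row carries one hash vector per hash table; `explode_hashes` and the tuple layout are hypothetical names chosen for illustration, and the sample values are copied from the table above.

```python
def explode_hashes(rows):
    """Emit one (point, entry, hash_value) record per hash table for each row.

    `rows` is a list of (point, hashes), where `hashes` holds one hash
    vector per hash table. N input rows with L hash tables each give
    N * L output records.
    """
    exploded = []
    for point, hashes in rows:
        for entry, hash_value in enumerate(hashes):
            exploded.append((point, entry, hash_value))
    return exploded


# Three points, L = 2 hash tables each (values taken from the table above).
rows = [
    ((-10.0, -10.0), [[-2.0, -2.0, 3.0, -2.0], [0.0, -3.0, -1.0, -2.0]]),
    ((-10.0, -9.0), [[-2.0, -2.0, 3.0, -2.0], [0.0, -3.0, -1.0, -2.0]]),
    ((-10.0, -8.0), [[-2.0, -2.0, 3.0, -1.0], [0.0, -3.0, 0.0, -2.0]]),
]

exploded = explode_hashes(rows)
for record in exploded:
    print(record)
```

With 3 input rows and 2 hash tables, the sketch yields 6 records, which is the O(LN) row count discussed in this thread.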





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/16965
  
@Yunni OK, if we want to move this along more quickly, we can keep the current 
AND-OR implementation.

(2)(3) You mention that you explode the inner table (dataset). Does that mean 
that for each tuple of the inner table (say t_i) and multiple hash functions 
(say h_0, h_1, ..., h_l), you create multiple rows like (h_0, t_i), (h_1, t_i), 
..., (h_l, t_i)? Am I correct? 





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang 
(1) `hashDistance` is only used for multi-probe NN search. The terms 
`numHashTables` and `numHashFunctions` are very hard to interpret in the OR-AND 
case.

(2) For similarity join, we actually explode first and then join. The 
join key would be of type Vector. 

(3) Yes. However, with OR-AND, getting rows from hashes requires intersecting 
large sets of rows, whereas with AND-OR we take the union of small sets of 
rows, which is more efficient.

I also suggest we limit the scope to the implementation of 
AND-amplification here. We can open other tickets to discuss memory issues, etc.
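
The explode-then-join idea in (2) and the union of small buckets in (3) can be sketched in plain Python as follows. This is only an illustration under assumed data shapes, not the `approxSimilarityJoin` implementation: `candidate_pairs`, the row layout, and the ids are all hypothetical.

```python
from collections import defaultdict


def candidate_pairs(rows_a, rows_b):
    """Return the (id_a, id_b) pairs that share a bucket in at least one hash table.

    Each row is (row_id, hashes), with one hash vector per hash table.
    Exploding gives one (entry, hash_value) key per table; joining on that
    key and taking the union over tables is the OR step of AND-OR.
    """
    # Index dataset B by (table index, hash vector).
    buckets = defaultdict(list)
    for id_b, hashes in rows_b:
        for entry, hash_value in enumerate(hashes):
            buckets[(entry, tuple(hash_value))].append(id_b)

    # Probe with the exploded rows of dataset A and union the matches.
    pairs = set()
    for id_a, hashes in rows_a:
        for entry, hash_value in enumerate(hashes):
            for id_b in buckets.get((entry, tuple(hash_value)), []):
                pairs.add((id_a, id_b))
    return pairs


rows_a = [
    ("a0", [[-2.0, -2.0], [0.0, -3.0]]),
    ("a1", [[5.0, 5.0], [7.0, 7.0]]),
]
rows_b = [
    ("b0", [[-2.0, -2.0], [9.0, 9.0]]),  # matches a0 in hash table 0 only
    ("b1", [[1.0, 1.0], [2.0, 2.0]]),    # matches nothing
]

print(candidate_pairs(rows_a, rows_b))
```

A pair becomes a candidate as soon as one table matches on the whole hash vector (the AND over the functions inside that table), so each lookup touches only one small bucket.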





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-24 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/16965
  
@Yunni Yes, we can use AND-OR to increase the probability by increasing 
numHashTables and numHashFunctions. As a future extension for users, OR-AND 
could be used when a user has a hash function with a lower probability. 

(1) We do not need to change Array[Vector], numHashTables, or 
numHashFunctions; we need to change the function that computes the hash 
distance (i.e., hashDistance), as well as the sameBucket function in 
approxNearestNeighbors.

(3) For the similarity join, I have one question: if you join on the hashed 
values of the input tuples, the join key would be array(vector). Am I right? 
If so, does this satisfy OR-amplification? Please correct me if I am wrong. 

(4) For the index part, I think it would work. It is quite similar to the 
routing table idea in GraphX. So I think we can create another DataFrame with 
the same partitioner as the input DataFrame; the newly created DataFrame would 
then contain the index for the input tables without disturbing the input 
DataFrame. 

(5) The other major concern is memory overhead. Can we reduce the memory usage 
of the output hash values, i.e., array(vector)? Users have reported that the 
current approach uses a lot of memory. One option is to use bits to represent 
the hashed values for min-hash; another is to use sparse vectors. What do you 
think? 





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang Sorry, I still don't quite see why we need to support OR-AND 
when the effective threshold is low. My understanding is that we can always 
tune numHashTables and numHashFunctions for AND-OR to make the probability as 
good as with OR-AND. Please correct me if I am wrong.

My concerns about supporting OR-AND are the following:
(1) We would probably need some backward-incompatible API changes. 
`Array[Vector]`, numHashTables, and numHashFunctions seem to make less sense 
for OR-AND.
(2) To avoid a broadcast join, we would need a very different and complicated 
mechanism for the join step in approxSimilarityJoin for OR-AND.
(3) I am thinking about building an index to improve performance for nearest 
neighbor search 
(https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit).
 Supporting OR-AND would make the index less efficient when we fetch records 
given hash buckets.

@jkbradley @sethah @MLnick Any thoughts?





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/16965
  
@Yunni I agree that the current NN search and join use AND-OR. We can discuss 
how to use OR-AND for those two searches as well. 

The OR-AND option is used when the effective threshold is low. 
Please refer to the tables on pages 31 and 33 of 
http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf

Note that when p is low, OR-AND can amplify the hash family 
probability from 0.0985 to 0.5440. 
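
For readers who want to reproduce this kind of comparison, the two standard amplification formulas can be evaluated directly. The sketch below is plain Python, and the parameters r = b = 4 are an assumption about the construction behind the cited tables, not something stated in this thread; the function and parameter names are illustrative only.

```python
def and_or(p, r, b):
    """AND over r hash functions inside a table, then OR over b tables."""
    return 1.0 - (1.0 - p ** r) ** b


def or_and(p, b, r):
    """OR over b hash functions inside a group, then AND over r groups."""
    return (1.0 - (1.0 - p) ** b) ** r


# Collision probability after amplification for a range of base probabilities p.
for p in (0.2, 0.4, 0.6, 0.8):
    print(f"p={p:.1f}  AND-OR={and_or(p, 4, 4):.4f}  OR-AND={or_and(p, 4, 4):.4f}")
```

Under these assumed parameters, p = 0.4 gives roughly 0.0985 for AND-OR and roughly 0.5740 for OR-AND, which illustrates how OR-AND lifts low base probabilities more aggressively.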





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-23 Thread Yunni
Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/16965
  
@merlintang We use AND-OR in both approxNearestNeighbors and 
approxSimilarityJoin, and it's more difficult for approxSimilarityJoin to adopt 
OR-AND than AND-OR.

My understanding: for a (d1, d2, p1, p2)-sensitive hash family, AND-OR 
can increase p1 and decrease p2 just like OR-AND does. What are the use cases 
for OR-AND rather than AND-OR?
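
As a quick numeric check of the claim that AND-OR can raise p1 while lowering p2, here is a small plain-Python sketch. The values p1 = 0.8, p2 = 0.2 and the choice of 4 hash functions and 4 hash tables are illustrative assumptions, and mapping the formula's exponents to `numHashFunctions` and `numHashTables` is my reading of the parameters discussed in this thread.

```python
def and_or(p, num_hash_functions, num_hash_tables):
    """Probability that two points collide in at least one table (OR),
    where a table collision needs every hash function in it to agree (AND)."""
    return 1.0 - (1.0 - p ** num_hash_functions) ** num_hash_tables


p1, p2 = 0.8, 0.2  # assumed base collision probabilities for near and far pairs

amplified_p1 = and_or(p1, 4, 4)
amplified_p2 = and_or(p2, 4, 4)

print(f"p1: {p1} -> {amplified_p1:.4f}")
print(f"p2: {p2} -> {amplified_p2:.4f}")
```

With these assumed inputs the near-pair probability moves up (to about 0.88) and the far-pair probability moves down (to under 0.01), so the gap between p1 and p2 widens.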





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-22 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/16965
  
It seems this patch provides AND-OR amplification. Can we also give users the 
option to choose OR-AND amplification? 





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/16965
  
cc @sethah @jkbradley 





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16965
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73016/
Test PASSed.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16965
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16965
  
**[Test build #73016 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)**
 for PR 16965 at commit 
[`e6f9f95`](https://github.com/apache/spark/commit/e6f9f9541f0b00c14b7c5a201b22aeb400eb9f19).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16965: [Spark-18450][ML] Scala API Change for LSH AND-amplifica...

2017-02-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16965
  
**[Test build #73016 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73016/testReport)**
 for PR 16965 at commit 
[`e6f9f95`](https://github.com/apache/spark/commit/e6f9f9541f0b00c14b7c5a201b22aeb400eb9f19).

