In the end, I switched back to the LSH implementation that I used before (
https://github.com/karlhigley/spark-neighbors ). I can run it on my dataset
now. If anyone has any suggestions, please let me know.
Thanks.
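
For anyone following along, here is a minimal sketch of how spark-neighbors is typically wired up, based on its README; the dimensionality, table count, and signature length below are illustrative assumptions, not the settings from my job:

```scala
import com.github.karlhigley.spark.neighbors.ANN
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.rdd.RDD

// points: RDD of (id, feature vector) pairs, loaded elsewhere
def buildModel(points: RDD[(Long, SparseVector)]) = {
  val ann = new ANN(dimensions = 1000, measure = "cosine")
    .setTables(4)           // number of hash tables (assumed value)
    .setSignatureLength(64) // bits per signature (assumed value)
  ann.train(points)
}
```

The trained model then exposes approximate nearest-neighbor queries (e.g. the top-k neighbors per point) without the all-pairs join.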

2017-02-12 9:25 GMT+07:00 nguyen duc Tuan <newvalu...@gmail.com>:

> Hi Timur,
> 1) Our data is already transformed into a dataset of Vectors.
> 2) If I use RandomSignProjectionLSH, the job dies after I call
> approxSimilarityJoin. I tried to use MinHash instead, but the job is still
> slow. I don't think the problem is related to GC: the time spent in GC is
> small compared with the time spent on computation. Here are some
> screenshots of my job.
> Thanks
>
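For reference, a minimal sketch of the Spark 2.1.0 MinHash path described above, assuming DataFrames dfA and dfB that each have an id column and a "features" vector column (the column names and the 0.6 distance threshold are illustrative assumptions):

```scala
import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(3)      // more tables: better recall, more memory
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Join rows of dfA and dfB whose Jaccard distance is below the threshold.
val pairs = model.approxSimilarityJoin(dfA, dfB, 0.6)
pairs.select("datasetA.id", "datasetB.id", "distCol").show()
```

Tightening the threshold and keeping the number of hash tables small both shrink the candidate-pair set the join has to materialize, which is usually where the time and memory go.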
> 2017-02-12 8:01 GMT+07:00 Timur Shenkao <t...@timshenkao.su>:
>
>> Hello,
>>
>> 1) Are you sure that your data is "clean"? No unexpected missing values?
>> No strings in unusual encodings? No additional or missing columns?
>> 2) How long does your job run? What about garbage collector parameters?
>> Have you checked what happens with jconsole / jvisualvm?
>>
>> Sincerely yours, Timur
>>
>> On Sat, Feb 11, 2017 at 12:52 AM, nguyen duc Tuan <newvalu...@gmail.com>
>> wrote:
>>
>>> Hi Nick,
>>> Because we use *RandomSignProjectionLSH*, the only LSH parameter is the
>>> number of hashes. I tried with a small number of hashes (2), but the
>>> error still happens, and it happens when I call the similarity join.
>>> After transformation, the size of the dataset is about 4 GB.
>>>
>>> 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>
>>>> What other params are you using for the lsh transformer?
>>>>
>>>> Are the issues occurring during transform or during the similarity join?
>>>>
>>>>
>>>> On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan <newvalu...@gmail.com>
>>>> wrote:
>>>>
>>>>> hi Das,
>>>>> In general, I will apply this to larger datasets, so I want to use
>>>>> LSH, which is more scalable than the approaches you suggested. Have
>>>>> you tried LSH in Spark 2.1.0 before? If yes, how did you set the
>>>>> parameters/configuration to make it work?
>>>>> Thanks.
>>>>>
>>>>> 2017-02-10 19:21 GMT+07:00 Debasish Das <debasish.da...@gmail.com>:
>>>>>
>>>>> If it is 7M rows and 700k features (or say 1M features), brute-force
>>>>> row similarity will run fine as well... check out SPARK-4823... you
>>>>> can compare quality with the approximate variant...
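
For context, SPARK-4823 tracks row similarity; the closely related primitive that already ships in Spark is RowMatrix.columnSimilarities (DIMSUM). A sketch, under the assumption that the items to compare are laid out as the columns of the matrix:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// rows: RDD[Vector]; each column of the resulting matrix is one item
def similarities(rows: RDD[Vector]) = {
  val mat = new RowMatrix(rows)

  // Exact all-pairs cosine similarities between columns.
  val exact = mat.columnSimilarities()

  // DIMSUM sampling: pairs with similarity below the threshold may be
  // dropped, trading accuracy for a large reduction in shuffle size.
  val approx = mat.columnSimilarities(threshold = 0.1)
  (exact, approx)
}
```

The thresholded variant is the "approximate variant" worth comparing against: the higher the threshold, the cheaper the job and the more low-similarity pairs are lost.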
>>>>> On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" <newvalu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>> Since Spark 2.1.0 introduces LSH (http://spark.apache.org/docs/
>>>>> latest/ml-features.html#locality-sensitive-hashing), we want to use
>>>>> LSH to find approximate nearest neighbors. Basically, we have a dataset
>>>>> with about 7M rows. We want to use cosine distance to measure the
>>>>> similarity between items, so we use *RandomSignProjectionLSH* (
>>>>> https://gist.github.com/tuan3w/c968e56ea8ef135096eeedb08af097db)
>>>>> instead of *BucketedRandomProjectionLSH*. I tried to tune some
>>>>> configurations such as serialization, memory fraction, executor memory
>>>>> (~6G), number of executors (~20), memory overhead..., but nothing works.
>>>>> I often get the error "java.lang.OutOfMemoryError: Java heap space"
>>>>> while running. I know this implementation was done by an engineer at
>>>>> Uber, but I don't know the right configuration to run the algorithm at
>>>>> scale. Does it need very large memory to run?
>>>>>
>>>>> Any help would be appreciated.
>>>>> Thanks
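
To make the tuning attempts above concrete, this is roughly what those settings look like on a SparkConf. The values simply mirror the ones mentioned (~6G executors, ~20 executors); they are illustrative, not a known-good configuration:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "6g")                // per-executor heap, as tried above
  .set("spark.executor.instances", "20")             // ~20 executors, as tried above
  .set("spark.memory.fraction", "0.6")               // heap share for execution + storage
  .set("spark.yarn.executor.memoryOverhead", "1024") // off-heap overhead in MB (assumed value)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```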
>>>>>
>>>>>
>>>>>
>>>
>>
>
