Dmitriy,
1) Thank you, I'll try 0.8 instead of 0.7.
2) Regarding my problem and seq2sparse: we do not perform text analysis. We
perform user click analysis. My documents are internet users and my terms are
the URLs they clicked. The input data contains "user_id<>url1, url2, url1,
urlN etc." vectors.
It is really easy to convert these vectors to sparse TF-IDF vectors using
seq2sparse. The frequency of URLs follows a power law; that is why we use
seq2sparse with TF-IDF weighting. My goals are:
- to recommend new URLs to a user
- to reduce the User<>URL dimension for both user (U) and URL (V) analysis
(clustering, classification, etc.)
- to find the similarity between a user and a URL ( dot_product{Ua[i], Va[j]} )
Is SVD a suitable solution for this problem?
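A minimal sketch of the third goal, assuming rank-k factor rows Ua[i] and Va[j]
are already available. The factor values, the ids and the helper `rank_urls`
are made up for illustration; real factors would come from whichever
factorization ends up being used.

```python
# Hypothetical rank-3 factors (illustrative numbers only, not real output).
Ua = {"user_1": [0.9, 0.1, 0.0], "user_2": [0.0, 0.2, 0.8]}
Va = {"url_a": [1.0, 0.0, 0.0], "url_b": [0.0, 0.1, 0.9]}


def dot_product(u, v):
    """Similarity between one user row and one URL row."""
    return sum(x * y for x, y in zip(u, v))


def rank_urls(user_id, seen=()):
    """Return URLs the user has not seen yet, ordered by descending similarity."""
    scores = {
        url: dot_product(Ua[user_id], vec)
        for url, vec in Va.items()
        if url not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank_urls("user_2", seen={"url_b"})` would then score only the URLs
that user has not clicked, which is the "recommend new URLs" case.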
3) I can apply SSVD to a sample (0.1% of my data), but it fails on 100% of the
data: the Bt-job stops in the Map phase with "Java heap space" or "timeout"
errors.
The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03% non-zero
values (8 GB total).
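For scale, a quick back-of-the-envelope check of those figures. It assumes
"8 GB" means 8 * 10^9 bytes and takes the ~0.03% density at face value, so the
results are rough estimates rather than measurements.

```python
rows = 20_000_000          # users
cols = 150_000             # distinct URLs
# ~0.03% non-zero entries, kept in integer arithmetic to avoid float noise
nnz = rows * cols * 3 // 10_000

total_bytes = 8 * 10**9    # assumption: decimal gigabytes

avg_urls_per_user = nnz / rows         # average non-zeros per row
bytes_per_nnz = total_bytes / nnz      # on-disk cost per stored value

print(f"non-zeros:          {nnz:,}")
print(f"avg URLs per user:  {avg_urls_per_user:.0f}")
print(f"bytes per non-zero: {bytes_per_nnz:.1f}")
```

That works out to roughly 900 million stored values, about 45 distinct URLs per
user on average.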
How I use it:
====================
mahout-distribution-0.7/bin/mahout ssvd \
-i /tmp/pabramov/sparse/tfidf-vectors/ \
-o /tmp/pabramov/ssvd \
-k 200 \
-q 1 \
--reduceTasks 150 \
--tempDir /tmp/pabramov/tmp \
-Dmapred.max.split.size=1000000 \
-ow
====================
I can't get past the Bt-job... Should I decrease split.size and/or add extra params?
The Hadoop cluster has 400 map and 300 reduce slots, with 1 CPU core and 2 GB
RAM per task.
Q-job completes in 20 minutes.
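A rough sketch of what that split size implies for the number of map tasks.
It assumes the whole 8 GB input (taken as 8 * 10^9 bytes) is split at exactly
mapred.max.split.size bytes and that all 400 map slots are free; actual split
boundaries and scheduling will differ.

```python
input_bytes = 8 * 10**9        # assumption: "8GB total" in decimal bytes
max_split_bytes = 1_000_000    # -Dmapred.max.split.size=1000000 (bytes)
map_slots = 400


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


splits = ceil_div(input_bytes, max_split_bytes)   # approx. number of map tasks
waves = ceil_div(splits, map_slots)               # rounds needed to run them all

print(f"map tasks: ~{splits}, waves over {map_slots} slots: ~{waves}")
```

Under these assumptions each mapper sees only about 1 MB of input, so the
number of tasks, not the per-task input size, is what this setting mainly
changes.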
Many thanks in advance!
Pavel
________________________________________
From: Dmitriy Lyubimov [[email protected]]
Sent: November 15, 2012, 21:53
To: [email protected]
Subject: Re: SSVD fails on seq2sparse output.
On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <[email protected]> wrote:
>
> Many thanks in advance, any suggestion is highly appreciated. I don't know
> what to do; CF produces inaccurate results for my tasks, and SVD is the only
> hope ))
>
I am also doubtful about that (if you are trying to factorize your
recommendation space). SVD has proven to be notoriously inadequate for that
problem. ALS-WR would be a much better first stab.
However, since you seem to be performing text analysis (seq2sparse), I don't
immediately see how it is related to collaborative filtering. Perhaps if
you told us more about your problem, people on this list could advise you on
the best course of action.
> Regards,
> Pavel