On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel <[email protected]> wrote:
> Dmitriy,
>
> 1) Thank you, I'll try 0.8 instead of 0.7.
>
> 2) Regarding my problem and seq2sparse. We do not perform text analysis.
> We perform user click analysis. My documents are internet users and my
> terms are the urls clicked. The input data contains "user_id<>url1, url2,
> url1, urlN etc" vectors. It is really easy to convert these vectors to
> sparse TFIDF vectors using seq2sparse. The frequency of URLs follows a
> power law. That's why we use seq2sparse with TFIDF weighting. My goals are:
> - to recommend new urls to a user
> - to reduce the user<>url dimensions for both user (U) and url (V)
> analysis (clustering, classification etc).
> - to find the similarity between a user and a url
> ( dot_product{Ua[i], Va[j]} )
>
> Is SVD a suitable solution for this problem?
>
Like I said, I don't think so. Somebody came around with exactly the same
problem just the other day.
* First off, if your data is sparse (i.e. there is no data for a user's
affinity to a particular url simply because the user never knew that url
existed), SVD is terrible for this, because it cannot tell whether the user
has not visited a url to date because he did not know about it or because he
did not like it. Like I said, ALS-WR is an improvement over that, but it
still falls short in the sense that you'd be better off encoding the
implicit-feedback part and a confidence value for the factorizer. See
http://research.yahoo.com/pub/2433, which is now a very popular approach. Ask
Sebastian Schelter how we do it in Mahout (there is a command sketch right
after this list).
* Second off, your data will still probably be too sparse for good
inference. *I think* it would eventually help if you could grab the content
of the pages and map it into a topical space using LSA or LDA (CVB in
Mahout; see the second sketch below). Once you have content info behind the
urls, you'd be able to combine (boost) factorization with regression on the
content (e.g. you could first train a regression on the url-content side to
predict the average user response, and then use implicit-feedback
factorization to guess the factors of the residual per user). I guess there
is no precooked method for this, but it would probably be the most accurate
thing to do. (Eventually you may also want to add some time-series EMA
weighting and autoregression to the result, which might yield even better
approximations of affinities based on the time of the training data as well
as the current time.)
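Roughly, the implicit-feedback ALS route looks like this in Mahout. This is
a minimal sketch only: the `parallelALS` driver flags are quoted from
memory, so verify them with `bin/mahout parallelALS --help`, and all paths
here are placeholders. The job wants userID,itemID,value triples in CSV
form (e.g. click counts), not seq2sparse vectors, so a conversion step on
your side is assumed:

====================
# hypothetical paths; input is CSV triples: userID,urlID,click_count
mahout-distribution-0.7/bin/mahout parallelALS \
-i /tmp/pabramov/clicks.csv \
-o /tmp/pabramov/als \
--numFeatures 50 \
--numIterations 10 \
--lambda 0.065 \
--implicitFeedback true \
--alpha 40 \
--tempDir /tmp/pabramov/als-tmp
====================

The --implicitFeedback/--alpha pair is the confidence weighting from the
paper above (confidence grows with the observed click count). The factor
matrices land under the output dir (U for users, M for items), and dot
products between the corresponding rows give you the user<>url affinity
scores from your third goal.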
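And for the topical mapping, CVB ships as a stock driver too. Again a
sketch with flag names from memory (check `bin/mahout cvb --help`): cvb
wants integer document keys and raw term counts, hence the `rowid`
conversion on tf-vectors rather than tfidf-vectors, and the input here is
assumed to be seq2sparse output over crawled page content, not your
user<>url vectors:

====================
# convert Text doc keys from seq2sparse output to IntWritable keys
mahout-distribution-0.7/bin/mahout rowid \
-i /tmp/pabramov/pages-sparse/tf-vectors \
-o /tmp/pabramov/cvb-input

# train LDA (CVB0); -dt emits the per-document topic distributions
mahout-distribution-0.7/bin/mahout cvb \
-i /tmp/pabramov/cvb-input/matrix \
-o /tmp/pabramov/cvb-topics \
-dict /tmp/pabramov/pages-sparse/dictionary.file-0 \
-dt /tmp/pabramov/cvb-doc-topics \
-k 100 \
-x 20
====================

The -dt output is the url-into-topic-space mapping you'd feed to the
content-side regression.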
> 3) I can apply SSVD on a sample (0.1% of my data), but it fails on 100%
> of the data (the Bt-job stops in the map phase with "Java heap space" or
> "timeout" errors).
> The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03%
> non-zero values (8 GB total).
>
> How I use it:
>
> ====================
> mahout-distribution-0.7/bin/mahout ssvd \
> -i /tmp/pabramov/sparse/tfidf-vectors/ \
> -o /tmp/pabramov/ssvd \
> -k 200 \
> -q 1 \
> --reduceTasks 150 \
> --tempDir /tmp/pabramov/tmp \
> -Dmapred.max.split.size=1000000 \
> -ow
> ====================
>
> Can't get past the Bt-job... Should I decrease the split size and/or add
> extra params?
> Hadoop has 400 map and 300 reduce slots, with 1 CPU core and 2 GB RAM per
> task.
> The Q-job completes in 20 minutes.
>
> Many thanks in advance!
>
> Pavel
>
>
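On the Bt-job heap errors in 3): the knobs that usually matter are the SSVD
blocking parameters and the mapper heap. Below is a minimal sketch of a
retuned run, not a verified fix; the option names are from memory, so check
them against `bin/mahout ssvd --help`, and note that Hadoop's generic -D
options are normally only picked up when they precede the job-specific
flags:

====================
# smaller block heights trade some speed for lower mapper memory;
# -Xmx has to fit inside the 2 GB task slots
mahout-distribution-0.7/bin/mahout ssvd \
-Dmapred.child.java.opts=-Xmx1800m \
-Dmapred.max.split.size=1000000 \
-i /tmp/pabramov/sparse/tfidf-vectors/ \
-o /tmp/pabramov/ssvd \
-k 200 \
-q 1 \
--reduceTasks 150 \
--outerProdBlockHeight 10000 \
--abtBlockHeight 100000 \
--tempDir /tmp/pabramov/tmp \
-ow
====================

The block-height values here are illustrative guesses (the defaults are
considerably larger); with q=1 the Bt/ABt pipeline holds blocks of those
heights in mapper memory, so shrinking them is a likelier fix than trying
to raise -Xmx inside a 2 GB slot.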
> ________________________________________
> From: Dmitriy Lyubimov [[email protected]]
> Sent: November 15, 2012, 21:53
> To: [email protected]
> Subject: Re: SSVD fails on seq2sparse output.
>
> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <[email protected]>
> wrote:
>
> >
> > Many thanks in advance, any suggestion is highly appreciated. I don't
> > know what to do, CF produces inaccurate results for my tasks, and SVD is
> > the only hope ))
> >
>
> I am also doubtful about that (if you are trying to factorize your
> recommendation space). SVD has proven to be notoriously inadequate for that
> problem. ALS-WR would be a much better first stab.
>
> However, since you seem to be performing text analysis (seq2sparse), I don't
> immediately see how it is related to collaborative filtering -- perhaps if
> you told us more about your problem, I am sure there are people on this list
> who could advise you on the best course of action.
>
>
> > Regards,
> > Pavel