Many thanks for your explanation about SVD and ALS-WR factor models. You
are absolutely right: we have neither "negative feedback" data nor
"preference" data.

Unfortunately we can't use content-based algorithms ("grab the URL
content") right now. What we have is an item title (2-10 terms), not the
whole item content.
We use this data to merge different URLs (word stemming, pruning
stop-words, etc.). As a result we interpret different URLs with "similar"
titles as a single URL. This step reduces the item count.
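
In case it helps, here is roughly what our normalization step does. This
is a minimal sketch only: I am assuming Lucene's EnglishAnalyzer for the
stemming and stop-word pruning, and the class/method names (TitleKey,
normalize) are made up for illustration; the real job is a Hadoop version
of the same idea.

=================================
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TitleKey {

  // Builds a canonical key for an item title: stems terms, drops
  // stop-words, and sorts the tokens so word order does not matter.
  // URLs whose titles map to the same key are merged into one item.
  public static String normalize(String title) throws IOException {
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("title", new StringReader(title));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    ts.reset();
    while (ts.incrementToken()) {
      tokens.add(term.toString());
    }
    ts.end();
    ts.close();
    Collections.sort(tokens);
    StringBuilder key = new StringBuilder();
    for (String token : tokens) {
      key.append(token).append(' ');
    }
    return key.toString().trim();
  }
}
=================================

Two titles that reduce to the same sorted token set produce the same key,
so their URLs collapse into a single item.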
The day will come when we combine content-based filtering and CF ))


Can you please help me with two issues regarding ALS-WR?
1) Will the "--implicitFeedback true" parameter of parallelALS enable the
technique described in "CF for Implicit Feedback Datasets"? (Thanks for
the link to this paper, btw; I recap my reading of its model below the
command.)
2) Is there a detailed description of the parallelALS job? I can't run
ALS-WR on my data. It fails during the M-matrix job on the 1st iteration
(right after the U-matrix job completes). I am not sure it is a good
idea, but I decreased the max split size to force the mappers to use less
data. Without this parameter the mappers of the M job fail during the
"initializing" phase.

=================================
mahout parallelALS \
-Dmapred.max.split.size=4000000 \
-Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit" \
-i /tmp/pabramov/sparse/als_input/ \
-o /tmp/pabramov/sparse/als_output \
--numFeatures 21 \
--numIterations 15 \
--lambda 0.065 \
--tempDir /tmp/pabramov/tmpALS \
--implicitFeedback true
=================================
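
For reference, my reading of the model in that paper (Hu, Koren and
Volinsky); please correct me if this is not what --implicitFeedback
enables. Each raw count r[u][i] becomes a binary preference
p[u][i] = (r[u][i] > 0 ? 1 : 0) together with a confidence
c[u][i] = 1 + alpha * r[u][i], and the factorizer minimizes

sum_{u,i} c[u][i] * (p[u][i] - dot_product{U[u], V[i]})^2
  + lambda * (||U[u]||^2 + ||V[i]||^2)

over all (u, i) cells, observed or not.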



Thanks!

Pavel






On 16.11.12 at 0:31, "Dmitriy Lyubimov" <[email protected]> wrote:

>On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel
><[email protected]> wrote:
>
>> Dmitriy,
>>
>> 1) Thank you, I'll try 0.8 instead of 0.7.
>>
>> 2) Regarding my problem and seq2sparse. We do not perform text analysis;
>> we perform user click analysis. My documents are internet users and my
>> terms are the URLs clicked. The input data contains "user_id<>url1,
>> url2, url1, urlN etc" vectors. It is really easy to convert these
>> vectors to sparse TFIDF vectors using seq2sparse. The frequency of URLs
>> follows a power law; that's why we use seq2sparse with TFIDF weighting.
>> My goals are:
>> - to recommend new URLs to a user
>> - to reduce the user<>url dimension for both user (U) and url (V)
>> analysis (clustering, classification, etc.)
>> - to find the similarity between a user and a url
>> ( dot_product{Ua[i], Va[j]} )
>>
>> Is SVD a suitable solution for this problem?
>>
>Like I said, I don't think so.
>
>Somebody came around with the exact same problem just the other day.
>
>* First off, if your data is sparse (i.e. there's no data on a user's
>affinity to a particular url simply because the user never knew that url
>existed), SVD is terrible for that, because it cannot tell whether the
>user has not visited a url to date because he did not know about it or
>because he did not like it. Like I said, ALS-WR is an improvement over
>this, but it still falls short in the sense that you'd be better off
>encoding the implicit-feedback part and a confidence for the factorizer.
>See http://research.yahoo.com/pub/2433, which is now a very popular
>approach. Ask Sebastian Schelter how we do it in Mahout.
>
>* Second off, your data will still probably be too sparse for good
>inference. *I think* it would eventually help if you could grab the
>content of the pages and map them into a topical space using LSA or LDA
>(CVB in Mahout). Once you have content info behind the urls, you'd be
>able to combine (boost) factorization and regression on content (e.g. you
>could first train a regression on the url content side to predict the
>average user response, and then use implicit-feedback factorization to
>guess the factors of the residual based on the user). I guess there's no
>precooked method for this here, but it would probably be the most
>accurate thing to do. (Eventually you may also want to do some
>time-series EMA weighting and autoregression on the result, I guess,
>which might yield even better approximations of affinities based on the
>time of the training data as well as the current time.)
>
>
>> 3) I can apply SSVD to a sample (0.1% of my data), but it fails with
>> 100% of the data (the Bt-job stops in the Map phase with "Java heap
>> space" or "timeout" errors).
>> The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03%
>> non-zero values (8 GB total).
>>
>> How I use it:
>>
>> ====================
>> mahout-distribution-0.7/bin/mahout ssvd \
>> -Dmapred.max.split.size=1000000 \
>> -i /tmp/pabramov/sparse/tfidf-vectors/ \
>> -o /tmp/pabramov/ssvd \
>> -k 200 \
>> -q 1 \
>> --reduceTasks 150 \
>> --tempDir /tmp/pabramov/tmp \
>> -ow
>> ====================
>>
>> I can't get past the Bt-job... Should I decrease the split size and/or
>> add extra parameters? Hadoop has 400 map and 300 reduce slots, with 1
>> CPU core and 2 GB RAM per task.
>> The Q-job completes in 20 minutes.
>>
>> Many thanks in advance!
>>
>> Pavel
>>
>>
>> ________________________________________
>> From: Dmitriy Lyubimov [[email protected]]
>> Sent: November 15, 2012, 21:53
>> To: [email protected]
>> Subject: Re: SSVD fails on seq2sparse output.
>>
>> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel
>> <[email protected]> wrote:
>>
>> >
>> > Many thanks in advance; any suggestion is highly appreciated. I don't
>> > know what to do: CF produces inaccurate results for my tasks, and SVD
>> > is the only hope ))
>> >
>>
>> I am also doubtful about that (if you are trying to factorize your
>> recommendation space). SVD has proven to be notoriously inadequate for
>> that problem. ALS-WR would be a much better first stab.
>>
>> However, since you seem to be performing text analysis (seq2sparse), I
>> don't immediately see how it relates to collaborative filtering.
>> Perhaps you could tell us more about your problem; I am sure there are
>> people on this list who could advise you on the best course of action.
>>
>>
>> > Regards,
>> > Pavel
