On Sun, Nov 18, 2012 at 8:22 AM, Abramov Pavel <[email protected]> wrote:

> Many thanks for your explanation of the SVD and ALS-WR factor models. You
> are absolutely right: we have neither "negative feedback" data nor
> "preference" data.
>
> Unfortunately, we can't use content-based algorithms ("grabbing URL
> content") right now. What we have is an item title (2-10 terms), not the
> whole item content.
> We use this data to merge different URLs (word stemming, pruning
> stop words, etc.). As a result, we treat different URLs with "similar"
> titles as a single URL. This step reduces the item count.
> The day will come when we'll combine content filtering and CF ))
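[A rough sketch of the title-merging step described above: lowercasing, stop-word pruning, and collapsing URLs whose normalized titles collide. The stop-word list and the suffix-stripping "stemmer" are toy stand-ins for illustration only, not the actual pipeline:]

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "for"}  # toy stop-word list

def title_key(title):
    """Normalize a title into a merge key: lowercase, drop stop words,
    and use crude suffix stripping as a stand-in for real stemming."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    stemmed = [t.rstrip("s") for t in terms if t not in STOP_WORDS]
    return " ".join(sorted(stemmed))

def merge_urls(url_titles):
    """Group URLs whose normalized titles collide into a single item."""
    groups = defaultdict(list)
    for url, title in url_titles.items():
        groups[title_key(title)].append(url)
    return dict(groups)

groups = merge_urls({
    "http://a.com/1": "The Latest News",
    "http://b.com/2": "latest news",
})
# both URLs collapse into a single item, shrinking the item count
```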
>
>
> Can you please help me with 2 issues regarding ALS-WR:
> 1) Will the "implicitFeedback true" parameter for parallelALS enable the
> technique described in "CF for Implicit Feedback Datasets"? (Thanks for
> the link to that paper, btw.)
> 2) Is there a detailed description of the parallelALS job? I can't run
> ALS-WR on my data. It fails during the M-matrix job on the 1st iteration
> (right after the U-matrix job completes). I am not sure it is a good idea,
> but I decreased the max split size to force the mappers to use less data.
> Without this parameter, the mappers of the M job fail during the
> "initializing" phase.
>
> Sebastian, perhaps you could help with these questions?

> =================================
> mahout parallelALS \
> -i /tmp/pabramov/sparse/als_input/  \
> -o /tmp/pabramov/sparse/als_output \
> --numFeatures 21 \
> --numIterations 15 \
> --lambda 0.065 \
> --tempDir /tmp/pabramov/tmpALS \
> --implicitFeedback true \
> -Dmapred.max.split.size=4000000 \
> -Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit"
> =================================
>
>
>
> Thanks!
>
> Pavel
>
>
>
>
>
>
> On 16.11.12 at 0:31, "Dmitriy Lyubimov" <[email protected]> wrote:
>
> >On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel
> ><[email protected]> wrote:
> >
> >> Dmitriy,
> >>
> >> 1) Thank you, I'll try 0.8 instead of 0.7.
> >>
> >> 2) Regarding my problem and seq2sparse. We do not perform text
> >> analysis; we perform user click analysis. My documents are internet
> >> users, and my terms are the URLs clicked. The input data contains
> >> "user_id<>url1, url2, url1, urlN etc." vectors. It is really easy to
> >> convert these vectors to sparse TFIDF vectors using seq2sparse. The
> >> frequency of URLs follows a power law; that's why we use seq2sparse
> >> with TFIDF weighting. My goals are:
> >> - to recommend new URLs to a user
> >> - to reduce the user<>URL dimensionality for both user (U) and URL (V)
> >> analysis (clustering, classification, etc.)
> >> - to find the similarity between a user and a URL
> >> (dot_product{Ua[i], Va[j]})
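[The TFIDF weighting of click counts described above can be illustrated with a toy re-implementation. This is not seq2sparse itself; Mahout's exact weighting and normalization differ, and the data here is made up:]

```python
import math
from collections import Counter

# user -> list of clicked URLs (duplicates = repeat clicks)
clicks = {
    "u1": ["url1", "url2", "url1"],
    "u2": ["url1"],
    "u3": ["url3", "url3"],
}

n_users = len(clicks)
df = Counter()                      # in how many users' histories each URL appears
for urls in clicks.values():
    df.update(set(urls))

def tfidf_vector(user):
    """Weight each URL by click count times inverse 'document' frequency."""
    tf = Counter(clicks[user])
    # idf damps very popular URLs, which dominate under a power-law distribution
    return {u: c * math.log(n_users / df[u]) for u, c in tf.items()}

v = tfidf_vector("u1")
# url1 is clicked twice but appears in 2 of 3 histories, so its idf is small
```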
> >>
> >> Is SVD a suitable solution for this problem?
> >>
> >Like I said, I don't think so.
> >
> >Somebody came around with the exact same problem the other day.
> >
> >* First off, if your data is sparse (i.e. there's no data on a user's
> >affinity for a particular url just because the user never knew that url
> >existed), SVD is terrible, because it cannot tell whether the user has
> >not visited a url to date because he did not know about it or because he
> >did not like it. Like I said, ALS-WR is an improvement over this, but it
> >still falls short in the sense that you'd be better off encoding the
> >implicit feedback and a confidence for the factorizer. See
> >http://research.yahoo.com/pub/2433, which is now a very popular
> >approach. Ask Sebastian Schelter how we do it in Mahout.
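[For reference, the core idea of that paper is to turn each raw count r_ui into a binary preference p_ui with a confidence c_ui = 1 + alpha * r_ui, and to solve a confidence-weighted least-squares problem per user. A minimal sketch of one such user-factor update; the alpha and lambda values are illustrative, not Mahout defaults:]

```python
import numpy as np

def user_factor(Y, r_u, alpha=40.0, lam=0.1):
    """Solve for one user's factor vector in implicit-feedback ALS.

    Y   : (n_items, k) item factor matrix
    r_u : (n_items,) raw interaction counts for this user
    """
    p = (r_u > 0).astype(float)          # binary preference
    c = 1.0 + alpha * r_u                # confidence grows with the count
    k = Y.shape[1]
    # normal equations: (Y^T C Y + lam I) x = Y^T C p, with C = diag(c)
    A = (Y * c[:, None]).T @ Y + lam * np.eye(k)
    b = (Y * c[:, None]).T @ p
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 3))
x = user_factor(Y, np.array([3.0, 0.0, 1.0, 0.0, 0.0]))
```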
> >
> >* Second off, your data will probably still be too sparse for good
> >inference. *I think* it would eventually help if you could grab the
> >content of the pages and map it into a topical space using LSA or LDA
> >(CVB in Mahout). Once you have content info behind the urls, you'd be
> >able to combine (boost) factorization and regression on content (e.g.
> >you could first train a regression on the url-content side to predict
> >the average user response, and then use implicit-feedback factorization
> >to guess the factors of the residual for each user). I guess there's no
> >precooked method for it here, but it would probably be the most accurate
> >thing to do. (Eventually you may also want to do some time-series EMA
> >weighting and autoregression on the result, which might yield even
> >better approximations for affinities based on the time of the training
> >data as well as the current time.)
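[The time-series EMA weighting mentioned at the end could be as simple as exponentially down-weighting old clicks before factorization. The half-life here is purely illustrative:]

```python
def ema_weight(age_days, half_life=30.0):
    """Exponential decay: a click half_life days old counts half as much
    as one made today."""
    return 0.5 ** (age_days / half_life)

# a click from today vs. one from 60 days ago
w_now, w_old = ema_weight(0), ema_weight(60)
```

Pre-multiplying the raw counts by such weights feeds straight into the confidence values of the implicit-feedback factorizer.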
> >
> >
> >> 3) I can apply SSVD to a sample (0.1% of my data), but it fails with
> >> 100% of the data (the Bt-job stops in the Map phase with "Java heap
> >> space" or "timeout" errors).
> >> The input matrix is a sparse 20,000,000 x 150,000 matrix with ~0.03%
> >> non-zero values (8 GB total).
> >>
> >> How I use it:
> >>
> >> ====================
> >> mahout-distribution-0.7/bin/mahout ssvd \
> >> -i /tmp/pabramov/sparse/tfidf-vectors/ \
> >> -o /tmp/pabramov/ssvd \
> >> -k 200 \
> >> -q 1 \
> >> --reduceTasks 150 \
> >> --tempDir /tmp/pabramov/tmp \
> >> -Dmapred.max.split.size=1000000 \
> >> -ow
> >> ====================
> >>
> >> Can't get past the Bt-job... Should I decrease split.size and/or add
> >> extra params?
> >> Hadoop has 400 map and 300 reduce slots, with 1 CPU core and 2 GB RAM
> >> per task.
> >> The Q-job completes in 20 minutes.
> >>
> >> Many thanks in advance!
> >>
> >> Pavel
> >>
> >>
> >> ________________________________________
> >> From: Dmitriy Lyubimov [[email protected]]
> >> Sent: November 15, 2012, 21:53
> >> To: [email protected]
> >> Subject: Re: SSVD fails on seq2sparse output.
> >>
> >> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel
> >> <[email protected]> wrote:
> >>
> >> >
> >> > Many thanks in advance; any suggestion is highly appreciated. I
> >> > don't know what to do: CF produces inaccurate results for my tasks,
> >> > and SVD is the only hope ))
> >> >
> >>
> >> I am also doubtful about that (if you are trying to factorize your
> >> recommendation space). SVD has proven to be notoriously inadequate for
> >> that problem. ALS-WR would be a much better first stab.
> >>
> >> However, since you seem to be performing text analysis (seq2sparse), I
> >> don't immediately see how it is related to collaborative filtering --
> >> perhaps if you told us more about your problem, I am sure there are
> >> people on this list who could advise you on one of the best courses of
> >> action.
> >>
> >>
> >> > Regards,
> >> > Pavel
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
>
>
