Hi Sean,

> PS I think I mentioned off-list, but this is more or less exactly the basis
> of Myrrix (http://myrrix.com). It should be able to handle this scale,
> maybe slightly more easily since it can load only the subset of these
> matrices needed by each worker -- more reducers means less RAM per reducer.
> You might also try this out if scale is the issue.
Can Myrrix Computation Level run on FreeBSD? Yes, we use Hadoop with FreeBSD )

Regards,
Pavel

On 18.11.12 23:31, "Sean Owen" <[email protected]> wrote:

> ALS-WR is a great fit for this input. Your pre-processing is a good way to
> add some extra info to the process.
>
> I believe the implicitFeedback=true setting does make it follow the paper
> you cite. It's no longer estimating the input (i.e. not estimating
> 'ratings') but using the input values as loss function weights. This works
> nicely.
>
> Yes, as Sebastian says, it speeds things up greatly to put the feature
> matrices in memory, but with 20M users that is way bigger than the memory
> allocated to your reducers.
>
> PS I think I mentioned off-list, but this is more or less exactly the basis
> of Myrrix (http://myrrix.com). It should be able to handle this scale,
> maybe slightly more easily since it can load only the subset of these
> matrices needed by each worker -- more reducers means less RAM per reducer.
> You might also try this out if scale is the issue.
>
>
> On Sun, Nov 18, 2012 at 4:22 PM, Abramov Pavel <[email protected]> wrote:
>
>> Many thanks for your explanation about SVD and ALS-WR factor models. You
>> are absolutely right: we don't have "negative feedback" data or
>> "preference" data.
>>
>> Unfortunately we can't use content-based algorithms ("to grab URL
>> content") right now. What we have is an item title (2-10 terms), not the
>> whole item content.
>> We use this data to merge different URLs (word stemming, pruning
>> stop-words etc). As a result we interpret different URLs with "similar"
>> titles as a single URL. This step reduces the item count.
>> The day will come and we'll combine content filtering and CF ))
>>
>>
>> Can you please help me with 2 issues regarding ALS-WR:
>> 1) Will the "implicitFeedback true" parameter for parallelALS enable the
>> technique described in "CF for Implicit Feedback Datasets"? (thanks for
>> the link to this paper btw)
>> 2) Is there any detailed description of the parallelALS job? I can't run
>> ALS-WR with my data. It fails during the M matrix job on the 1st
>> iteration (right after the U matrix job completes). I am not sure it is a
>> good idea, but I decreased the max split size to force mappers to use
>> less data. W/o this parameter, mappers of the M job fail during the
>> "initializing" phase.
>>
>> =================================
>> mahout parallelALS \
>> -i /tmp/pabramov/sparse/als_input/ \
>> -o /tmp/pabramov/sparse/als_output \
>> --numFeatures 21 \
>> --numIterations 15 \
>> --lambda 0.065 \
>> --tempDir /tmp/pabramov/tmpALS \
>> --implicitFeedback true \
>> -Dmapred.max.split.size=4000000 \
>> -Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit"
>> =================================
>>
>>
>> Thanks!
>>
>> Pavel
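For reference, the paper in question ("Collaborative Filtering for Implicit
Feedback Datasets", the http://research.yahoo.com/pub/2433 link Dmitriy gives
below) does not try to reproduce the raw click counts; it minimizes a
confidence-weighted squared loss, which is the "input values as loss function
weights" behaviour Sean describes. A sketch of the paper's objective in its
own notation (alpha is the paper's confidence scaling constant, not a Mahout
option name):

=================================
\min_{X,Y}\;\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda\Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)

\text{where } p_{ui} =
  \begin{cases} 1 & r_{ui} > 0 \text{ (the user clicked URL } i\text{)} \\
                0 & \text{otherwise,} \end{cases}
\qquad c_{ui} = 1 + \alpha\, r_{ui}
=================================

The click count r_ui only sets the confidence c_ui on each loss term, so an
unobserved user/URL pair is treated as "probably uninteresting, with low
confidence" rather than as a hard zero rating.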
>>
>> On 16.11.12 0:31, "Dmitriy Lyubimov" <[email protected]> wrote:
>>
>> > On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel <[email protected]> wrote:
>> >
>> >> Dmitriy,
>> >>
>> >> 1) Thank you, I'll try 0.8 instead of 0.7.
>> >>
>> >> 2) Regarding my problem and seq2sparse. We do not perform text
>> >> analysis. We perform user click analysis. My documents are internet
>> >> users and my terms are the URLs clicked. Input data contains
>> >> "user_id<>url1, url2, url1, urlN etc" vectors. It is really easy to
>> >> convert these vectors to sparse TFIDF vectors using seq2sparse. The
>> >> frequency of URLs follows a power law. That's why we use seq2sparse
>> >> with TFIDF weighting.
>> >>
>> >> My goals are:
>> >> - to recommend new URLs to users
>> >> - to reduce the user<>URL dimensionality for both user (U) and URL (V)
>> >>   analysis (clustering, classification etc)
>> >> - to find the similarity between a user and a URL
>> >>   ( dot_product{Ua[i], Va[j]} )
>> >>
>> >> Is SVD a suitable solution for this problem?
>> >
>> > Like I said, I don't think so.
>> >
>> > Somebody just came around with the exact same problem the other day.
>> >
>> > * First off, if your data is sparse (i.e. there's no data for a user's
>> > affinity to a particular URL simply because the user never knew that
>> > URL existed), SVD is terrible, because it cannot tell whether a user
>> > has not visited a URL to date because he did not know about it or
>> > because he did not like it. Like I said, ALS-WR is an improvement over
>> > this, but it still falls short in the sense that you'd be better off
>> > encoding the implicit feedback and a confidence for the factorizer.
>> > See http://research.yahoo.com/pub/2433, which is now a very popular
>> > approach. Ask Sebastian Schelter how we do it in Mahout.
>> >
>> > * Second off, your data will still probably be too sparse for good
>> > inference. *I think* it would eventually help if you could grab the
>> > content of the pages and map them into a topical space using LSA or
>> > LDA (CVB in Mahout). Once you have content info behind the URLs, you'd
>> > be able to combine (boost) factorization and regression on content
>> > (e.g. you could first train a regression on the URL content side to
>> > predict the average user response, and then use implicit-feedback
>> > factorization to guess the factors of the residual per user). I guess
>> > there's no precooked method here for it, but that would probably be
>> > the most accurate thing to do. (Eventually you may also want to do
>> > some time-series EMA weighting and autoregression on the result, which
>> > might yield even better approximations of affinities based on the time
>> > of the training data as well as the current time.)
>> >
>> >
>> >> 3) I can apply SSVD to a sample (0.1% of my data). But it fails with
>> >> 100% of the data. (The Bt-job stops in the Map phase with "Java heap
>> >> space" or "timeout" errors.)
>> >> The input is a sparse 20,000,000 x 150,000 matrix with ~0.03% non-zero
>> >> values (8 GB total).
>> >>
>> >> How I use it:
>> >>
>> >> ====================
>> >> mahout-distribution-0.7/bin/mahout ssvd \
>> >> -i /tmp/pabramov/sparse/tfidf-vectors/ \
>> >> -o /tmp/pabramov/ssvd \
>> >> -k 200 \
>> >> -q 1 \
>> >> --reduceTasks 150 \
>> >> --tempDir /tmp/pabramov/tmp \
>> >> -Dmapred.max.split.size=1000000 \
>> >> -ow
>> >> ====================
>> >>
>> >> Can't get past the Bt-job... Should I decrease split.size and/or add
>> >> extra params? Hadoop has 400 map and 300 reduce slots with 1 CPU core
>> >> and 2 GB RAM per task.
>> >> Q-job completes in 20 minutes.
>> >>
>> >> Many thanks in advance!
>> >>
>> >> Pavel
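Aside: the second and third goals in Pavel's list above fall straight out of
the factor matrices once a factorization (ALS-WR or SSVD) has produced them.
A minimal plain-Java sketch, assuming the relevant user and URL factor rows
are already loaded into arrays (hypothetical names, not Mahout API calls):

=================================
// Illustrative sketch only: the affinities Pavel lists as goals, computed
// from factor rows assumed to be already loaded into plain arrays.
public final class FactorAffinity {

  /** Goal 3: affinity of user i for URL j, i.e. dot_product{Ua[i], Va[j]}. */
  static double affinity(double[] userRow, double[] urlRow) {
    double sum = 0.0;
    for (int f = 0; f < userRow.length; f++) {
      sum += userRow[f] * urlRow[f];
    }
    return sum;
  }

  /** Goal 2: similarity of two URLs in the reduced factor space (cosine),
      usable for clustering or classification of URLs. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int f = 0; f < a.length; f++) {
      dot += a[f] * b[f];
      normA += a[f] * a[f];
      normB += b[f] * b[f];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // Hypothetical 3-feature factors (the real run above uses numFeatures 21).
    double[] user = {0.4, 1.2, -0.3};
    double[] url1 = {0.5, 0.9, 0.1};
    double[] url2 = {-0.2, 0.1, 0.8};
    System.out.println("user-url1 affinity: " + affinity(user, url1));
    System.out.println("url1-url2 cosine:   " + cosine(url1, url2));
  }
}
=================================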
>> >>
>> >> ________________________________________
>> >> From: Dmitriy Lyubimov [[email protected]]
>> >> Sent: November 15, 2012, 21:53
>> >> To: [email protected]
>> >> Subject: Re: SSVD fails on seq2sparse output.
>> >>
>> >> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <[email protected]> wrote:
>> >>
>> >> > Many thanks in advance, any suggestion is highly appreciated. I
>> >> > don't know what to do; CF produces inaccurate results for my tasks,
>> >> > SVD is the only hope ))
>> >>
>> >> I am also doubtful about that (if you are trying to factorize your
>> >> recommendation space). SVD has proven to be notoriously inadequate
>> >> for that problem. ALS-WR would be a much better first stab.
>> >>
>> >> However, since you seem to be performing text analysis (seq2sparse),
>> >> I don't immediately see how it is related to collaborative filtering
>> >> -- perhaps if you told us more about your problem; I am sure there
>> >> are people on this list who could advise you on the best course of
>> >> action.
>> >>
>> >> > Regards,
>> >> > Pavel
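To connect the first goal ("recommend new URL to user") to the factorization
being discussed: once the user and URL factor matrices are in memory,
recommendation reduces to scoring the URLs a user has not clicked yet and
keeping the top N. A minimal sketch under that assumption (hypothetical class
and variable names; illustrative only, not Mahout's own recommender code):

=================================
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: given one user's factor row and the URL factor
// matrix (assumed already loaded), score every unclicked URL with the dot
// product and return the indices of the top N.
public final class TopNUrls {

  static List<Integer> recommend(double[] userRow, double[][] urlRows,
                                 Set<Integer> alreadyClicked, int howMany) {
    List<double[]> scored = new ArrayList<>();           // {urlIndex, score}
    for (int j = 0; j < urlRows.length; j++) {
      if (alreadyClicked.contains(j)) {
        continue;                                        // only new URLs
      }
      double score = 0.0;                                // dot_product{Ua[i], Va[j]}
      for (int f = 0; f < userRow.length; f++) {
        score += userRow[f] * urlRows[j][f];
      }
      scored.add(new double[] {j, score});
    }
    // Keep the N highest-scoring URLs.
    scored.sort(Comparator.comparingDouble((double[] p) -> p[1]).reversed());
    List<Integer> top = new ArrayList<>();
    for (int k = 0; k < Math.min(howMany, scored.size()); k++) {
      top.add((int) scored.get(k)[0]);
    }
    return top;
  }
}
=================================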
