Hi Sean,

> PS I think I mentioned off-list, but this is more or less exactly the basis
> of Myrrix (http://myrrix.com). It should be able to handle this scale,
> maybe slightly more easily since it can load only the subset of these
> matrices needed by each worker -- more reducers means less RAM per reducer.
> You might also try this out if scale is the issue.
Can Myrrix Computation Level run on FreeBSD? Yes, we use Hadoop with FreeBSD )

Regards,
Pavel

On 18.11.12 23:31, "Sean Owen" <[email protected]> wrote:

> ALS-WR is a great fit for this input. Your pre-processing is a good way to
> add some extra info to the process.
>
> I believe the implicitFeedback=true setting does make it follow the paper
> you cite. It's no longer estimating the input (i.e. not estimating
> 'ratings') but using the input values as loss function weights. This works
> nicely.
>
> Yes, as Sebastian says, it speeds things up greatly to put the feature
> matrices in memory, but with 20M users that is way bigger than the memory
> allocated to your reducers.
>
> PS I think I mentioned off-list, but this is more or less exactly the basis
> of Myrrix (http://myrrix.com). It should be able to handle this scale,
> maybe slightly more easily since it can load only the subset of these
> matrices needed by each worker -- more reducers means less RAM per reducer.
> You might also try this out if scale is the issue.
>
>
> On Sun, Nov 18, 2012 at 4:22 PM, Abramov Pavel <[email protected]> wrote:
>
>> Many thanks for your explanation about SVD and ALS-WR factor models. You
>> are absolutely right: we don't have "negative feedback" data or
>> "preference" data.
>>
>> Unfortunately we can't use content-based algorithms ("to grab URL
>> content") right now. What we have is an item title (2-10 terms), not the
>> whole item content.
>> We use this data to merge different URLs (word stemming, pruning
>> stop-words etc). As a result we interpret different URLs with "similar"
>> titles as a single URL. This step reduces the item count.
>> The day will come and we'll combine content filtering and CF ))
>>
>>
>> Can you please help me with 2 issues regarding ALS-WR:
>> 1) Will the "implicitFeedback true" parameter for parallelALS enable the
>> technique described in "CF for Implicit Feedback Datasets"? (thanks for
>> the link to this paper btw)
>> 2) Is there any detailed description of the parallelALS job? I can't run
>> ALS-WR with my data. It fails during the M matrix job on the 1st
>> iteration (right after the U matrix job completes). I am not sure it is a
>> good idea, but I decreased the max split size to force mappers to use
>> less data. W/o this parameter, mappers of the M job fail during the
>> "initializing" phase.
>>
>> =================================
>> mahout parallelALS \
>> -i /tmp/pabramov/sparse/als_input/ \
>> -o /tmp/pabramov/sparse/als_output \
>> --numFeatures 21 \
>> --numIterations 15 \
>> --lambda 0.065 \
>> --tempDir /tmp/pabramov/tmpALS \
>> --implicitFeedback true \
>> -Dmapred.max.split.size=4000000 \
>> -Dmapred.map.child.java.opts="-Xmx3024m -XX:-UseGCOverheadLimit"
>> =================================
>>
>>
>> Thanks!
>>
>> Pavel
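For reference, the paper in question ("Collaborative Filtering for Implicit
Feedback Datasets", the http://research.yahoo.com/pub/2433 link Dmitriy gives
below) does not try to reproduce the raw click counts; it minimizes a
confidence-weighted squared loss, which is the "input values as loss function
weights" behaviour Sean describes. A sketch of the paper's objective in its
own notation (alpha is the paper's confidence scaling constant, not a Mahout
option name):

=================================
\min_{X,Y}\;\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda\Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)

\text{where } p_{ui} =
  \begin{cases} 1 & r_{ui} > 0 \text{ (the user clicked URL } i\text{)} \\
                0 & \text{otherwise,} \end{cases}
\qquad c_{ui} = 1 + \alpha\, r_{ui}
=================================

The click count r_ui only sets the confidence c_ui on each loss term, so an
unobserved user/URL pair is treated as "probably uninteresting, with low
confidence" rather than as a hard zero rating.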
>>
>> On 16.11.12 0:31, "Dmitriy Lyubimov" <[email protected]> wrote:
>>
>> > On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel <[email protected]> wrote:
>> >
>> >> Dmitriy,
>> >>
>> >> 1) Thank you, I'll try 0.8 instead of 0.7.
>> >>
>> >> 2) Regarding my problem and seq2sparse. We do not perform text
>> >> analysis. We perform user click analysis. My documents are internet
>> >> users and my terms are the URLs clicked. Input data contains
>> >> "user_id<>url1, url2, url1, urlN etc" vectors. It is really easy to
>> >> convert these vectors to sparse TFIDF vectors using seq2sparse. The
>> >> frequency of URLs follows a power law. That's why we use seq2sparse
>> >> with TFIDF weighting.
>> >>
>> >> My goals are:
>> >> - to recommend new URLs to users
>> >> - to reduce the user<>URL dimensionality for both user (U) and URL (V)
>> >>   analysis (clustering, classification etc)
>> >> - to find the similarity between a user and a URL
>> >>   ( dot_product{Ua[i], Va[j]} )
>> >>
>> >> Is SVD a suitable solution for this problem?
>> >
>> > Like I said, I don't think so.
>> >
>> > Somebody just came around with the exact same problem the other day.
>> >
>> > * First off, if your data is sparse (i.e. there's no data for a user's
>> > affinity to a particular URL simply because the user never knew that
>> > URL existed), SVD is terrible, because it cannot tell whether a user
>> > has not visited a URL to date because he did not know about it or
>> > because he did not like it. Like I said, ALS-WR is an improvement over
>> > this, but it still falls short in the sense that you'd be better off
>> > encoding the implicit feedback and a confidence for the factorizer.
>> > See http://research.yahoo.com/pub/2433, which is now a very popular
>> > approach. Ask Sebastian Schelter how we do it in Mahout.
>> >
>> > * Second off, your data will still probably be too sparse for good
>> > inference. *I think* it would eventually help if you could grab the
>> > content of the pages and map them into a topical space using LSA or
>> > LDA (CVB in Mahout). Once you have content info behind the URLs, you'd
>> > be able to combine (boost) factorization and regression on content
>> > (e.g. you could first train a regression on the URL content side to
>> > predict the average user response, and then use implicit-feedback
>> > factorization to guess the factors of the residual per user). I guess
>> > there's no precooked method here for it, but that would probably be
>> > the most accurate thing to do. (Eventually you may also want to do
>> > some time-series EMA weighting and autoregression on the result, which
>> > might yield even better approximations of affinities based on the time
>> > of the training data as well as the current time.)
>> >
>> >
>> >> 3) I can apply SSVD to a sample (0.1% of my data). But it fails with
>> >> 100% of the data. (The Bt-job stops in the Map phase with "Java heap
>> >> space" or "timeout" errors.)
>> >> The input is a sparse 20,000,000 x 150,000 matrix with ~0.03% non-zero
>> >> values (8 GB total).
>> >>
>> >> How I use it:
>> >>
>> >> ====================
>> >> mahout-distribution-0.7/bin/mahout ssvd \
>> >> -i /tmp/pabramov/sparse/tfidf-vectors/ \
>> >> -o /tmp/pabramov/ssvd \
>> >> -k 200 \
>> >> -q 1 \
>> >> --reduceTasks 150 \
>> >> --tempDir /tmp/pabramov/tmp \
>> >> -Dmapred.max.split.size=1000000 \
>> >> -ow
>> >> ====================
>> >>
>> >> Can't get past the Bt-job... Should I decrease split.size and/or add
>> >> extra params? Hadoop has 400 map and 300 reduce slots with 1 CPU core
>> >> and 2 GB RAM per task.
>> >> Q-job completes in 20 minutes.
>> >>
>> >> Many thanks in advance!
>> >>
>> >> Pavel
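Aside: the second and third goals in Pavel's list above fall straight out of
the factor matrices once a factorization (ALS-WR or SSVD) has produced them.
A minimal plain-Java sketch, assuming the relevant user and URL factor rows
are already loaded into arrays (hypothetical names, not Mahout API calls):

=================================
// Illustrative sketch only: the affinities Pavel lists as goals, computed
// from factor rows assumed to be already loaded into plain arrays.
public final class FactorAffinity {

  /** Goal 3: affinity of user i for URL j, i.e. dot_product{Ua[i], Va[j]}. */
  static double affinity(double[] userRow, double[] urlRow) {
    double sum = 0.0;
    for (int f = 0; f < userRow.length; f++) {
      sum += userRow[f] * urlRow[f];
    }
    return sum;
  }

  /** Goal 2: similarity of two URLs in the reduced factor space (cosine),
      usable for clustering or classification of URLs. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int f = 0; f < a.length; f++) {
      dot += a[f] * b[f];
      normA += a[f] * a[f];
      normB += b[f] * b[f];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // Hypothetical 3-feature factors (the real run above uses numFeatures 21).
    double[] user = {0.4, 1.2, -0.3};
    double[] url1 = {0.5, 0.9, 0.1};
    double[] url2 = {-0.2, 0.1, 0.8};
    System.out.println("user-url1 affinity: " + affinity(user, url1));
    System.out.println("url1-url2 cosine:   " + cosine(url1, url2));
  }
}
=================================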
>> >>
>> >> ________________________________________
>> >> From: Dmitriy Lyubimov [[email protected]]
>> >> Sent: November 15, 2012, 21:53
>> >> To: [email protected]
>> >> Subject: Re: SSVD fails on seq2sparse output.
>> >>
>> >> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <[email protected]> wrote:
>> >>
>> >> > Many thanks in advance, any suggestion is highly appreciated. I
>> >> > don't know what to do; CF produces inaccurate results for my tasks,
>> >> > SVD is the only hope ))
>> >>
>> >> I am also doubtful about that (if you are trying to factorize your
>> >> recommendation space). SVD has proven to be notoriously inadequate
>> >> for that problem. ALS-WR would be a much better first stab.
>> >>
>> >> However, since you seem to be performing text analysis (seq2sparse),
>> >> I don't immediately see how it is related to collaborative filtering
>> >> -- perhaps if you told us more about your problem; I am sure there
>> >> are people on this list who could advise you on the best course of
>> >> action.
>> >>
>> >> > Regards,
>> >> > Pavel
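To connect the first goal ("recommend new URL to user") to the factorization
being discussed: once the user and URL factor matrices are in memory,
recommendation reduces to scoring the URLs a user has not clicked yet and
keeping the top N. A minimal sketch under that assumption (hypothetical class
and variable names; illustrative only, not Mahout's own recommender code):

=================================
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: given one user's factor row and the URL factor
// matrix (assumed already loaded), score every unclicked URL with the dot
// product and return the indices of the top N.
public final class TopNUrls {

  static List<Integer> recommend(double[] userRow, double[][] urlRows,
                                 Set<Integer> alreadyClicked, int howMany) {
    List<double[]> scored = new ArrayList<>();           // {urlIndex, score}
    for (int j = 0; j < urlRows.length; j++) {
      if (alreadyClicked.contains(j)) {
        continue;                                        // only new URLs
      }
      double score = 0.0;                                // dot_product{Ua[i], Va[j]}
      for (int f = 0; f < userRow.length; f++) {
        score += userRow[f] * urlRows[j][f];
      }
      scored.add(new double[] {j, score});
    }
    // Keep the N highest-scoring URLs.
    scored.sort(Comparator.comparingDouble((double[] p) -> p[1]).reversed());
    List<Integer> top = new ArrayList<>();
    for (int k = 0; k < Math.min(howMany, scored.size()); k++) {
      top.add((int) scored.get(k)[0]);
    }
    return top;
  }
}
=================================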
