I filed a Jira ticket for this issue at https://issues.apache.org/jira/browse/MAHOUT-493. It's scheduled for 0.5, as I don't have much time to work on Mahout these days (I have to finish my diploma thesis).
--sebastian

On 24.08.2010 15:45, han henry wrote:
> For 1), a user's invalid items can be stored in multiple files. We use
> MapFilesMap to load the data from HDFS, then we can check the invalid
> items.
>
> package org.apache.mahout.cf.taste.hadoop;
>
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.PathFilter;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.io.WritableComparable;
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> /** Looks up keys across all "part-*" MapFiles under one parent directory. */
> public final class MapFilesMap<K extends WritableComparable, V extends Writable>
>     implements Closeable {
>
>   private static final Logger log = LoggerFactory.getLogger(MapFilesMap.class);
>
>   // Only entries whose names start with "part-" are opened as MapFiles.
>   private static final PathFilter PARTS_FILTER = new PathFilter() {
>     public boolean accept(Path path) {
>       return path.getName().startsWith("part-");
>     }
>   };
>
>   private final List<MapFile.Reader> readers;
>
>   public MapFilesMap(FileSystem fs, Path parentDir, Configuration conf)
>       throws IOException {
>     log.info("Creating MapFileMap from parent directory {}", parentDir);
>     this.readers = new ArrayList<MapFile.Reader>();
>     try {
>       for (FileStatus status : fs.listStatus(parentDir, PARTS_FILTER)) {
>         String path = status.getPath().toString();
>         log.info("Adding MapFile.Reader at {}", path);
>         this.readers.add(new MapFile.Reader(fs, path, conf));
>       }
>     } catch (IOException ioe) {
>       close(); // release any readers opened before the failure
>       throw ioe;
>     }
>     if (this.readers.isEmpty()) {
>       throw new IllegalArgumentException("No MapFiles found in " + parentDir);
>     }
>   }
>
>   /** Fills value from the first MapFile that contains key; null if none does. */
>   public V get(K key, V value) throws IOException {
>     for (MapFile.Reader reader : this.readers) {
>       // MapFile.Reader.get populates 'value' in place and returns null on a miss.
>       if (reader.get(key, value) != null) {
>         return value;
>       }
>     }
>     log.debug("No value for key {}", key);
>     return null;
>   }
>
>   public void close() {
>     for (MapFile.Reader reader : this.readers) {
>       try {
>         reader.close();
>       } catch (IOException ioe) {
>         // best effort: keep closing the remaining readers
>       }
>     }
>   }
> }
>
> 2010/8/24 Sebastian Schelter <[email protected]>
>
> Ok, you guys got me convinced :)
>
> From a technical point of view, two ways to implement that filter come
> to my mind:
>
> 1) Just load the user/item pairs to filter into memory in the
> AggregateAndRecommendReducer (easy but might not be scalable), like Han
> Hui suggested
> 2) Have the AggregateAndRecommendReducer not pick only the top-K
> recommendations but write all predicted preferences to disk. Add another
> M/R step after that which joins recommendations and user/item filter
> pairs to allow for custom rescoring/filtering
>
> --sebastian
>
> On 24.08.2010 06:07, Ted Dunning wrote:
> > Sorry to chime in late, but removing items after recommendation isn't
> > such a crazy thing to do.
> >
> > In particular, it is common to remove previously viewed items (for a
> > period of time). Likewise, if the user says "don't show this again",
> > it makes sense to backstop the actual recommendation system with a UI
> > limitation that does a post-recommendation elimination.
> >
> > Moreover, this approach has the great benefit that the results are
> > very predictable. Exactly the requested/seen items will be eliminated
> > and no surprising effect on recommendations will occur.
> >
> > That predictability is exactly the problem, though. Generally you want
> > a bit more systemic effect for negative recommendations. This is a
> > really sticky area, however, because negative recommendations often
> > impart information about positive preferences in addition to some
> > level of negative information.
> >
> > I used an explicit filter at both Musicmatch and at Veoh.
> > Both systems worked well. Especially at Veoh, there was a lot of
> > additional machinery required to handle the related problem of
> > anti-flooding. That was done at the UI level as well.
> >
> > On Mon, Aug 23, 2010 at 8:16 PM, Sean Owen <[email protected]> wrote:
> >
> >> (Uncanny, I was just minutes before researching Grooveshark for
> >> unrelated reasons... Good to hear from any company that is doing
> >> recommendations and is willing to talk about it. I know of a number
> >> that can't or won't, unfortunately.)
> >>
> >> Yeah, sounds like we're all on the same page. One key point in what I
> >> think everyone is talking about is that this is not simply removing
> >> items *after* recommendations are computed. This risks removing most
> >> or all recommended items. It needs to be done during the process of
> >> selecting recommendations.
> >>
> >> But beyond that, it's a simple idea and just a question of
> >> implementation. It's "Rescorer" in the non-Hadoop code, which does
> >> more than provide a way to remove items: it generally rearranges
> >> recommendations according to some logic. I think it's likely easy and
> >> useful to imitate this with a simple optional Mapper/Reducer phase in
> >> this nascent "RecommenderJob" pipeline that Sebastian is now helping
> >> expand into something more configurable and general purpose.
> >>
> >> Sean
> >>
> >> On Mon, Aug 23, 2010 at 8:25 PM, Chris Bates <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'm new to this forum and haven't seen the code you are talking
> >>> about, so take this with a grain of salt. The way we handle "banned
> >>> items" at Grooveshark is to post-process the itemID pairs in Hive.
> >>> If a user dislikes a recommended song/artist, an item pair is stored
> >>> in HDFS and then when the recs are computed, those banned user-item
> >>> pairs are taken into account. Here is an example query:
> >>>
> >>> SELECT DISTINCT st.uid, st.simuid, IF(b.uid=st.uid,1,0) as banned
> >>> FROM streams_u2u st
> >>> LEFT OUTER JOIN bannedsimusers b ON (b.simuid=st.simuid);
> >>>
> >>> That query will print out a 1 or a 0 depending on whether the
> >>> recommended item pair is banned or not. Hive also supports case
> >>> statements (I think), so you can make a range of "banned-ness" I
> >>> guess. Just another solution to the "dislike" problem.
> >>>
> >>> Chris
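
To make the difference between filtering *during* selection and filtering *after* it concrete, here is a minimal sketch of option 1 from the thread (banned pairs held in memory), applied while the top-K is being picked. It uses plain Java collections rather than the actual reducer types. The class name FilteredTopK, the method topK, and its parameters are hypothetical and not part of Mahout, and the sketch assumes one user's banned item IDs fit in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical helper (not a Mahout class): top-K selection that skips banned items. */
final class FilteredTopK {

  private FilteredTopK() {}

  /**
   * Returns up to k item IDs with the highest predicted preference, best first,
   * ignoring every item ID in bannedItems. Banned items are skipped while
   * candidates are considered, so they never occupy one of the k slots.
   */
  static List<Long> topK(Map<Long, Double> predictions, Set<Long> bannedItems, int k) {
    List<Long> result = new ArrayList<>();
    if (k <= 0) {
      return result;
    }
    Comparator<Map.Entry<Long, Double>> byScore =
        (a, b) -> Double.compare(a.getValue(), b.getValue());
    // Min-heap on score: the head is the weakest candidate kept so far.
    PriorityQueue<Map.Entry<Long, Double>> heap = new PriorityQueue<>(k, byScore);
    for (Map.Entry<Long, Double> candidate : predictions.entrySet()) {
      if (bannedItems.contains(candidate.getKey())) {
        continue; // filtered before it can displace a legitimate candidate
      }
      heap.offer(candidate);
      if (heap.size() > k) {
        heap.poll(); // drop the current weakest candidate
      }
    }
    while (!heap.isEmpty()) {
      result.add(heap.poll().getKey()); // weakest first
    }
    Collections.reverse(result); // best first
    return result;
  }
}
```

With predictions {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6}, banned items {1, 2} and k = 2, this selects items 3 and 4. Picking the top 2 first and removing banned items afterwards would leave nothing, which is the risk Sean describes.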
