Yes, ecommerce.

>> #2 data includes #1 data, right?

Yes. #1 is the "raw" output of ItemSimilarity recommendations; #2 is #1 with the category filter applied.
I can't drop #1 "look-with" since #2 (with category filter) doesn't have accessories. The category filter would remove accessory recommendations for an iPhone and leave only other iPhones. The idea is to provide different "dimensions" of data: look-with; similar (look-with with the category filter applied); recommendations based on sales input, which would be named "also-bought"; etc.

"cross-cooccurrence" - what does it mean? Run ItemSimilarity with "views", then with "sales", and provide a merged result where item->item pairs exist in both outputs?

2014-08-19 20:37 GMT+04:00 Pat Ferrel <[email protected]>:

> emon is a typo
>
> I still don’t understand the difference between these “recommendations”:
> 1) "look-with recommendations" = recommended items clicked?
> 2) similar = items viewed by others?
> The recommendations clicked will lead to viewing an item, so #2 data
> includes #1 data, right? I would drop #1 and use only #2 data. Besides, if
> you only recommend items that have been recommended you will decrease
> sales because you will never show other items. Over time the recommended
> items will become out of date since you never mix in new items. You may
> always recommend an iPhone 5 even after it has been discontinued.
>
> If you know the category of an item--filter the recs by that category or
> related categories. You are already doing this in #2 below, so if you drop
> #1 there is no problem, correct? Users will not see the water heater with
> the iPhone.
>
> Question) Why do you get a water heater with an iPhone? Unless there is a
> bug somewhere, the data says that similar people looked at both. Item view
> data is not very predictive, and in any case you will get this type of
> thing if it exists in user behavior. There may even be a correlation
> between the need for an iPhone and a water heater that you don’t know
> about, or it may just be a coincidence. But for now let’s say it’s an
> anomaly in the data and just filter those out by category.
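[The merge being asked about above (keep only item->item pairs that appear in both the "views" output and the "sales" output) could be sketched as below. This is only an illustration of the question, not Mahout's actual cross-cooccurrence computation; the tab-separated pair file format is an assumption.]

```python
# Sketch (assumed format): each ItemSimilarity output file holds lines of
# "itemA<TAB>itemB<TAB>score"; we read each into a (itemA, itemB) -> score dict.

def load_pairs(path):
    """Parse 'itemA<TAB>itemB<TAB>score' lines into a pair -> score dict."""
    pairs = {}
    with open(path) as f:
        for line in f:
            a, b, score = line.rstrip("\n").split("\t")
            pairs[(a, b)] = float(score)
    return pairs

def merge_where_in_both(views, sales):
    """Keep only pairs present in BOTH outputs; keep the sales score,
    since sales are the stronger action."""
    return {pair: sales[pair] for pair in set(views) & set(sales)}
```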
>
> What I was beginning to say is that it sounds like you have an ECOM site.
> If so, do you have purchase data? Purchase data is usually much, much
> better than item view data. People tend to look at a lot of things, but
> when they purchase something it means a much higher preference than merely
> looking at something.
>
> The first rule of making a good recommender is: find the best action, one
> that shows a user preference in the strongest possible way. For ecommerce
> that usually means a purchase. Then once you have that working you can add
> more actions, but only with cross-cooccurrence; adding by weighting will
> not work with this type of recommender, it will only pollute your strong
> data with weaker actions.
>
> On Aug 19, 2014, at 8:18 AM, Serega Sheypak <[email protected]> wrote:
>
> Hi, what is "emon"?
> 1. I do create "look-with recommendations". It's really just the "raw"
> output from ItemSimilarityJob with booleanData=true and LLR as the
> similarity function (your suggestion).
> 2. I do create "similar" recommendations. I apply the category filter
> before serving recommendations.
>
> "look-with" means other users watched an iPhone case and other accessories
> with an iPhone. I do have accessories for iPhone here, but also a water
> heating device...
> similar - means show only other smartphones as recommendations for an
> iPhone.
>
> Right now the problem is the water heating device in 'look-with' (category
> filter not applied). How can I filter out such recommendations, and why do
> I get them?
>
> 2014-08-19 18:01 GMT+04:00 Pat Ferrel <[email protected]>:
>
> > That sounds much better.
> >
> > Do you have metadata like product category? Electronics vs. home
> > appliance? One easy thing to do if you have categories in your catalog
> > is filter by the same category as the item being viewed.
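[The category filter Pat suggests can be as simple as a lookup against the catalog. A minimal sketch with hypothetical data shapes (catalog as an item -> category dict, recs as (item, score) pairs, highest score first); names are illustrative, not from the actual system.]

```python
# Hypothetical sketch of the category filter for "similar" recommendations:
# keep only recommended items in the same category as the item being viewed.

def filter_by_category(viewed_item, recs, catalog):
    """catalog: item_id -> category; recs: list of (item_id, score)."""
    target = catalog.get(viewed_item)
    return [(item, score) for item, score in recs
            if catalog.get(item) == target]

# Example: a water heater recommended for an iPhone is dropped.
catalog = {"iphone5": "smartphone", "iphone4": "smartphone",
           "heater1": "appliance"}
recs = [("heater1", 0.93), ("iphone4", 0.88)]
print(filter_by_category("iphone5", recs, catalog))  # [('iphone4', 0.88)]
```

Filtering by "related categories", as Pat mentions, would just widen the membership test from equality to a set of allowed categories.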
> >
> > BTW it sounds like you have an emon
> >
> > On Aug 19, 2014, at 12:53 AM, Serega Sheypak <[email protected]> wrote:
> >
> > Hi, I've used LLR with the properties you've suggested.
> > Right now I have a problem:
> > A water heating device (
> > http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> > ) is recommended for iPhone, and it has one of the highest scores.
> > Good things:
> > iPhone cases (
> > https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> > ) are recommended for iPhone. It's good.
> > Other smartphones are recommended for iPhone. It's good.
> > Other iPhones are recommended for iPhone. It's good. 16GB recommended
> > for 32GB, etc.
> >
> > What could be a reason for recommending the water heating device for
> > iPhone? iPhone is one of the most popular items. There would have to be
> > a lot of people viewing iPhone together with the water heating device?
> >
> > 2014-08-18 20:15 GMT+04:00 Pat Ferrel <[email protected]>:
> >
> >> Oh, and as to using different algorithms, this is an “ensemble” method.
> >> In the paper they are talking about using widely differing algorithms
> >> like ALS + Cooccurrence + … This technique was used to win the Netflix
> >> prize, but in practice the improvements may be too small to warrant
> >> running multiple pipelines. In any case it isn’t the first improvement
> >> you may want to try. For instance your UI will have a drastic effect on
> >> how well your recs do, and there are other much easier techniques that
> >> we can talk about once you get the basics working.
> >>
> >> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <[email protected]> wrote:
> >>
> >> When beginning to use a recommender from Mahout I always suggest you
> >> start from the defaults. These often give the best results—then tune
> >> afterwards to improve.
> >>
> >> Your intuition is correct that multiple actions can be used to improve
> >> results, but get the basics working first. The easiest way to use
> >> multiple actions is spark-itemsimilarity, so since you are using
> >> mapreduce for now, just use one action.
> >>
> >> I would not try to combine the results from two similarity measures;
> >> there is no benefit since LLR is better than any of them, at least I’ve
> >> never seen it lose. Below is my experience with trying many of the
> >> similarity metrics on exactly the same data. I did cross-validation
> >> with precision (MAP, mean average precision). LLR wins in other cases
> >> I’ve tried too. So LLR is the only method presently used in the Spark
> >> version of itemsimilarity.
> >>
> >> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
> >>
> >> If you still get weird results, double check your ID mapping. Run a
> >> small bit of data through and spot check the mapping by hand.
> >>
> >> At some point you will want to create a cross-validation test. This is
> >> good as a sort of integration sanity check when making changes to the
> >> recommender. You run cross-validation using standard test data to see
> >> if the score changes drastically between releases. Big changes may
> >> indicate a bug. At the beginning it will help you tune, as in the case
> >> above where it helped decide on LLR.
> >>
> >> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <[email protected]> wrote:
> >>
> >> Thank you very much. I'll do what you are saying in bullets 1...5 and
> >> try again.
> >>
> >> I also tried:
> >> 1. calc data using SIMILARITY_COSINE
> >> 2. calc the same data using SIMILARITY_COOCCURRENCE
> >> 3. join #1 and #2 where cooccurrence >= $threshold
> >>
> >> Where threshold is some empirical integer value. I've used "2". The
> >> idea is to filter out item pairs which never-ever met together...
> >> Please see this link:
> >>
> >> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >>
> >> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
> >> approach still make sense, or is it a useless waste of time?
> >>
> >> "What do you mean the similar items are terrible? How are you measuring
> >> that?" I only have eyeball testing.
> >> I did automate preparation -> calculation -> HBase upload -> web-app
> >> serving; I didn't automate testing.
> >>
> >> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <[email protected]>:
> >>
> >>> The things that stand out:
> >>>
> >>> 1) remove your maxSimilaritiesPerItem option! 50000
> >>> maxSimilaritiesPerItem will _kill_ performance and give no gain; leave
> >>> this setting at the default of 500
> >>> 2) use only one action. What do you want the user to do? Do you want
> >>> them to read a page? Then train on item page views. If those pages
> >>> lead to a purchase then you want to recommend purchases, so train on
> >>> user purchases.
> >>> 3) remove your minPrefsPerUser option; this should never be 0 or it
> >>> will leave users in the training data that have no data and may
> >>> contribute to longer runs with no gain.
> >>> 4) this is a pretty small Hadoop cluster for the size of your data,
> >>> but I bet changing #1 will noticeably reduce the runtime
> >>> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> >>> 6) remove your --booleanData option since LLR ignores weights.
> >>>
> >>> Remember that this is not the same as personalized recommendations.
> >>> This method alone will show the same “similar items” for all users.
> >>>
> >>> Sorry, but both your “recommendation” types sound like the same thing.
> >>> Using both item page views _and_ clicks on recommended items will both
> >>> lead to an item page view, so you have two actions that lead to the
> >>> same thing, right?
> >>> Just train on an item page view (unless you really want the user to
> >>> make a purchase).
> >>>
> >>> What do you mean the similar items are terrible? How are you measuring
> >>> that? Are you doing cross-validation measuring precision, or A/B
> >>> testing? What looks bad to you may be good; the eyeball test is not
> >>> always reliable. If they are coming up completely crazy or random then
> >>> you may have a bug in your ID translation logic.
> >>>
> >>> It sounds like you have enough data to produce good results.
> >>>
> >>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <[email protected]> wrote:
> >>>
> >>> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not
> >>> too much, but enough for the start.
> >>> 2. I run it as an oozie action:
> >>>
> >>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>>     <java>
> >>>         <job-tracker>${jobTracker}</job-tracker>
> >>>         <name-node>${nameNode}</name-node>
> >>>         <prepare>
> >>>             <delete path="${mahoutOutputDir}/primary" />
> >>>             <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>>         </prepare>
> >>>         <configuration>
> >>>             <property>
> >>>                 <name>mapred.queue.name</name>
> >>>                 <value>default</value>
> >>>             </property>
> >>>         </configuration>
> >>>         <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>>         <arg>--input</arg>
> >>>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> >>>             item_id, pref [can be 3 or 5; 3 is VIEW item, 5 is CLICK on
> >>>             recommendation, a kind of try to increase quality of the
> >>>             recommender...] -->
> >>>         <arg>--output</arg>
> >>>         <arg>${mahoutOutputDir}/primary</arg>
> >>>         <arg>--similarityClassname</arg>
> >>>         <arg>SIMILARITY_COSINE</arg>
> >>>         <arg>--maxSimilaritiesPerItem</arg>
> >>>         <arg>50000</arg>
> >>>         <arg>--minPrefsPerUser</arg>
> >>>         <arg>0</arg>
> >>>         <arg>--booleanData</arg>
> >>>         <arg>false</arg>
> >>>         <arg>--tempDir</arg>
> >>>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>>     </java>
> >>>     <ok to="to-narrow-table"/>
> >>>     <error to="kill"/>
> >>> </action>
> >>>
> >>> 3) RANK does it, here is the script:
> >>>
> >>> --user, item, pref previously prepared by hive
> >>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> >>>     (user_id:chararray, item_id:long, pref:double);
> >>>
> >>> --get distinct users from the whole input
> >>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>>
> >>> --get distinct items from the whole input
> >>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>>
> >>> --rank users 1...N
> >>> rankUsers_ = RANK distUserId;
> >>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>>
> >>> --rank items 1...M
> >>> rankItems_ = RANK distItemId;
> >>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>>
> >>> --join and remap natural user_id, item_id to RANKs: 1..N, 1..M
> >>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING 'skewed';
> >>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by item_id using 'replicated';
> >>>
> >>> projPrefs = FOREACH joinedItems GENERATE
> >>>     joinedUsers::rankUsers::rank_id as user_id,
> >>>     rankItems::rank_id as item_id,
> >>>     joinedUsers::user_item_pref::pref as pref;
> >>>
> >>> --store mappings for later remapping from RANK back to natural values
> >>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers' using PigStorage('\t');
> >>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems' using PigStorage('\t');
> >>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into '$projPrefs' using PigStorage('\t');
> >>>
> >>> 4) I've seen this idea in different discussions, that different
> >>> weights for different actions are not good.
> >>> Sorry, I don't understand what you suggest.
> >>> I have two kinds of actions: user viewed item, user clicked on a
> >>> recommended item (recommended item produced by my item similarity
> >>> system).
> >>> I want to produce two kinds of recommendations:
> >>> 1. current item + recommend other items which other users visit in
> >>> conjunction with the current item
> >>> 2. similar item: recommend items similar to the currently viewed item.
> >>> What can I try?
> >>> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = SIMILARITY_LOGLIKELIHOOD?
> >>>
> >>> Right now I get awful recommendations and I can't understand what I
> >>> can try next :((((((((((((
> >>>
> >>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <[email protected]>:
> >>>
> >>>> 1) how many cores in the cluster? The whole idea behind mapreduce is:
> >>>> you buy more cpus, you get a nearly linear decrease in runtime.
> >>>> 2) what is your mahout command line with options, or how are you
> >>>> invoking mahout? I have seen the Mahout mapreduce recommender take
> >>>> this long, so we should check what you are doing with downsampling.
> >>>> 3) do you really need to RANK your ids? That's a full sort. When
> >>>> using pig I usually get the DISTINCT ones and assign an incrementing
> >>>> integer as the corresponding Mahout ID.
> >>>> 4) your #2, assigning different weights to different actions, usually
> >>>> does not work. I've done this before, compared offline metrics, and
> >>>> seen precision go down. I'd get this working using only your primary
> >>>> actions first. What are you trying to get the user to do? View
> >>>> something, buy something? Use that action as the primary preference
> >>>> and start out with a weight of 1 using LLR. With LLR the weights are
> >>>> not used anyway, so your data may not produce good results with mixed
> >>>> actions.
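[For reference, the LLR score discussed throughout the thread (SIMILARITY_LOGLIKELIHOOD in Mahout) scores a 2x2 cooccurrence table: independent items score near 0, strongly associated items score high, and raw counts/weights drop out, which is why Pat says weights are ignored. A sketch following the entropy formulation of Ted Dunning's test; this is an illustration, not Mahout's actual code.]

```python
import math

def xlogx(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 cooccurrence table:
    k11 = users who interacted with both items,
    k12, k21 = users who interacted with only one of them,
    k22 = users who interacted with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

A perfectly independent table (e.g. all four cells equal) scores 0, which is what makes LLR good at filtering out pairs that merely co-occur because both items are popular.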
> >>>>
> >>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >>>> 1) output from 2 can be directly ingested and will create output.
> >>>> 2) multiple actions can be used with cross-cooccurrence, not by
> >>>> guessing at weights.
> >>>> 3) output has your application-specific IDs preserved.
> >>>> 4) it's about 10x faster than mapreduce and will do away with your ID
> >>>> translation steps.
> >>>>
> >>>> One caveat is that your cluster machines will need lots of memory. I
> >>>> have 8-16g on mine.
> >>>>
> >>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:
> >>>>
> >>>> 1. I collect preferences for items using a 60-day sliding window
> >>>> (today - 60 days).
> >>>> 2. I prepare triples of user_id, item_id, discrete_pref_value (3 for
> >>>> item view, 5 for clicking the recommendation block; the idea is to
> >>>> give more value to recommendations which attract visitor attention).
> >>>> I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000
> >>>> distinct users.
> >>>> 3. I use the apache pig RANK function to rank all distinct user_id
> >>>> 4. I do the same for item_id
> >>>> 5. I join the input dataset with the ranked datasets and provide
> >>>> input to mahout with dense integer user_id, item_id
> >>>> 6. I take the mahout output and join the integer item_id back to get
> >>>> the natural key value.
> >>>>
> >>>> step #1-2 takes ~40min
> >>>> step #3-5 takes ~1 hour
> >>>> mahout calc takes ~3 hours
> >>>>
> >>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:
> >>>>
> >>>>> This really doesn't sound right. It should be possible to process
> >>>>> almost a thousand times that much data every night without that much
> >>>>> problem.
> >>>>>
> >>>>> How are you preparing the input data?
> >>>>>
> >>>>> How are you converting to Mahout id's?
> >>>>>
> >>>>> Even using python, you should be able to do the conversion in just a
> >>>>> few minutes without any parallelism whatsoever.
> >>>>>
> >>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]> wrote:
> >>>>>
> >>>>>> Hi, we are trying to calculate ItemSimilarity.
> >>>>>> Right now we have 2*10^7 input lines. I provide input data as raw
> >>>>>> text each day to recalculate item similarities. We get +100..1000
> >>>>>> new items each day.
> >>>>>> 1. It takes too much time to prepare the input data.
> >>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids.
> >>>>>>
> >>>>>> Is there any possibility to provide data to mahout mapreduce
> >>>>>> ItemSimilarity using some binary format with compression?
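[The single-pass conversion Ted alludes to (no full RANK sort needed) can be done with an in-memory dict, which is cheap at this scale (~2,000,000 users, ~1,000,000 items). A minimal sketch, assuming comma-separated user_id,item_id,pref input lines; the function name and shapes are illustrative.]

```python
# Sketch: map arbitrary user/item IDs to dense integers 0..N-1 in one pass.
# Mahout's mapreduce ItemSimilarityJob needs numeric IDs; the returned
# dicts double as the mappings needed to translate the output back.

def to_dense_ids(lines):
    users, items, triples = {}, {}, []
    for line in lines:
        user, item, pref = line.rstrip("\n").split(",")
        uid = users.setdefault(user, len(users))   # assign IDs in first-seen order
        iid = items.setdefault(item, len(items))
        triples.append((uid, iid, float(pref)))
    return triples, users, items
```

Since IDs are assigned in first-seen order, new items arriving each day simply extend the mapping rather than forcing a re-sort.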
