Re: Performance Issue using item-based approach!

Ted Dunning Fri, 18 Apr 2014 10:24:26 -0700

You can always run Hadoop in a local mode.  Nothing prevents a single node
from being a cluster.  :-)



On Thu, Apr 17, 2014 at 7:43 AM, Najum Ali <[email protected]> wrote:

> Ted,
>
> Is it also possible to use ItemSimilarityJob in a non-distributed
> environment?
>
> On 17.04.2014 at 16:22, Ted Dunning <[email protected]> wrote:
>
> > Najum,
> >
> > You should also be able to use the ItemSimilarityJob to compute a limited
> > indicator set.
> >
> > This is stepping off of the path you have been on, but it would allow you
> > to deploy the recommender via a search engine.
> >
> > That makes a lot of code simply vanish.  This is also a well-trodden
> > production path.
> >
> >
> >
> >
> > On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali <[email protected]> wrote:
> >
> >> @Sebastian
> >>
> >> Wow, you are right. The original CSV file is about 21 MB and the
> >> corresponding precomputed item-item similarity file is about 260 MB!
> >> And yes, there are far more than 50 "most similar items" per item.
> >>
> >> Restricting this to the 50 (or so) most similar items per item could
> >> do the trick, as you said.
> >> OK, I will give it a try and reply later.
> >>
> >> By the way, what about SamplingCandidateItemsStrategy or something
> >> like it, passed in through this constructor?
> >>
> >> GenericItemBasedRecommender(DataModel dataModel,
> >>     ItemSimilarity similarity,
> >>     CandidateItemsStrategy candidateItemsStrategy,
> >>     MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy)
> >>
> >> Javadoc:
> >> https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.html#GenericItemBasedRecommender(org.apache.mahout.cf.taste.model.DataModel,%20org.apache.mahout.cf.taste.similarity.ItemSimilarity,%20org.apache.mahout.cf.taste.recommender.CandidateItemsStrategy,%20org.apache.mahout.cf.taste.recommender.MostSimilarItemsCandidateItemsStrategy)
> >>
> >>
> >> On 17.04.2014 at 12:41, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Hi Najum,
> >>
> >> I think I found the problem. Remember: Two items are similar whenever at
> >> least one user interacted with both of them ("the items co-occur").
> >>
> >> In the MovieLens dataset this is unfortunately true for almost all pairs
> >> of items. From 3706 items, more than 11 million similarities are
> >> created. A common remedy (not yet implemented in our precomputation,
> >> unfortunately) is to retain only the top-k most similar items per item.
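
To make the co-occurrence argument concrete, here is a minimal sketch in plain Java (no Mahout involved) that counts how many distinct item pairs share at least one user. The class name `CooccurrenceCount`, the method `countCooccurringPairs`, and the toy data are assumptions made for illustration only; they are not part of Mahout or of the code in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CooccurrenceCount {

    /** Counts distinct unordered item pairs that were both used by at least one user. */
    static long countCooccurringPairs(Map<Long, Set<Long>> itemsByUser) {
        Set<String> pairs = new HashSet<>();
        for (Set<Long> items : itemsByUser.values()) {
            // Sort so that each unordered pair gets one canonical "smaller:larger" key.
            List<Long> sorted = new ArrayList<>(items);
            Collections.sort(sorted);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairs.add(sorted.get(i) + ":" + sorted.get(j));
                }
            }
        }
        return pairs.size();
    }

    public static void main(String[] args) {
        // Hypothetical toy data: user 1 touched items 1, 2, 3; user 2 touched items 3, 4.
        Map<Long, Set<Long>> itemsByUser = new HashMap<>();
        itemsByUser.put(1L, new HashSet<>(Arrays.asList(1L, 2L, 3L)));
        itemsByUser.put(2L, new HashSet<>(Arrays.asList(3L, 4L)));
        // Prints "4 of 6 possible pairs co-occur".
        System.out.println(countCooccurringPairs(itemsByUser) + " of 6 possible pairs co-occur");
    }
}
```

Even with two users, most of the possible pairs already co-occur; with thousands of active users the pair set approaches the full item-by-item matrix, which is the effect described above.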
> >>
> >> A solution would be to take the CSV file that is created by
> >> MultithreadedBatchItemSimilarities and postprocess it so that only the
> >> 50 most similar items per item are retained. That should help with your
> >> problem.
> >>
> >> Unfortunately, we don't have code for that yet, maybe you want to try to
> >> write that yourself?
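
One possible shape for that postprocessing step, as a minimal sketch in plain Java. It assumes each CSV line has the form `itemA,itemB,similarity` (the same three columns the parser in the test program further down reads), and it keeps the k highest-scoring rows per value of the first column. The class `TopKSimilarities` and the method `retainTopK` are hypothetical names, not Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopKSimilarities {

    /** Keeps, for each item in the first column, only the k rows with the highest similarity. */
    static List<String> retainTopK(List<String> lines, int k) {
        // Group rows by the first item ID, preserving first-seen order of the items.
        Map<String, List<String[]>> rowsByItem = new LinkedHashMap<>();
        for (String line : lines) {
            String[] row = line.split(",");
            rowsByItem.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row);
        }
        List<String> result = new ArrayList<>();
        for (List<String[]> rows : rowsByItem.values()) {
            rows.stream()
                // Highest similarity first.
                .sorted(Comparator.comparingDouble(
                        (String[] r) -> Double.parseDouble(r[2])).reversed())
                .limit(k)
                .forEach(r -> result.add(String.join(",", r)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows; a real run would read the lines of similarities.csv instead.
        List<String> lines = Arrays.asList("1,2,0.9", "1,3,0.5", "1,4,0.7", "2,3,0.1");
        // Prints 1,2,0.9 then 1,4,0.7 then 2,3,0.1 (one per line).
        retainTopK(lines, 2).forEach(System.out::println);
    }
}
```

This sketch holds all rows in memory, which is simple but may be heavy for a file of several hundred megabytes; a streaming variant with a bounded priority queue per item would be the obvious refinement. It also truncates per first-column item only, so whether a pair survives depends on which of its two items appears first in the row.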
> >>
> >> Best,
> >> Sebastian
> >>
> >> PS: The user-based recommender restricts the number of similar users; I
> >> guess that's why it is so fast here.
> >>
> >>
> >> On 04/17/2014 12:18 PM, Najum Ali wrote:
> >>
> >> Ok, here you go:
> >>
> >> I have created a simple class with a main method (no server or other
> >> stuff):
> >>
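> >> import java.io.BufferedReader;
> >> import java.io.File;
> >> import java.io.FileInputStream;
> >> import java.io.IOException;
> >> import java.io.InputStream;
> >> import java.io.InputStreamReader;
> >> import java.util.Collection;
> >> import java.util.List;
> >> import java.util.function.Function;
> >> import java.util.stream.Collectors;
> >>
> >> import org.apache.mahout.cf.taste.common.TasteException;
> >> import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
> >> import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
> >> import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
> >> import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
> >> import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
> >> import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
> >> import org.apache.mahout.cf.taste.model.DataModel;
> >> import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
> >> import org.apache.mahout.cf.taste.recommender.RecommendedItem;
> >> import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
> >> import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;
> >>
> >> public class RecommenderTest {
> >>
> >>   public static void main(String[] args) throws IOException, TasteException {
> >>     DataModel dataModel = new FileDataModel(new File(
> >>         "/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
> >>     ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
> >>     ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
> >>
> >>     // simItemsPerItem is set to the total number of items, so every similar item is kept.
> >>     String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());
> >>
> >>     // Read the precomputed similarities back from the CSV file.
> >>     InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
> >>     BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
> >>     Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
> >>         bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());
> >>     ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
> >>     ItemBasedRecommender recommenderWithPrecomputation =
> >>         new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
> >>
> >>     // Compare on-the-fly similarity against the precomputed one.
> >>     recommend(recommender);
> >>     recommend(recommenderWithPrecomputation);
> >>   }
> >>
> >>   // Computes the item-item similarities in parallel and writes them to a temp CSV file.
> >>   private static String preComputeSimilarities(ItemBasedRecommender recommender,
> >>       int simItemsPerItem) throws TasteException {
> >>     String pathToAbsolutePath = "";
> >>     try {
> >>       File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
> >>       if (resultFile.exists()) {
> >>         resultFile.delete();
> >>       }
> >>       BatchItemSimilarities batchJob =
> >>           new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
> >>       int numSimilarities = batchJob.computeItemSimilarities(
> >>           Runtime.getRuntime().availableProcessors(), 1, new FileSimilarItemsWriter(resultFile));
> >>       pathToAbsolutePath = resultFile.getAbsolutePath();
> >>       System.out.println("Computed " + numSimilarities + " similarities and saved them to "
> >>           + pathToAbsolutePath);
> >>     } catch (IOException e) {
> >>       System.out.println("Error while writing pre computed similarities to file");
> >>     }
> >>     return pathToAbsolutePath;
> >>   }
> >>
> >>   // Times a single recommend() call for user 1, asking for 10 items.
> >>   private static void recommend(ItemBasedRecommender recommender) throws TasteException {
> >>     long start = System.nanoTime();
> >>     List<RecommendedItem> recommendations = recommender.recommend(1, 10);
> >>     long end = System.nanoTime();
> >>     System.out.println("Created recommendations in "
> >>         + getCalculationTimeInMilliseconds(start, end) + " ms. Recommendations:"
> >>         + recommendations);
> >>   }
> >>
> >>   private static double getCalculationTimeInMilliseconds(long start, long end) {
> >>     double calculationTime = (end - start);
> >>     return (calculationTime / 1_000_000);
> >>   }
> >>
> >>   // Parses one CSV line of the form "itemA,itemB,similarity".
> >>   private static Function<String, GenericItemSimilarity.ItemItemSimilarity>
> >>       mapToItemItemSimilarity = (line) -> {
> >>         String[] row = line.split(",");
> >>         return new GenericItemSimilarity.ItemItemSimilarity(
> >>             Long.parseLong(row[0]), Long.parseLong(row[1]), Double.parseDouble(row[2]));
> >>       };
> >> }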
> >>
> >> And that's the output log:
> >>
> >> 3 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file /Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv
> >> 63 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
> >> 1207 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Processed 1000000 lines
> >> 1208 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 1000209
> >> 1475 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 6040 users
> >> 1599 [main] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - Queued 3706 items in 38 batches
> >> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches
> >> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches. done.
> >> 10978 [pool-1-thread-5] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 4 processed 4 batches. done.
> >> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches
> >> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches. done.
> >> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches
> >> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches. done.
> >> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches
> >> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches. done.
> >> 11730 [pool-1-thread-3] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 2 processed 4 batches. done.
> >> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches
> >> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches. done.
> >> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches
> >> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches. done.
> >> Computed 9174333 similarities and saved them to /var/folders/9g/4h38v1tj3ps9j21skc72b56r0000gn/T/similarities.csv
> >> Created recommendations in *1683.613 ms*. Recommendations:[
> >>   RecommendedItem[item:3890, value:4.6771617],
> >>   RecommendedItem[item:3530, value:4.662509],
> >>   RecommendedItem[item:127, value:4.660716],
> >>   RecommendedItem[item:3323, value:4.660716],
> >>   RecommendedItem[item:3382, value:4.660716],
> >>   RecommendedItem[item:3123, value:4.603366],
> >>   RecommendedItem[item:3233, value:4.5707765],
> >>   RecommendedItem[item:1434, value:4.553473],
> >>   RecommendedItem[item:989, value:4.5263577],
> >>   RecommendedItem[item:2343, value:4.524066]]
> >> Created recommendations in *985.679 ms*. Recommendations:[
> >>   RecommendedItem[item:3530, value:5.0],
> >>   RecommendedItem[item:3382, value:5.0],
> >>   RecommendedItem[item:3890, value:4.6771617],
> >>   RecommendedItem[item:127, value:4.660716],
> >>   RecommendedItem[item:3323, value:4.660716],
> >>   RecommendedItem[item:3123, value:4.603366],
> >>   RecommendedItem[item:3233, value:4.5707765],
> >>   RecommendedItem[item:1434, value:4.553473],
> >>   RecommendedItem[item:989, value:4.5263577],
> >>   RecommendedItem[item:2343, value:4.524066]]
> >>
> >> Again, almost the same result. What I also don't understand is why I am
> >> getting different recommended items.
> >> That really frustrates me.
> >>
> >> You can find the Java file in the attachment.
> >>
> >>
> >>
> >> Greetings from Germany,
> >> Najum
> >>
> >> On 17.04.2014 at 11:44, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Yes, just to make sure the problem is in the mahout code and not in the
> >> surrounding environment.
> >>
> >> On 04/17/2014 11:43 AM, Najum Ali wrote:
> >>
> >> @Sebastian
> >> What do you mean by a standalone recommender? A simple offline Java main
> >> program?
> >>
> >> On 17.04.2014 at 11:41, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Could you take the output of the precomputation, feed it into a
> >> standalone recommender and test it there?
> >>
> >>
> >> On 04/17/2014 11:37 AM, Najum Ali wrote:
> >>
> >> @sebastian
> >>
> >> Are you sure that the precomputation is done only once and not in every
> >> request?
> >>
> >> Yes. In Spring, a @Bean-annotated object is a singleton instance by
> >> default.
> >> I also just checked it with a System.out.println().
> >> Here is my log:
> >>
> >> System.out.println("----> precomputation done!") is called before
> >> returning the GenericItemSimilarity.
> >>
> >> The first two log entries are item-based recommendations with Pearson similarity.
> >> The third and fourth log entries are also item-based, using the precomputed similarity.
> >> The last log entry is the user-based recommender using Pearson.
> >>
> >> Look at the huge time difference!
> >>
> >> On 17.04.2014 at 11:23, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Najum,
> >>
> >> this is really strange; feeding an item-based recommender with
> >> precomputed similarities should give you superfast recommendations.
> >>
> >> Are you sure that the precomputation is done only once and not in every
> >> request?
> >>
> >> --sebastian
> >>
> >> On 04/17/2014 11:17 AM, Najum Ali wrote:
> >>
> >> Hi guys,
> >>
> >> I have created a precomputed item-item similarity collection for a
> >> GenericItemBasedRecommender.
> >> Using the 1M MovieLens data, my item-based recommender is only 40-50%
> >> faster than without precomputation (around 589.5 ms instead of 1222.9 ms).
> >> The user-based recommender, by contrast, is really fast: around 24.2 ms.
> >> How can this happen?
> >>
> >> Here are more details about my implementation:
> >>
> >> CSV file: 1M preferences, 6040 users, 3706 items
> >>
> >> I'm showing my implementation as screenshots because of the better
> >> syntax highlighting.
> >> My recommender runs inside a web server (Jetty) using Spring 4 and
> >> Java 8. I receive recommendations from a web service (JSON).
> >>
> >> For the DataModel, I'm using FileDataModel.
> >>
> >>
> >> The code below creates a precomputed ItemSimilarity when I start the
> >> web server and the property isItemPreComputationEnabled is set to true:
> >>
> >>
> >> For time measurement I'm using AOP. I measure the whole time from
> >> entering my controller to sending the response, based on the difference
> >> between two System.nanoTime() calls. The same measurement is used for
> >> the user-based recommender.
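
The nanoTime-based measurement described above can also be reproduced outside the web stack. The following is a minimal sketch, not the AOP code from the original post (which was only shown as a screenshot): the class `Timing` and its methods are hypothetical names. It adds a few warm-up calls before measuring, on the assumption that a single cold call may include JIT compilation and cache-filling costs.

```java
import java.util.function.Supplier;

public class Timing {

    /** Converts a pair of System.nanoTime() readings to elapsed milliseconds. */
    static double nanosToMillis(long start, long end) {
        return (end - start) / 1_000_000.0;
    }

    /** Runs the task a few times unmeasured, then returns the average time per measured run. */
    static <T> double averageMillis(Supplier<T> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get(); // warm-up call, result discarded
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.get();
        }
        long end = System.nanoTime();
        return nanosToMillis(start, end) / runs;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload; a recommender call would go here instead.
        Supplier<Long> task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        };
        System.out.println("Average: " + averageMillis(task, 5, 20) + " ms");
    }
}
```

Averaging over several runs after a warm-up makes the item-based and user-based numbers easier to compare than one-off measurements.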
> >>
> >> I have tried caching the recommender and the similarity, with no big
> >> difference. I also tried using CandidateItemsStrategy and
> >> MostSimilarItemsCandidateItemsStrategy, but saw no performance boost either.
> >>
> >> public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
> >>     throws TasteException {
> >>   final int numberOfUsers = dataModel.getNumUsers();
> >>   final int numberOfItems = dataModel.getNumItems();
> >>   CandidateItemsStrategy candidateItemsStrategy =
> >>       new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
> >>   MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
> >>       new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
> >>   return model -> new GenericItemBasedRecommender(model, similarity,
> >>       candidateItemsStrategy, mostSimilarStrategy);
> >> }
> >>
> >> I don't know why item-based takes so much longer than user-based.
> >> User-based is extremely fast. I even tried datasets with 100k and
> >> 10 million preferences (MovieLens). Every time, user-based is much
> >> faster, for any similarity.
> >>
> >> I hope someone can help me understand this. Maybe I'm doing something
> >> wrong.
> >>
> >> Thanks!! :))
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>
>
