Re: Performance Issue using item-based approach!

Ted Dunning Fri, 18 Apr 2014 10:31:32 -0700

Shouldn't, yes.

But for a toy dataset, it might work out.



On Fri, Apr 18, 2014 at 10:25 AM, Sebastian Schelter <[email protected]> wrote:

> You can, but you shouldn't :)
>
> On 04/18/2014 07:23 PM, Ted Dunning wrote:
>
>> You can always run Hadoop in a local mode.  Nothing prevents a single node
>> from being a cluster.  :-)
>>
>>
>> On Thu, Apr 17, 2014 at 7:43 AM, Najum Ali <[email protected]>
>> wrote:
>>
>>  Ted,
>>>
>>> Is it also possible to use ItemSimilarityJob in a non-distributed
>>> environment?
>>>
>>> On 17.04.2014 at 16:22, Ted Dunning <[email protected]> wrote:
>>>
>>>  Najum,
>>>>
>>>> You should also be able to use the ItemSimilarityJob to compute a limited
>>>> indicator set.
>>>>
>>>> This is stepping off of the path you have been on, but it would allow you
>>>> to deploy the recommender via a search engine.
>>>>
>>>> That makes a lot of code simply vanish.  This is also a well-trodden
>>>> production path.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali <[email protected]> wrote:
>>>>
>>>>  @Sebastian
>>>>>
>>>>> Wow, you are right. The original CSV file is about 21 MB and the
>>>>> corresponding precomputed item-item similarity file is about 260 MB!
>>>>> And yes, there are far more than 50 "most similar items" for an item.
>>>>>
>>>>> Restricting this to the 50 (or so) most similar items per item could
>>>>> do the trick, as you said.
>>>>> OK, I will give it a try and reply later.
>>>>>
>>>>> By the way, what about the SamplingCandidateItemsStrategy or something
>>>>> like this, using this constructor:
>>>>>
>>>>> GenericItemBasedRecommender(DataModel dataModel,
>>>>>     ItemSimilarity similarity,
>>>>>     CandidateItemsStrategy candidateItemsStrategy,
>>>>>     MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy)
>>>>>
>>>>> https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.html#GenericItemBasedRecommender(org.apache.mahout.cf.taste.model.DataModel,%20org.apache.mahout.cf.taste.similarity.ItemSimilarity,%20org.apache.mahout.cf.taste.recommender.CandidateItemsStrategy,%20org.apache.mahout.cf.taste.recommender.MostSimilarItemsCandidateItemsStrategy)
>>>>>
>>>>>
>>>>> On 17.04.2014 at 12:41, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Hi Najum,
>>>>>
>>>>> I think I found the problem. Remember: two items are similar whenever at
>>>>> least one user interacted with both of them ("the items co-occur").
>>>>>
>>>>> In the MovieLens dataset this is unfortunately true for almost all pairs
>>>>> of items. From 3706 items, more than 11 million similarities are created.
>>>>> A common approach for that (which is unfortunately not yet implemented in
>>>>> our precomputation) is to retain only the top-k similar items per item.
>>>>>
>>>>> A solution would be to take the CSV file that is created by
>>>>> MultithreadedBatchItemSimilarities and postprocess it so that only the 50
>>>>> most similar items per item are retained. That should help with your
>>>>> problem.
>>>>>
>>>>> Unfortunately, we don't have code for that yet. Maybe you want to try to
>>>>> write that yourself?
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> PS: The user-based recommender restricts the number of similar users; I
>>>>> guess that's why it is so fast here.
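
[Editorial sketch] The postprocessing step Sebastian describes above (keep only the 50 most similar items per item) could look roughly like the following. This is a minimal sketch, not code from Mahout or from the thread: the class name TopKSimilarityPruner and the method pruneTopK are hypothetical, and the input format "itemA,itemB,similarity" is an assumption taken from how the test program later in this thread parses the file (row[0], row[1], row[2]). It groups rows by the first column only and uses nothing outside the Java standard library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.stream.Stream;

// Hypothetical helper: prunes a similarity CSV to the top-k rows per item.
final class TopKSimilarityPruner {

    /** One parsed CSV row; assumed layout is itemA,itemB,similarity. */
    static final class Row {
        final long itemA;
        final double similarity;
        final String line;

        Row(String line) {
            String[] cols = line.split(",");
            this.itemA = Long.parseLong(cols[0].trim());
            this.similarity = Double.parseDouble(cols[2].trim());
            this.line = line;
        }
    }

    private static final Comparator<Row> BY_SIMILARITY =
            Comparator.comparingDouble((Row r) -> r.similarity);

    /** Keeps, for every itemA, only the k rows with the highest similarity. */
    static List<String> pruneTopK(Iterable<String> lines, int k) {
        // One min-heap per item: the weakest retained row sits at the head.
        Map<Long, PriorityQueue<Row>> best = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Row row = new Row(line);
            PriorityQueue<Row> heap =
                    best.computeIfAbsent(row.itemA, id -> new PriorityQueue<>(BY_SIMILARITY));
            heap.add(row);
            if (heap.size() > k) {
                heap.poll(); // drop the currently weakest row for this item
            }
        }
        // Emit items in first-seen order, rows sorted by descending similarity.
        List<String> result = new ArrayList<>();
        for (PriorityQueue<Row> heap : best.values()) {
            List<Row> kept = new ArrayList<>(heap);
            kept.sort(BY_SIMILARITY.reversed());
            for (Row r : kept) {
                result.add(r.line);
            }
        }
        return result;
    }

    // Usage: java TopKSimilarityPruner <input.csv> <output.csv> <k>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        int k = Integer.parseInt(args[2]);
        List<String> pruned;
        try (Stream<String> lines = Files.lines(in)) {
            pruned = pruneTopK(lines::iterator, k);
        }
        Files.write(out, pruned);
    }
}
```

Only k rows per item are held in memory at any time, so the full input file never has to be loaded at once. Whether pruning by the first column alone is enough depends on how the consuming similarity class treats symmetric pairs, which this sketch does not check.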
>>>>>
>>>>>
>>>>> On 04/17/2014 12:18 PM, Najum Ali wrote:
>>>>>
>>>>> Ok, here you go:
>>>>>
>>>>> I have created a simple class with a main method (no server or other
>>>>> stuff):
>>>>>
>>>>> import java.io.BufferedReader;
>>>>> import java.io.File;
>>>>> import java.io.FileInputStream;
>>>>> import java.io.IOException;
>>>>> import java.io.InputStreamReader;
>>>>> import java.util.Collection;
>>>>> import java.util.List;
>>>>> import java.util.function.Function;
>>>>> import java.util.stream.Collectors;
>>>>>
>>>>> import org.apache.mahout.cf.taste.common.TasteException;
>>>>> import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>>>>> import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
>>>>> import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
>>>>> import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
>>>>> import org.apache.mahout.cf.taste.impl.similarity.precompute.FileSimilarItemsWriter;
>>>>> import org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities;
>>>>> import org.apache.mahout.cf.taste.model.DataModel;
>>>>> import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
>>>>> import org.apache.mahout.cf.taste.recommender.RecommendedItem;
>>>>> import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
>>>>> import org.apache.mahout.cf.taste.similarity.precompute.BatchItemSimilarities;
>>>>>
>>>>> public class RecommenderTest {
>>>>>
>>>>>     public static void main(String[] args) throws IOException, TasteException {
>>>>>         DataModel dataModel = new FileDataModel(new File(
>>>>>             "/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
>>>>>         ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
>>>>>         ItemBasedRecommender recommender =
>>>>>             new GenericItemBasedRecommender(dataModel, similarity);
>>>>>
>>>>>         String pathToPreComputedFile =
>>>>>             preComputeSimilarities(recommender, dataModel.getNumItems());
>>>>>
>>>>>         // Load the precomputed similarities; the reader is closed automatically.
>>>>>         Collection<GenericItemSimilarity.ItemItemSimilarity> correlations;
>>>>>         try (BufferedReader bufferedReader = new BufferedReader(
>>>>>                 new InputStreamReader(new FileInputStream(new File(pathToPreComputedFile))))) {
>>>>>             correlations = bufferedReader.lines()
>>>>>                 .map(mapToItemItemSimilarity)
>>>>>                 .collect(Collectors.toList());
>>>>>         }
>>>>>         ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
>>>>>         ItemBasedRecommender recommenderWithPrecomputation =
>>>>>             new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
>>>>>
>>>>>         recommend(recommender);
>>>>>         recommend(recommenderWithPrecomputation);
>>>>>     }
>>>>>
>>>>>     // Writes all item-item similarities to a CSV file in the temp directory.
>>>>>     private static String preComputeSimilarities(ItemBasedRecommender recommender,
>>>>>             int simItemsPerItem) throws TasteException {
>>>>>         String pathToAbsolutePath = "";
>>>>>         try {
>>>>>             File resultFile = new File(System.getProperty("java.io.tmpdir"),
>>>>>                 "similarities.csv");
>>>>>             if (resultFile.exists()) {
>>>>>                 resultFile.delete();
>>>>>             }
>>>>>             BatchItemSimilarities batchJob =
>>>>>                 new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
>>>>>             int numSimilarities = batchJob.computeItemSimilarities(
>>>>>                 Runtime.getRuntime().availableProcessors(), 1,
>>>>>                 new FileSimilarItemsWriter(resultFile));
>>>>>             pathToAbsolutePath = resultFile.getAbsolutePath();
>>>>>             System.out.println("Computed " + numSimilarities
>>>>>                 + " similarities and saved them to " + pathToAbsolutePath);
>>>>>         } catch (IOException e) {
>>>>>             System.out.println("Error while writing pre computed similarities to file");
>>>>>         }
>>>>>         return pathToAbsolutePath;
>>>>>     }
>>>>>
>>>>>     // Times a single recommend() call for user 1, asking for 10 items.
>>>>>     private static void recommend(ItemBasedRecommender recommender) throws TasteException {
>>>>>         long start = System.nanoTime();
>>>>>         List<RecommendedItem> recommendations = recommender.recommend(1, 10);
>>>>>         long end = System.nanoTime();
>>>>>         System.out.println("Created recommendations in "
>>>>>             + getCalculationTimeInMilliseconds(start, end)
>>>>>             + " ms. Recommendations:" + recommendations);
>>>>>     }
>>>>>
>>>>>     private static double getCalculationTimeInMilliseconds(long start, long end) {
>>>>>         double calculationTime = (end - start);
>>>>>         return (calculationTime / 1_000_000);
>>>>>     }
>>>>>
>>>>>     // Parses one CSV line of the form itemA,itemB,similarity.
>>>>>     private static Function<String, GenericItemSimilarity.ItemItemSimilarity>
>>>>>             mapToItemItemSimilarity = (line) -> {
>>>>>         String[] row = line.split(",");
>>>>>         return new GenericItemSimilarity.ItemItemSimilarity(
>>>>>             Long.parseLong(row[0]), Long.parseLong(row[1]),
>>>>>             Double.parseDouble(row[2]));
>>>>>     };
>>>>> }
>>>>>
>>>>> And that's the output log:
>>>>>
>>>>> 3 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file /Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv
>>>>> 63 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
>>>>> 1207 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Processed 1000000 lines
>>>>> 1208 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 1000209
>>>>> 1475 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 6040 users
>>>>> 1599 [main] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - Queued 3706 items in 38 batches
>>>>> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches
>>>>> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches. done.
>>>>> 10978 [pool-1-thread-5] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 4 processed 4 batches. done.
>>>>> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches
>>>>> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches. done.
>>>>> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches
>>>>> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches. done.
>>>>> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches
>>>>> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches. done.
>>>>> 11730 [pool-1-thread-3] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 2 processed 4 batches. done.
>>>>> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches
>>>>> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches. done.
>>>>> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches
>>>>> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches. done.
>>>>> Computed 9174333 similarities and saved them to /var/folders/9g/4h38v1tj3ps9j21skc72b56r0000gn/T/similarities.csv
>>>>> Created recommendations in *1683.613 ms*.
>>>>> Recommendations:[RecommendedItem[item:3890, value:4.6771617],
>>>>> RecommendedItem[item:3530, value:4.662509], RecommendedItem[item:127,
>>>>> value:4.660716], RecommendedItem[item:3323, value:4.660716],
>>>>> RecommendedItem[item:3382, value:4.660716], RecommendedItem[item:3123,
>>>>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>>>>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>>>>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>>>>> Created recommendations in *985.679 ms*.
>>>>> Recommendations:[RecommendedItem[item:3530, value:5.0],
>>>>> RecommendedItem[item:3382, value:5.0], RecommendedItem[item:3890,
>>>>> value:4.6771617], RecommendedItem[item:127, value:4.660716],
>>>>> RecommendedItem[item:3323, value:4.660716], RecommendedItem[item:3123,
>>>>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>>>>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>>>>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>>>>>
>>>>> Again, almost the same results. What I also don't understand is why I am
>>>>> getting different recommended items.
>>>>> That really frustrates me…
>>>>>
>>>>> You can find the Java file in the attachment.
>>>>>
>>>>>
>>>>>
>>>>> Greetings from Germany,
>>>>> Najum
>>>>>
>>>>> On 17.04.2014 at 11:44, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Yes, just to make sure the problem is in the mahout code and not in the
>>>>> surrounding environment.
>>>>>
>>>>> On 04/17/2014 11:43 AM, Najum Ali wrote:
>>>>>
>>>>> @Sebastian
>>>>> What do you mean by a standalone recommender? A simple offline Java main
>>>>> program?
>>>>>
>>>>> On 17.04.2014 at 11:41, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Could you take the output of the precomputation, feed it into a
>>>>> standalone recommender and test it there?
>>>>>
>>>>>
>>>>> On 04/17/2014 11:37 AM, Najum Ali wrote:
>>>>>
>>>>> @sebastian
>>>>>
>>>>> Are you sure that the precomputation is done only once and not in every
>>>>> request?
>>>>>
>>>>> Yes, a @Bean-annotated object is a singleton instance by default in Spring.
>>>>> I also just tested it using a System.out.println().
>>>>> Here is my log:
>>>>>
>>>>> System.out.println("----> precomputation done!") is called before
>>>>> returning the GenericItemSimilarity.
>>>>>
>>>>> The first two recommendations are item-based with Pearson similarity.
>>>>> The third and fourth log entries are also item-based, using the
>>>>> precomputed similarity.
>>>>> The last log entry is the user-based recommender using Pearson.
>>>>>
>>>>> Look at the huge time difference!
>>>>>
>>>>> On 17.04.2014 at 11:23, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Najum,
>>>>>
>>>>> this is really strange, feeding an ItemBased Recommender with precomputed
>>>>> similarities should give you superfast recommendations.
>>>>>
>>>>> Are you sure that the precomputation is done only once and not in every
>>>>> request?
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 04/17/2014 11:17 AM, Najum Ali wrote:
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have created a precomputed item-item similarity collection for a
>>>>> GenericItemBasedRecommender.
>>>>> Using the 1M MovieLens data, my item-based recommender is only 40-50%
>>>>> faster than without precomputation (about 589.5 ms instead of 1222.9 ms).
>>>>> But the user-based recommender is really fast, about 24.2 ms.
>>>>> How can this happen?
>>>>>
>>>>> Here are more details to my Implementation:
>>>>>
>>>>> CSV File: 1M pref, 6040 Users, 3706 Items
>>>>>
>>>>> For my implementation I'm using screenshots because of the syntax
>>>>> highlighting.
>>>>> My recommender runs inside a web server (Jetty) using Spring 4 and Java 8.
>>>>> I receive recommendations via a web service (JSON).
>>>>>
>>>>> For DataModel, I´m using FileDataModel.
>>>>>
>>>>>
>>>>> The code below creates a precomputed ItemSimilarity when I start the
>>>>> web server and the property isItemPreComputationEnabled is set to true:
>>>>>
>>>>>
>>>>> For time measurement I'm using AOP. I measure the whole time from
>>>>> entering my controller to sending the response, based on the difference
>>>>> between two System.nanoTime() calls. The same measurement is used for
>>>>> the user-based recommender.
>>>>>
>>>>> I have tried caching the recommender and the similarity, with no big
>>>>> difference. I also tried CandidateItemsStrategy and
>>>>> MostSimilarItemsCandidateItemsStrategy, but got no performance boost either.
>>>>>
>>>>> public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
>>>>>         throws TasteException {
>>>>>     final int numberOfUsers = dataModel.getNumUsers();
>>>>>     final int numberOfItems = dataModel.getNumItems();
>>>>>     CandidateItemsStrategy candidateItemsStrategy =
>>>>>             new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>>>>     MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
>>>>>             new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>>>>     return model -> new GenericItemBasedRecommender(model, similarity,
>>>>>             candidateItemsStrategy, mostSimilarStrategy);
>>>>> }
>>>>>
>>>>> I don't know why item-based is taking so much longer than user-based.
>>>>> User-based is fast as hell. I even tried datasets with 100k and
>>>>> 10 million prefs (MovieLens). Every time, user-based is so much faster
>>>>> for any similarity.
>>>>>
>>>>> I hope someone can help me understand this. Maybe I'm doing something
>>>>> wrong.
>>>>>
>>>>> Thanks!! :))
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>
