Rose Aysina created SPARK-33487:
-----------------------------------

             Summary: Let ML ALS recommend for BOTH subsets - users and items
                 Key: SPARK-33487
                 URL: https://issues.apache.org/jira/browse/SPARK-33487
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 3.0.1
            Reporter: Rose Aysina


Currently ALS in Spark ML supports the following methods for getting recommendations (a brief usage sketch follows the list):
 * {{recommendForAllUsers(numItems: Int): DataFrame}}
 * {{recommendForAllItems(numUsers: Int): DataFrame}}
 * {{recommendForUserSubset(dataset: Dataset[_], numItems: Int): DataFrame}}
 * {{recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame}}
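
For reference, a minimal sketch of how the existing subset APIs are used (the column names and the {{ratings}} DataFrame are assumptions for illustration):

{code:scala}
import org.apache.spark.ml.recommendation.ALS

// Illustrative setup: the column names and the `ratings` DataFrame are assumptions.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
val model = als.fit(ratings)

// Top-10 items for a chosen subset of users.
val userSubset = ratings.select("userId").distinct().limit(100)
val topItemsForUsers = model.recommendForUserSubset(userSubset, 10)

// Top-10 users for a chosen subset of items.
val itemSubset = ratings.select("itemId").distinct().limit(100)
val topUsersForItems = model.recommendForItemSubset(itemSubset, 10)
{code}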

 

*Feature request:* add a method that recommends a subset of items for a subset 
of users, i.e. both users and items are selected from provided subsets. 

*Why it is important:* in real-time recommender systems you usually make 
predictions only for the currently active users (that is why we need a subset of 
users). And you cannot simply recommend every item you have; only the items that 
satisfy some business filters are allowed (that is why we need a subset of items). 

*For example:* consider a real-time news recommender system. Prediction is done 
for a small subset of users (say, the visitors of the last minute), but it is 
not allowed to recommend old news, or news not related to the user's country, 
etc., so at each prediction we have a "white" list of items.

That is why it would be extremely useful to control *BOTH* which users get 
recommendations *AND* which items may be included in those recommendations. 

*Related issues:* SPARK-20679, but it only allows subsets of either users 
*OR* items, not both. 

*What we do now:* apply additional filtering after the 
{{recommendForUserSubset}} call, but this approach has a significant cost: we 
must request recommendations over all items, i.e. *{{numItems = # all available items}}*, 
then filter, and only then select the top-k among them.
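
A minimal sketch of this workaround (variable names such as {{userSubset}}, {{allowedItems}}, {{totalItemCount}} and {{numItems}} are assumptions; the recommendation struct fields follow the configured user/item columns):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number}

// Assumed: model is a fitted ALSModel with userCol = "userId", itemCol = "itemId";
// userSubset: DataFrame of userId; allowedItems: DataFrame of itemId;
// totalItemCount: Int = number of all available items; numItems: Int = final top-k.
val allRecs: DataFrame = model.recommendForUserSubset(userSubset, totalItemCount)

val byScore = Window.partitionBy("userId").orderBy(col("rating").desc)

val filteredTopK = allRecs
  .select(col("userId"), explode(col("recommendations")).as("rec"))
  .select(col("userId"), col("rec.itemId").as("itemId"), col("rec.rating").as("rating"))
  .join(allowedItems, Seq("itemId"))               // keep only whitelisted items
  .withColumn("rank", row_number().over(byScore))
  .filter(col("rank") <= numItems)                 // final top-k per user
  .drop("rank")
{code}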

*Why it is bad:* the subset of items that is allowed to be recommended at a given 
moment is usually much smaller than the number of all items seen in the original 
data (in my real dataset it is 220k seen items vs 500 allowed). 

*Design:* I am sorry, I am not familiar with Spark internals, so I offer a 
solution based just on my human logic :) 
{code:scala}
def recommendForUserItemSubsets(userDataset: Dataset[_],
                                itemDataset: Dataset[_],
                                numItems: Int): DataFrame = {
  // Restrict the factor matrices to the requested user and item subsets,
  // then score only those pairs.
  val userFactorSubset = getSourceFactorSubset(userDataset, userFactors, $(userCol))
  val itemFactorSubset = getSourceFactorSubset(itemDataset, itemFactors, $(itemCol))
  recommendForAll(userFactorSubset, itemFactorSubset, $(userCol), $(itemCol), numItems, $(blockSize))
}
{code}
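
A hypothetical call site for the proposed method (it does not exist in Spark yet; {{userSubset}} and {{allowedItems}} are illustrative DataFrames of user and item ids):

{code:scala}
// Hypothetical: recommend at most 10 allowed items to the active users only.
val recs = model.recommendForUserItemSubsets(userSubset, allowedItems, 10)
{code}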
 

I would be glad to receive some feedback on whether this is a reasonable 
request, and on possibly more efficient workarounds. 

 

Thanks!


