Re: What is the most efficient and scalable way to get all the recommendation results from an ALS model?

2016-03-20 Thread Hiroyuki Yamada
Could anyone give me some advice, recommendations, or usual ways to do
this?

I am trying to get all (probably top 100) product recommendations for each
user from a model (MatrixFactorizationModel),
but I haven't figured out how to do it efficiently yet.

So far,
calling the predict method (predictAll in pyspark) with the whole user-product matrix
uses too much memory and couldn't complete due to a lack of memory,
and
calling predict for each user (or for each batch of users, say 100 users or so)
takes too much time to get all the recommendations.
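
For reference, here is roughly what the batched attempt looks like (just a
sketch; model, sc, all_user_ids, and all_product_ids stand in for my actual
objects):

batch_size = 100   # score 100 users at a time
top_n = 100
all_results = []
for i in range(0, len(all_user_ids), batch_size):
    batch = all_user_ids[i:i + batch_size]
    # all (user, product) pairs for this batch of users
    pairs = sc.parallelize(batch).cartesian(sc.parallelize(all_product_ids))
    preds = model.predictAll(pairs)  # RDD of Rating(user, product, rating)
    top = (preds.map(lambda r: (r.user, (r.product, r.rating)))
                .groupByKey()
                .mapValues(lambda prs: sorted(prs, key=lambda x: -x[1])[:top_n]))
    all_results.extend(top.collect())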

I am using Spark 1.4.1 and running a 5-node cluster with 8GB RAM each.
I only use a small-sized data set so far, like about 5 users and 5000
products with only about 10 ratings.

Thanks.


On Sat, Mar 19, 2016 at 7:58 PM, Hiroyuki Yamada  wrote:

> Hi,
>
> I'm testing Collaborative Filtering with MLlib.
> Building a model with ALS.trainImplicit (or train) seems scalable as far as I
> have tested,
> but I'm wondering how I can get all the recommendation results efficiently.
>
> The predictAll method can get all the results,
> but it needs the whole user-product matrix in memory as an input.
> So if there are 1 million users and 1 million products, then the number of
> elements is too large (1 million x 1 million)
> and the amount of memory to hold them is more than a few TB even when the
> element size is only 4 bytes,
> which is not a realistic size of memory even now.
>
> # (1,000,000 * 1,000,000) * 4 bytes / 1000^4 ≈ 4 TB
>
> We can, of course, use the predict method per user,
> but, as far as I have tried, it is very slow to get 1 million users' results.
>
> Am I missing something?
> Are there any better ways to get all the recommendation results in a
> scalable and efficient way?
>
> Best regards,
> Hiro
>
>
>


What is the most efficient and scalable way to get all the recommendation results from an ALS model?

2016-03-19 Thread Hiroyuki Yamada
Hi,

I'm testing Collaborative Filtering with MLlib.
Building a model with ALS.trainImplicit (or train) seems scalable as far as I
have tested,
but I'm wondering how I can get all the recommendation results efficiently.

The predictAll method can get all the results,
but it needs the whole user-product matrix in memory as an input.
So if there are 1 million users and 1 million products, then the number of
elements is too large (1 million x 1 million)
and the amount of memory to hold them is more than a few TB even when the
element size is only 4 bytes,
which is not a realistic size of memory even now.

# (1,000,000 * 1,000,000) * 4 bytes / 1000^4 ≈ 4 TB
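
Concretely, the input predictAll would need is something like the following
sketch (users and products being RDDs of all the IDs); with 1 million of each,
that is 10^12 pairs:

pairs = users.cartesian(products)      # RDD of (user_id, product_id) pairs
predictions = model.predictAll(pairs)  # far too large to materialize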

We can, of course, use the predict method per user,
but, as far as I have tried, it is very slow to get 1 million users' results.

Am I missing something?
Are there any better ways to get all the recommendation results in a
scalable and efficient way?

Best regards,
Hiro


spark-submit with cluster deploy mode fails with ClassNotFoundException (jars are not passed around properly?)

2016-03-11 Thread Hiroyuki Yamada
Hi,

I am trying to use spark-submit with cluster deploy mode on a single node,
but I keep getting a ClassNotFoundException as shown below.
(In this case, snakeyaml.jar is not found by the Spark cluster.)

===

16/03/12 14:19:12 INFO Remoting: Starting remoting
16/03/12 14:19:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://Driver@192.168.1.2:52993]
16/03/12 14:19:12 INFO util.Utils: Successfully started service 'Driver' on port 52993.
16/03/12 14:19:12 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@192.168.1.2:52985/user/Worker
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
    at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/yaml/snakeyaml/Yaml
    at com.analytics.config.YamlConfigLoader.loadConfig(YamlConfigLoader.java:30)
    at com.analytics.api.DeclarativeAnalyticsFactory.create(DeclarativeAnalyticsFactory.java:21)
    at com.analytics.program.QueryExecutor.main(QueryExecutor.java:12)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: org.yaml.snakeyaml.Yaml
    at java.lang.ClassLoader.findClass(ClassLoader.java:530)
    at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
    at org.apache.spark.util.ChildFirstURLClassLoader.liftedTree1$1(MutableURLClassLoader.scala:75)
    at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(MutableURLClassLoader.scala:71)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 9 more
16/03/12 14:19:12 INFO util.Utils: Shutdown hook called



I can submit a job successfully with client mode, but I can't with cluster
mode,
so it seems the jars (snakeyaml) are not being passed to the cluster properly.

The actual command I tried is:

$ spark-submit --master spark://192.168.1.2:6066 --deploy-mode cluster \
    --jars <all the jars, comma-separated> \
    --class com.analytics.program.QueryExecutor analytics.jar
(of course, snakeyaml.jar is specified after --jars)

I tried spark.executor.extraClassPath and spark.driver.extraClassPath in
spark-defaults.conf to specify snakeyaml.jar,
but neither of those worked.
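
For reference, this is the kind of thing I put in spark-defaults.conf (the
paths are just placeholders for my actual local paths):

spark.driver.extraClassPath     /path/to/snakeyaml.jar
spark.executor.extraClassPath   /path/to/snakeyaml.jar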


I also found a couple of similar issues posted on the mailing list and other
sites,
but they either got no proper response or the suggested fixes didn't work for me.

<https://mail-archives.apache.org/mod_mbox/spark-user/201505.mbox/%3CCAGSyEuApEkfO_2-iiiuyS2eeg=w_jkf83vcceguns4douod...@mail.gmail.com%3E>
<http://stackoverflow.com/questions/34272426/how-to-give-dependent-jars-to-spark-submit-in-cluster-mode>
<https://support.datastax.com/hc/en-us/articles/207442243-Spark-submit-fails-with-class-not-found-when-deploying-in-cluster-mode>


Could anyone help me with this?

Best regards,
Hiro


Re: which is a more appropriate form of ratings?

2016-02-25 Thread Hiroyuki Yamada
Thanks very much, Nick and Sabarish.
That helps me a lot.

Regards,
Hiro

On Thu, Feb 25, 2016 at 8:52 PM, Nick Pentreath 
wrote:

> Yes, ALS requires the aggregated version (A). You can use decimal or whole
> numbers for the rating, depending on your application, as for implicit data
> they are not "ratings" but rather "weights".
>
> A common approach is to apply different weightings to different user
> events (such as 1.0 for a page view, 5.0 for a purchase, 2.0 for a like,
> etc). That allows all user event data to be aggregated together in a fairly
> principled manner. The weights however need to be specified upfront in
> order to do that aggregation (they could be selected via cross-validation,
> domain knowledge or the relative frequency of each event within a dataset,
> for example).
>
>
> On Thu, 25 Feb 2016 at 13:26 Sabarish Sasidharan 
> wrote:
>
>> I believe the ALS algorithm expects the ratings to be aggregated (A). I don't
>> see why you have to use decimals for the rating.
>>
>> Regards
>> Sab
>>
>> On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada 
>> wrote:
>>
>>> Hello.
>>>
>>> I just started working on CF in MLlib.
>>> I am using trainImplicit because I only have implicit ratings like page
>>> views.
>>>
>>> I am wondering which is a more appropriate form of ratings.
>>> Let's assume that view count is regarded as a rating and
>>> user 1 sees page 1 three times, sees page 2 twice, and so on.
>>>
>>> In this case, I think ratings can be formatted like the following 2
>>> cases. (Of course it is actually an RDD.)
>>>
>>> A:
>>> user_id,page_id,rating(page view)
>>> 1,1,0.3
>>> 1,2,0.2
>>> ...
>>>
>>> B:
>>> user_id,page_id,rating(page view)
>>> 1,1,0.1
>>> 1,1,0.1
>>> 1,1,0.1
>>> 1,2,0.1
>>> 1,2,0.1
>>> ...
>>>
>>> Is it allowed to have a form like B?
>>> If it is, which is better? (Is there any difference between them?)
>>>
>>> Best,
>>> Hiro
>>>
>>>
>>>
>>>
>>


which is a more appropriate form of ratings?

2016-02-25 Thread Hiroyuki Yamada
Hello.

I just started working on CF in MLlib.
I am using trainImplicit because I only have implicit ratings like page
views.

I am wondering which is a more appropriate form of ratings.
Let's assume that view count is regarded as a rating and
user 1 sees page 1 three times, sees page 2 twice, and so on.

In this case, I think ratings can be formatted like the following 2 cases.
(Of course it is actually an RDD.)

A:
user_id,page_id,rating(page view)
1,1,0.3
1,2,0.2
...

B:
user_id,page_id,rating(page view)
1,1,0.1
1,1,0.1
1,1,0.1
1,2,0.1
1,2,0.1
...

Is it allowed to have a form like B?
If it is, which is better? (Is there any difference between them?)
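
In case it matters, here is a rough sketch of how I could collapse form B into
form A before training (assuming a raw_events RDD of (user_id, page_id, value)
tuples; the names are just placeholders):

from pyspark.mllib.recommendation import Rating

# Sketch: sum the per-event values (form B) into one weighted rating
# per (user, page) pair (form A) before passing them to trainImplicit.
aggregated = (raw_events.map(lambda r: ((r[0], r[1]), r[2]))
                        .reduceByKey(lambda a, b: a + b)
                        .map(lambda kv: Rating(kv[0][0], kv[0][1], kv[1])))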

Best,
Hiro


Re: What is the point of the alpha value in Collaborative Filtering in MLlib?

2016-02-25 Thread Hiroyuki Yamada
Hello Sean,

Thank you very much for the quick response.
That helps me a lot to understand it better !

Best regards,
Hiro

On Thu, Feb 25, 2016 at 6:59 PM, Sean Owen  wrote:

> This isn't specific to Spark; it's from the original paper.
>
> alpha doesn't do a whole lot, and it is a global hyperparam. It
> controls the relative weight of observed versus unobserved
> user-product interactions in the factorization. Higher alpha means
> it's much more important to faithfully reproduce the interactions that
> *did* happen as a "1", than reproduce the interactions that *didn't*
> happen as a "0".
>
> I don't think there's a good rule of thumb about what value to pick;
> it can't be less than 0 (less than 1 doesn't make much sense either),
> and you might just try values between 1 and 100 to see what gives the
> best result.
>
> I think that generally sparser input needs higher alpha, and maybe
> someone tells me that really alpha should be a function of the
> sparsity, but I've never seen that done.
>
>
>
> On Thu, Feb 25, 2016 at 6:33 AM, Hiroyuki Yamada 
> wrote:
> > Hi, I've been doing some POC for CF in MLlib.
> > In my environment, ratings are all implicit, so I am trying to use the
> > trainImplicit method (in Python).
> >
> > The trainImplicit method takes alpha as one of the arguments to specify a
> > confidence for the ratings, as described in
> > <http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html>,
> > but the alpha value is global for all the ratings, so I am not sure why we
> > need this.
> > (If it were per rating, it would make sense to me, though.)
> >
> > What difference does it make to set different alpha values for exactly the
> > same data set?
> >
> > I would really appreciate it if someone could give me a reasonable
> > explanation for this.
> >
> > Best regards,
> > Hiro
>


Re: What is the point of the alpha value in Collaborative Filtering in MLlib?

2016-02-24 Thread Hiroyuki Yamada
Hi, I've been doing some POC for CF in MLlib.
In my environment, ratings are all implicit, so I am trying to use the
trainImplicit method (in Python).

The trainImplicit method takes alpha as one of the arguments to specify a
confidence for the ratings, as described in
<http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html>,
but the alpha value is global for all the ratings, so I am not sure why we
need this.
(If it were per rating, it would make sense to me, though.)
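
For context, this is roughly how I am calling it (a minimal sketch; the
parameter values are just placeholders):

from pyspark.mllib.recommendation import ALS

# alpha is passed once, globally, for the whole data set
model = ALS.trainImplicit(ratings, rank=10, iterations=10,
                          lambda_=0.01, alpha=40.0)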

What difference does it make to set different alpha values for exactly the
same data set?

I would really appreciate it if someone could give me a reasonable explanation
for this.

Best regards,
Hiro