Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
In my experiment, if I do not call gc() explicitly, the shuffle files are not 
cleaned until the whole job finishes… I don’t know why; maybe the RDDs cannot 
be GCed implicitly.
In my situation, a full GC in the driver takes about 10 seconds, so I start a 
thread in the driver that runs GC every 120 seconds, like this:

while (true) {
System.gc();
Thread.sleep(120 * 1000);
}
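
(Roughly, the thread wrapper I mean looks like this; a sketch only, not my exact code:)

// Sketch: run the periodic GC loop on a background daemon thread in the driver.
val gcThread = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      System.gc()               // ask the JVM to collect, so unreferenced RDDs and shuffles can be cleaned up
      Thread.sleep(120 * 1000)  // wait 120 seconds between collections
    }
  }
})
gcThread.setDaemon(true)        // do not keep the JVM alive just for this thread
gcThread.start()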


It works well now.
Is there a more elegant way to clean the shuffle files?

Best Regards,
Sendong Li



> On Apr 1, 2015, at 5:09 AM, Xiangrui Meng wrote:
> 
> Hey Guoqiang and Sendong,
> 
> Could you comment on the overhead of calling gc() explicitly? The shuffle 
> files should get cleaned within a few seconds after checkpointing, but it is 
> certainly possible to accumulate TBs of files in a few seconds. In this 
> case, calling gc() may work the same as waiting for a few seconds after each 
> checkpoint. Is that correct?
> 
> Best,
> Xiangrui
> 
> On Tue, Mar 31, 2015 at 8:58 AM, lisendong <lisend...@163.com> wrote:
> guoqiang’s method works very well …
> 
> it only takes 1TB disk now.
> 
> thank you very much!
> 
> 
> 
>> On Mar 31, 2015, at 4:47 PM, GuoQiang Li <wi...@qq.com> wrote:
>> 
>> You can try to enforce garbage collection:
>> 
>> /** Run GC and make sure it actually has run */
>> def runGC() {
>>   val weakRef = new WeakReference(new Object())
>>   val startTime = System.currentTimeMillis
>>   System.gc() // Make a best effort to run the garbage collection. It *usually* runs GC.
>>   // Wait until a weak reference object has been GCed
>>   System.runFinalization()
>>   while (weakRef.get != null) {
>>     System.gc()
>>     System.runFinalization()
>>     Thread.sleep(200)
>>     if (System.currentTimeMillis - startTime > 10000) { // give up after ~10 seconds
>>       throw new Exception("automatically cleanup error")
>>     }
>>   }
>> }
>> 
>> 
>> -- Original Message --
>> From: "lisendong" <lisend...@163.com>;
>> Date: Tuesday, Mar 31, 2015, 3:47 PM
>> To: "Xiangrui Meng" <men...@gmail.com>;
>> Cc: "Xiangrui Meng" <m...@databricks.com>; "user" <user@spark.apache.org>;
>> "Sean Owen" <so...@cloudera.com>; "GuoQiang Li" <wi...@qq.com>;
>> Subject: Re: different result from implicit ALS with explicit ALS
>> 
>> I have updated my spark source code to 1.3.1.
>> 
>> The checkpoint works well.
>> 
>> BUT the shuffle data still cannot be deleted automatically… the disk usage 
>> is still 30TB…
>> 
>> I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
>> 
>> Do you know how to solve my problem?
>> 
>> Sendong Li
>> 
>> 
>> 
>>> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>> 
>>> setCheckpointInterval was added in the current master and branch-1.3. 
>>> Please help check whether it works. It will be included in the 1.3.1 and 
>>> 1.4.0 release. -Xiangrui
>>> 
>>> On Mon, Mar 30, 2015 at 7:27 AM, lisendong <lisend...@163.com> wrote:
>>> hi, xiangrui:
>>> I found that the ALS in spark 1.3.0 forgets to call checkpoint() in explicit ALS:
>>> the code is:
>>> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>>> 
>>> 
>>> The checkpoint is very important in my situation, because my task produces 
>>> 1TB of shuffle data in each iteration; if the shuffle data is not deleted 
>>> in each iteration (using checkpoint()), the task will produce 30TB of data…
>>> 
>>> 
>>> So I changed the ALS code and re-compiled it myself, but it seems the 
>>> checkpoint does not take effect, and the task still occupies 30TB of disk… (I 
>>> only added two lines to ALS.scala):
>>> 
>>> 
>>> 
>>> 
>>> 
>>> and the driver’s log seems strange; why is the log printed all together...
>>> 
>>> 
>>> thank you very much!
>>> 
>>> 
>>>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>>>> 
>>>> Thank you very much for your opinion:)
>>>> 
>>>> In our case

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread Xiangrui Meng
Hey Guoqiang and Sendong,

Could you comment on the overhead of calling gc() explicitly? The shuffle
files should get cleaned within a few seconds after checkpointing, but it is
certainly possible to accumulate TBs of files in a few seconds. In this
case, calling gc() may work the same as waiting for a few seconds after
each checkpoint. Is that correct?
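
(Even a rough number would help, e.g. simply timing a manual collection in the driver, along these lines; an illustrative sketch, not code from the project:)

// Time one explicit full-GC pause in the driver JVM.
val start = System.nanoTime()
System.gc()
val pauseMs = (System.nanoTime() - start) / 1e6
println(s"Manual System.gc() took $pauseMs ms")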

Best,
Xiangrui

On Tue, Mar 31, 2015 at 8:58 AM, lisendong  wrote:

> guoqiang ’s method works very well …
>
> it only takes 1TB disk now.
>
> thank you very much!
>
>
>
> On Mar 31, 2015, at 4:47 PM, GuoQiang Li wrote:
>
> You can try to enforce garbage collection:
>
> /** Run GC and make sure it actually has run */
> def runGC() {
>   val weakRef = new WeakReference(new Object())
>   val startTime = System.currentTimeMillis
>   System.gc() // Make a best effort to run the garbage collection. It *usually* runs GC.
>   // Wait until a weak reference object has been GCed
>   System.runFinalization()
>   while (weakRef.get != null) {
>     System.gc()
>     System.runFinalization()
>     Thread.sleep(200)
>     if (System.currentTimeMillis - startTime > 10000) { // give up after ~10 seconds
>       throw new Exception("automatically cleanup error")
>     }
>   }
> }
>
>
> -- Original Message --
> *From:* "lisendong";
> *Date:* Tuesday, Mar 31, 2015, 3:47 PM
> *To:* "Xiangrui Meng";
> *Cc:* "Xiangrui Meng"; "user";
> "Sean Owen"; "GuoQiang Li";
> *Subject:* Re: different result from implicit ALS with explicit ALS
>
> I have updated my spark source code to 1.3.1.
>
> The checkpoint works well.
>
> BUT the shuffle data still cannot be deleted automatically… the disk
> usage is still 30TB…
>
> I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
>
> Do you know how to solve my problem?
>
> Sendong Li
>
>
>
> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng wrote:
>
> setCheckpointInterval was added in the current master and branch-1.3.
> Please help check whether it works. It will be included in the 1.3.1 and
> 1.4.0 release. -Xiangrui
>
> On Mon, Mar 30, 2015 at 7:27 AM, lisendong  wrote:
>
>> hi, xiangrui:
>> I found the ALS of spark 1.3.0 forget to do checkpoint() in explicit ALS:
>> the code is :
>>
>> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>> 
>>
>> the checkpoint is very important in my situation, because my task will
>> produce 1TB shuffle data in each iteration, it the shuffle data is not
>> deleted in each iteration(using checkpoint()), the task will produce 30TB
>> data…
>>
>>
>> So I change the ALS code, and re-compile by myself, but it seems the
>> checkpoint does not take effects, and the task still occupy 30TB disk… ( I
>> only add two lines to the ALS.scala) :
>>
>> 
>>
>>
>>
>> and the driver’s log seems strange, why the log is printed together...
>> 
>>
>> thank you very much!
>>
>>
>> On Feb 26, 2015, at 11:33 PM, 163 wrote:
>>
>> Thank you very much for your opinion:)
>>
>> In our case, maybe it 's dangerous to treat un-observed item as negative
>> interaction(although we could give them small confidence, I think they are
>> still incredible...)
>>
>> I will do more experiments and give you feedback:)
>>
>> Thank you;)
>>
>>
>> On Feb 26, 2015, at 23:16, Sean Owen wrote:
>>
>> I believe that's right, and is what I was getting at. yes the implicit
>> formulation ends up implicitly including every possible interaction in
>> its loss function, even unobserved ones. That could be the difference.
>>
>> This is mostly an academic question though. In practice, you have
>> click-like data and should be using the implicit version for sure.
>>
>> However you can give negative implicit feedback to the model. You
>> could consider no-click as a mild, observed, negative interaction.
>> That is: supply a small negative value for these cases. Unobserved
>> pairs are not part of the data set. I'd be careful about assuming the
>> lack of an action carries signal.
>>
>> On Thu, Feb 26, 2015 at 3:07 PM, 163  wrote:
>> oh my god, I think I understood...
>> In my case, there are three kinds of user-item pairs:
>>
>> Display and click pair(positive pair)
>> Display but no-click pair(negative pair)
>> No-display pair(unobserved pair)
>>
>> Explicit ALS only consider the first and the second kinds
>> But implicit ALS consider all the three kinds of pair(and con

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
guoqiang’s method works very well …

it only takes 1TB disk now.

thank you very much!



> On Mar 31, 2015, at 4:47 PM, GuoQiang Li wrote:
> 
> You can try to enforce garbage collection:
> 
> /** Run GC and make sure it actually has run */
> def runGC() {
>   val weakRef = new WeakReference(new Object())
>   val startTime = System.currentTimeMillis
>   System.gc() // Make a best effort to run the garbage collection. It 
> *usually* runs GC.
>   // Wait until a weak reference object has been GCed
>   System.runFinalization()
>   while (weakRef.get != null) {
> System.gc()
> System.runFinalization()
> Thread.sleep(200)
> if (System.currentTimeMillis - startTime > 1) {
>   throw new Exception("automatically cleanup error")
> }
>   }
> }
> 
> 
> -- Original Message --
> From: "lisendong" <lisend...@163.com>;
> Date: Tuesday, Mar 31, 2015, 3:47 PM
> To: "Xiangrui Meng" <men...@gmail.com>;
> Cc: "Xiangrui Meng" <m...@databricks.com>; "user" <user@spark.apache.org>;
> "Sean Owen" <so...@cloudera.com>; "GuoQiang Li" <wi...@qq.com>;
> Subject: Re: different result from implicit ALS with explicit ALS
> 
> I have update my spark source code to 1.3.1.
> 
> the checkpoint works well. 
> 
> BUT the shuffle data still could not be deleted automatically… the disk usage 
> is still 30TB…
> 
> I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
> 
> Do you know how to solve my problem?
> 
> Sendong Li
> 
> 
> 
>> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> 
>> setCheckpointInterval was added in the current master and branch-1.3. Please 
>> help check whether it works. It will be included in the 1.3.1 and 1.4.0 
>> release. -Xiangrui
>> 
>> On Mon, Mar 30, 2015 at 7:27 AM, lisendong > <mailto:lisend...@163.com>> wrote:
>> hi, xiangrui:
>> I found the ALS of spark 1.3.0 forget to do checkpoint() in explicit ALS:
>> the code is :
>> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>>  
>> <https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala>
>> 
>> 
>> the checkpoint is very important in my situation, because my task will 
>> produce 1TB shuffle data in each iteration, it the shuffle data is not 
>> deleted in each iteration(using checkpoint()), the task will produce 30TB 
>> data…
>> 
>> 
>> So I change the ALS code, and re-compile by myself, but it seems the 
>> checkpoint does not take effect, and the task still occupies 30TB of disk… (I 
>> only added two lines to ALS.scala):
>> 
>> 
>> 
>> 
>> 
>> and the driver’s log seems strange, why the log is printed together...
>> 
>> 
>> thank you very much!
>> 
>> 
>>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>>> 
>>> Thank you very much for your opinion:)
>>> 
>>> In our case, maybe it 's dangerous to treat un-observed item as negative 
>>> interaction(although we could give them small confidence, I think they are 
>>> still incredible...)
>>> 
>>> I will do more experiments and give you feedback:)
>>> 
>>> Thank you;)
>>> 
>>> 
>>>> On Feb 26, 2015, at 23:16, Sean Owen <so...@cloudera.com> wrote:
>>>> 
>>>> I believe that's right, and is what I was getting at. yes the implicit
>>>> formulation ends up implicitly including every possible interaction in
>>>> its loss function, even unobserved ones. That could be the difference.
>>>> 
>>>> This is mostly an academic question though. In practice, you have
>>>> click-like data and should be using the implicit version for sure.
>>>> 
>>>> However you can give negative implicit feedback to the model. You
>>>> could consider no-click as a mild, observed, negative interaction.
>>>> That is: supply a small negative value for these cases. Unobserved
>>>> pairs are not part of the data set. I'd be careful about assuming the
>>>> lack of an action carries signal.
>>>> 
>>>>> On Thu, Feb 26, 2015 at

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
Thank you, @GuoQiang
I will try to add runGC() to ALS.scala, and if it works for deleting the 
shuffle data, I will tell you :-)
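
(For concreteness, a toy sketch of the kind of change I mean, written spark-shell style with `sc` predefined; this is illustrative only, not the actual ALS.scala edit:)

// Checkpoint periodically to cut lineage, then call the runGC() helper quoted
// below so the ContextCleaner can delete shuffle files that are no longer referenced.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")     // hypothetical path
var data = sc.parallelize(1 to 1000000).map(x => (x % 1000, 1L))
for (iter <- 1 to 30) {
  data = data.reduceByKey(_ + _)    // each iteration adds a shuffle stage
  if (iter % 3 == 0) {              // every few iterations...
    data.checkpoint()               // ...cut the lineage
    data.count()                    // materialize the checkpoint
    runGC()                         // force GC so old shuffle files become eligible for cleanup
  }
}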



> On Mar 31, 2015, at 4:47 PM, GuoQiang Li wrote:
> 
> You can try to enforce garbage collection:
> 
> /** Run GC and make sure it actually has run */
> def runGC() {
>   val weakRef = new WeakReference(new Object())
>   val startTime = System.currentTimeMillis
>   System.gc() // Make a best effort to run the garbage collection. It 
> *usually* runs GC.
>   // Wait until a weak reference object has been GCed
>   System.runFinalization()
>   while (weakRef.get != null) {
> System.gc()
> System.runFinalization()
> Thread.sleep(200)
> if (System.currentTimeMillis - startTime > 1) {
>   throw new Exception("automatically cleanup error")
> }
>   }
> }
> 
> 
> -- Original Message --
> From: "lisendong" <lisend...@163.com>;
> Date: Tuesday, Mar 31, 2015, 3:47 PM
> To: "Xiangrui Meng" <men...@gmail.com>;
> Cc: "Xiangrui Meng" <m...@databricks.com>; "user" <user@spark.apache.org>;
> "Sean Owen" <so...@cloudera.com>; "GuoQiang Li" <wi...@qq.com>;
> Subject: Re: different result from implicit ALS with explicit ALS
> 
> I have update my spark source code to 1.3.1.
> 
> the checkpoint works well. 
> 
> BUT the shuffle data still could not be deleted automatically… the disk usage 
> is still 30TB…
> 
> I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
> 
> Do you know how to solve my problem?
> 
> Sendong Li
> 
> 
> 
>> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> 
>> setCheckpointInterval was added in the current master and branch-1.3. Please 
>> help check whether it works. It will be included in the 1.3.1 and 1.4.0 
>> release. -Xiangrui
>> 
>> On Mon, Mar 30, 2015 at 7:27 AM, lisendong > <mailto:lisend...@163.com>> wrote:
>> hi, xiangrui:
>> I found the ALS of spark 1.3.0 forget to do checkpoint() in explicit ALS:
>> the code is :
>> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>>  
>> <https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala>
>> 
>> 
>> the checkpoint is very important in my situation, because my task will 
>> produce 1TB shuffle data in each iteration, it the shuffle data is not 
>> deleted in each iteration(using checkpoint()), the task will produce 30TB 
>> data…
>> 
>> 
>> So I change the ALS code, and re-compile by myself, but it seems the 
>> checkpoint does not take effect, and the task still occupies 30TB of disk… (I 
>> only added two lines to ALS.scala):
>> 
>> 
>> 
>> 
>> 
>> and the driver’s log seems strange, why the log is printed together...
>> 
>> 
>> thank you very much!
>> 
>> 
>>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>>> 
>>> Thank you very much for your opinion:)
>>> 
>>> In our case, maybe it 's dangerous to treat un-observed item as negative 
>>> interaction(although we could give them small confidence, I think they are 
>>> still incredible...)
>>> 
>>> I will do more experiments and give you feedback:)
>>> 
>>> Thank you;)
>>> 
>>> 
>>>> On Feb 26, 2015, at 23:16, Sean Owen <so...@cloudera.com> wrote:
>>>> 
>>>> I believe that's right, and is what I was getting at. yes the implicit
>>>> formulation ends up implicitly including every possible interaction in
>>>> its loss function, even unobserved ones. That could be the difference.
>>>> 
>>>> This is mostly an academic question though. In practice, you have
>>>> click-like data and should be using the implicit version for sure.
>>>> 
>>>> However you can give negative implicit feedback to the model. You
>>>> could consider no-click as a mild, observed, negative interaction.
>>>> That is: supply a small negative value for these cases. Unobserved
>>>> pairs are not part of the data set. I'd be careful about assuming the
>>>> lack of an action carries signal.
>>>> 
>>>

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
I have updated my spark source code to 1.3.1.

The checkpoint works well.

BUT the shuffle data still cannot be deleted automatically… the disk usage is 
still 30TB…

I have set the spark.cleaner.referenceTracking.blocking.shuffle to true.
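
(Roughly like this, via SparkConf; the exact setup of my job differs, so treat this as an illustrative sketch:)

// Ask the ContextCleaner to block on shuffle cleanup (Spark 1.x property).
val conf = new org.apache.spark.SparkConf()
  .setAppName("als-training")   // app name is illustrative
  .set("spark.cleaner.referenceTracking.blocking.shuffle", "true")
val sc = new org.apache.spark.SparkContext(conf)

// or equivalently on the command line:
//   spark-submit --conf spark.cleaner.referenceTracking.blocking.shuffle=true ...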

Do you know how to solve my problem?

Sendong Li



> On Mar 31, 2015, at 12:11 AM, Xiangrui Meng wrote:
> 
> setCheckpointInterval was added in the current master and branch-1.3. Please 
> help check whether it works. It will be included in the 1.3.1 and 1.4.0 
> release. -Xiangrui
> 
> On Mon, Mar 30, 2015 at 7:27 AM, lisendong  > wrote:
> hi, xiangrui:
> I found the ALS of spark 1.3.0 forget to do checkpoint() in explicit ALS:
> the code is :
> https://github.com/apache/spark/blob/db34690466d67f9c8ac6a145fddb5f7ea30a8d8d/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
>  
> 
> 
> 
> the checkpoint is very important in my situation, because my task will 
> produce 1TB shuffle data in each iteration, it the shuffle data is not 
> deleted in each iteration(using checkpoint()), the task will produce 30TB 
> data…
> 
> 
> So I change the ALS code, and re-compile by myself, but it seems the 
> checkpoint does not take effects, and the task still occupy 30TB disk… ( I 
> only add two lines to the ALS.scala) :
> 
> 
> 
> 
> 
> and the driver’s log seems strange, why the log is printed together...
> 
> 
> thank you very much!
> 
> 
>> On Feb 26, 2015, at 11:33 PM, 163 <lisend...@163.com> wrote:
>> 
>> Thank you very much for your opinion:)
>> 
>> In our case, maybe it 's dangerous to treat un-observed item as negative 
>> interaction(although we could give them small confidence, I think they are 
>> still incredible...)
>> 
>> I will do more experiments and give you feedback:)
>> 
>> Thank you;)
>> 
>> 
>>> On Feb 26, 2015, at 23:16, Sean Owen wrote:
>>> 
>>> I believe that's right, and is what I was getting at. yes the implicit
>>> formulation ends up implicitly including every possible interaction in
>>> its loss function, even unobserved ones. That could be the difference.
>>> 
>>> This is mostly an academic question though. In practice, you have
>>> click-like data and should be using the implicit version for sure.
>>> 
>>> However you can give negative implicit feedback to the model. You
>>> could consider no-click as a mild, observed, negative interaction.
>>> That is: supply a small negative value for these cases. Unobserved
>>> pairs are not part of the data set. I'd be careful about assuming the
>>> lack of an action carries signal.
>>> 
 On Thu, Feb 26, 2015 at 3:07 PM, 163 >>> > wrote:
 oh my god, I think I understood...
 In my case, there are three kinds of user-item pairs:
 
 Display and click pair(positive pair)
 Display but no-click pair(negative pair)
 No-display pair(unobserved pair)
 
 Explicit ALS only consider the first and the second kinds
 But implicit ALS consider all the three kinds of pair(and consider the 
 third
 kind as the second pair, because their preference value are all zero and
 confidence are all 1)
 
 So the result are different. right?
 
 Could you please give me some advice, which ALS should I use?
 If I use the implicit ALS, how to distinguish the second and the third kind
 of pair:)
 
 My opinion is in my case, I should use explicit ALS ...
 
 Thank you so much
 
 On Feb 26, 2015, at 22:41, Xiangrui Meng wrote:
 
 Lisen, did you use all m-by-n pairs during training? Implicit model
 penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
 
> On Feb 26, 2015 6:26 AM, "Sean Owen"  > wrote:
> 
> +user
> 
>> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen > > wrote:
>> 
>> I think I may have it backwards, and that you are correct to keep the 0
>> elements in train() in order to try to reproduce the same result.
>> 
>> The second formulation is called 'weighted regularization' and is used
>> for both implicit and explicit feedback, as far as I can see in the code.
>> 
>> Hm, I'm actually not clear why these would produce different results.
>> Different code paths are used to be sure, but I'm not yet sure why they
>> would give different results.
>> 
>> In general you wouldn't use train() for data like this though, and would
>> never set alpha=0.
>> 
>>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong >> > wrote:
>>> 
>>> I want to confirm the loss function you use (sorry I’m not so familiar
>>> with scala code so I did not understand the source code of mllib)
>>> 
>>> According to the papers :
>>>

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread 163
Thank you very much for your opinion:)

In our case, maybe it's dangerous to treat un-observed items as negative 
interactions (although we could give them a small confidence, I think they are 
still not credible...)

I will do more experiments and give you feedback:)

Thank you;)


> On Feb 26, 2015, at 23:16, Sean Owen wrote:
> 
> I believe that's right, and is what I was getting at. yes the implicit
> formulation ends up implicitly including every possible interaction in
> its loss function, even unobserved ones. That could be the difference.
> 
> This is mostly an academic question though. In practice, you have
> click-like data and should be using the implicit version for sure.
> 
> However you can give negative implicit feedback to the model. You
> could consider no-click as a mild, observed, negative interaction.
> That is: supply a small negative value for these cases. Unobserved
> pairs are not part of the data set. I'd be careful about assuming the
> lack of an action carries signal.
> 
>> On Thu, Feb 26, 2015 at 3:07 PM, 163  wrote:
>> oh my god, I think I understood...
>> In my case, there are three kinds of user-item pairs:
>> 
>> Display and click pair(positive pair)
>> Display but no-click pair(negative pair)
>> No-display pair(unobserved pair)
>> 
>> Explicit ALS only consider the first and the second kinds
>> But implicit ALS consider all the three kinds of pair(and consider the third
>> kind as the second pair, because their preference value are all zero and
>> confidence are all 1)
>> 
>> So the result are different. right?
>> 
>> Could you please give me some advice, which ALS should I use?
>> If I use the implicit ALS, how to distinguish the second and the third kind
>> of pair:)
>> 
>> My opinion is in my case, I should use explicit ALS ...
>> 
>> Thank you so much
>> 
>> On Feb 26, 2015, at 22:41, Xiangrui Meng wrote:
>> 
>> Lisen, did you use all m-by-n pairs during training? Implicit model
>> penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
>> 
>>> On Feb 26, 2015 6:26 AM, "Sean Owen"  wrote:
>>> 
>>> +user
>>> 
 On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen  wrote:
 
 I think I may have it backwards, and that you are correct to keep the 0
 elements in train() in order to try to reproduce the same result.
 
 The second formulation is called 'weighted regularization' and is used
 for both implicit and explicit feedback, as far as I can see in the code.
 
 Hm, I'm actually not clear why these would produce different results.
 Different code paths are used to be sure, but I'm not yet sure why they
 would give different results.
 
 In general you wouldn't use train() for data like this though, and would
 never set alpha=0.
 
> On Thu, Feb 26, 2015 at 2:15 PM, lisendong  wrote:
> 
> I want to confirm the loss function you use (sorry I’m not so familiar
> with scala code so I did not understand the source code of mllib)
> 
> According to the papers :
> 
> 
> in your implicit feedback ALS, the loss function is (ICDM 2008):
> 
> in the explicit feedback ALS, the loss function is (Netflix 2008):
> 
> note that besides the difference of confidence parameter Cui, the
> regularization is also different.  does your code also has this 
> difference?
> 
> Best Regards,
> Sendong Li
> 
> 
>> On Feb 26, 2015, at 9:42 PM, lisendong wrote:
>> 
>> Hi meng, fotero, sowen:
>> 
>> I’m using ALS with spark 1.0.0, the code should be:
>> 
>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>> 
>> I think the following two method should produce the same (or near)
>> result:
>> 
>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01,
>> -1, 1);
>> 
>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30,
>> 30, 0.01, -1, 0, 1);
>> 
>> the data I used is display log, the format of log is as following:
>> 
>> user  item  if-click
>> 
>> 
>> 
>> 
>> 
>> 
>> I use 1.0 as score for click pair, and 0 as score for non-click pair.
>> 
>> in the second method, the alpha is set to zero, so the confidence for
>> positive and negative are both 1.0 (right?)
>> 
>> I think the two method should produce similar result, but the result is
>> :  the second method’s result is very bad (the AUC of the first result is
>> 0.7, but the AUC of the second result is only 0.61)
>> 
>> 
>> I could not understand why, could you help me?
>> 
>> 
>> Thank you very much!
>> 
>> Best Regards,
>> Sendong Li
> 
> 
 
>>> 





Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Sean Owen
I believe that's right, and is what I was getting at. yes the implicit
formulation ends up implicitly including every possible interaction in
its loss function, even unobserved ones. That could be the difference.

This is mostly an academic question though. In practice, you have
click-like data and should be using the implicit version for sure.

However you can give negative implicit feedback to the model. You
could consider no-click as a mild, observed, negative interaction.
That is: supply a small negative value for these cases. Unobserved
pairs are not part of the data set. I'd be careful about assuming the
lack of an action carries signal.
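
(Concretely, something along these lines; just a sketch with a made-up negative value, assuming displayLog is an RDD[(Int, Int, Int)] of (user, item, ifClick) records for displayed pairs only:)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// One record per *displayed* user-item pair, so never-displayed (unobserved)
// pairs are simply absent from the data set.
val ratings = displayLog.map { case (user, item, ifClick) =>
  if (ifClick > 0) Rating(user, item, 1.0)   // display + click: positive interaction
  else Rating(user, item, -0.1)              // display, no click: weak observed negative (value is illustrative)
}

// Same hyperparameters as in the original mail, but with a non-zero alpha:
// rank=30, iterations=30, lambda=0.01, blocks=-1, alpha=1.0, seed=1
val model = ALS.trainImplicit(ratings, 30, 30, 0.01, -1, 1.0, 1)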

On Thu, Feb 26, 2015 at 3:07 PM, 163  wrote:
> oh my god, I think I understood...
> In my case, there are three kinds of user-item pairs:
>
> Display and click pair(positive pair)
> Display but no-click pair(negative pair)
> No-display pair(unobserved pair)
>
> Explicit ALS only consider the first and the second kinds
> But implicit ALS consider all the three kinds of pair(and consider the third
> kind as the second pair, because their preference value are all zero and
> confidence are all 1)
>
> So the result are different. right?
>
> Could you please give me some advice, which ALS should I use?
> If I use the implicit ALS, how to distinguish the second and the third kind
> of pair:)
>
> My opinion is in my case, I should use explicit ALS ...
>
> Thank you so much
>
> On Feb 26, 2015, at 22:41, Xiangrui Meng wrote:
>
> Lisen, did you use all m-by-n pairs during training? Implicit model
> penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
>
> On Feb 26, 2015 6:26 AM, "Sean Owen"  wrote:
>>
>> +user
>>
>> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen  wrote:
>>>
>>> I think I may have it backwards, and that you are correct to keep the 0
>>> elements in train() in order to try to reproduce the same result.
>>>
>>> The second formulation is called 'weighted regularization' and is used
>>> for both implicit and explicit feedback, as far as I can see in the code.
>>>
>>> Hm, I'm actually not clear why these would produce different results.
>>> Different code paths are used to be sure, but I'm not yet sure why they
>>> would give different results.
>>>
>>> In general you wouldn't use train() for data like this though, and would
>>> never set alpha=0.
>>>
>>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong  wrote:

 I want to confirm the loss function you use (sorry I’m not so familiar
 with scala code so I did not understand the source code of mllib)

 According to the papers :


 in your implicit feedback ALS, the loss function is (ICDM 2008):

 in the explicit feedback ALS, the loss function is (Netflix 2008):

 note that besides the difference of confidence parameter Cui, the
 regularization is also different.  does your code also has this difference?

 Best Regards,
 Sendong Li


> On Feb 26, 2015, at 9:42 PM, lisendong wrote:
>
> Hi meng, fotero, sowen:
>
> I’m using ALS with spark 1.0.0, the code should be:
>
> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>
> I think the following two method should produce the same (or near)
> result:
>
> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01,
> -1, 1);
>
> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30,
> 30, 0.01, -1, 0, 1);
>
> the data I used is display log, the format of log is as following:
>
> user  item  if-click
>
>
>
>
>
>
> I use 1.0 as score for click pair, and 0 as score for non-click pair.
>
>  in the second method, the alpha is set to zero, so the confidence for
> positive and negative are both 1.0 (right?)
>
> I think the two method should produce similar result, but the result is
> :  the second method’s result is very bad (the AUC of the first result is
> 0.7, but the AUC of the second result is only 0.61)
>
>
> I could not understand why, could you help me?
>
>
> Thank you very much!
>
> Best Regards,
> Sendong Li


>>>
>>




Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread 163
Oh my god, I think I understood...
In my case, there are three kinds of user-item pairs:

Display and click pair (positive pair)
Display but no-click pair (negative pair)
No-display pair (unobserved pair)

Explicit ALS only considers the first and the second kinds,
but implicit ALS considers all three kinds of pairs (and treats the third kind 
like the second, because their preference values are all zero and their 
confidences are all 1).

So the results are different, right?

Could you please give me some advice on which ALS I should use?
If I use the implicit ALS, how do I distinguish the second and the third kinds of 
pairs? :)

My opinion is that in my case, I should use explicit ALS ...

Thank you so much

> On Feb 26, 2015, at 22:41, Xiangrui Meng wrote:
> 
> Lisen, did you use all m-by-n pairs during training? Implicit model penalizes 
> unobserved ratings, while explicit model doesn't. -Xiangrui
> 
> On Feb 26, 2015 6:26 AM, "Sean Owen"  wrote:
> >
> > +user
> >
> > On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen  wrote:
> >>
> >> I think I may have it backwards, and that you are correct to keep the 0 
> >> elements in train() in order to try to reproduce the same result.
> >>
> >> The second formulation is called 'weighted regularization' and is used for 
> >> both implicit and explicit feedback, as far as I can see in the code.
> >>
> >> Hm, I'm actually not clear why these would produce different results. 
> >> Different code paths are used to be sure, but I'm not yet sure why they 
> >> would give different results.
> >>
> >> In general you wouldn't use train() for data like this though, and would 
> >> never set alpha=0.
> >>
> >> On Thu, Feb 26, 2015 at 2:15 PM, lisendong  wrote:
> >>>
> >>> I want to confirm the loss function you use (sorry I’m not so familiar 
> >>> with scala code so I did not understand the source code of mllib)
> >>>
> >>> According to the papers :
> >>>
> >>>
> >>> in your implicit feedback ALS, the loss function is (ICDM 2008):
> >>>
> >>> in the explicit feedback ALS, the loss function is (Netflix 2008):
> >>>
> >>> note that besides the difference of confidence parameter Cui, the 
> >>> regularization is also different.  does your code also has this 
> >>> difference?
> >>>
> >>> Best Regards,
> >>> Sendong Li
> >>>
> >>>
>  On Feb 26, 2015, at 9:42 PM, lisendong wrote:
> 
>  Hi meng, fotero, sowen:
> 
>  I’m using ALS with spark 1.0.0, the code should be:
>  https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
> 
>  I think the following two method should produce the same (or near) 
>  result:
> 
>  MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, 
>  -1, 1);
> 
>  MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 
>  30, 0.01, -1, 0, 1);
> 
>  the data I used is display log, the format of log is as following:
> 
>  user  item  if-click
> 
> 
> 
> 
> 
> 
>  I use 1.0 as score for click pair, and 0 as score for non-click pair.
> 
>   in the second method, the alpha is set to zero, so the confidence for 
>  positive and negative are both 1.0 (right?)
> 
>  I think the two method should produce similar result, but the result is 
>  :  the second method’s result is very bad (the AUC of the first result 
>  is 0.7, but the AUC of the second result is only 0.61)
> 
> 
>  I could not understand why, could you help me?
> 
> 
>  Thank you very much!
> 
>  Best Regards, 
>  Sendong Li
> >>>
> >>>
> >>
> >


Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Xiangrui Meng
Lisen, did you use all m-by-n pairs during training? Implicit model
penalizes unobserved ratings, while explicit model doesn't. -Xiangrui
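
(For reference, the two loss functions in question, written out in the standard forms from the papers cited earlier in the thread; the notation here is mine:)

% Implicit-feedback ALS (Hu, Koren & Volinsky, ICDM 2008): the sum runs over
% *all* user-item pairs, with confidence c_{ui} = 1 + \alpha r_{ui} and
% preference p_{ui} = 1 if r_{ui} > 0, else 0.
\min_{X,Y} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr)

% Explicit-feedback ALS with weighted-lambda regularization (Zhou et al., 2008):
% the sum runs only over *observed* ratings; n_u and n_i count the ratings of
% user u and item i.
\min_{X,Y} \; \sum_{(u,i)\,\mathrm{observed}} \bigl(r_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2\Bigr)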

On Feb 26, 2015 6:26 AM, "Sean Owen"  wrote:
>
> +user
>
> On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen  wrote:
>>
>> I think I may have it backwards, and that you are correct to keep the 0
elements in train() in order to try to reproduce the same result.
>>
>> The second formulation is called 'weighted regularization' and is used
for both implicit and explicit feedback, as far as I can see in the code.
>>
>> Hm, I'm actually not clear why these would produce different results.
Different code paths are used to be sure, but I'm not yet sure why they
would give different results.
>>
>> In general you wouldn't use train() for data like this though, and would
never set alpha=0.
>>
>> On Thu, Feb 26, 2015 at 2:15 PM, lisendong  wrote:
>>>
>>> I want to confirm the loss function you use (sorry I’m not so familiar
with scala code so I did not understand the source code of mllib)
>>>
>>> According to the papers :
>>>
>>>
>>> in your implicit feedback ALS, the loss function is (ICDM 2008):
>>>
>>> in the explicit feedback ALS, the loss function is (Netflix 2008):
>>>
>>> note that besides the difference of confidence parameter Cui,
the regularization is also different.  does your code also has this
difference?
>>>
>>> Best Regards,
>>> Sendong Li
>>>
>>>
 On Feb 26, 2015, at 9:42 PM, lisendong wrote:

 Hi meng, fotero, sowen:

 I’m using ALS with spark 1.0.0, the code should be:

https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

 I think the following two method should produce the same (or near)
result:

 MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30,
0.01, -1, 1);

 MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30,
30, 0.01, -1, 0, 1);

 the data I used is display log, the format of log is as following:

 user  item  if-click






 I use 1.0 as score for click pair, and 0 as score for non-click pair.

  in the second method, the alpha is set to zero, so the confidence for
positive and negative are both 1.0 (right?)

 I think the two method should produce similar result, but the result
is :  the second method’s result is very bad (the AUC of the first result
is 0.7, but the AUC of the second result is only 0.61)


 I could not understand why, could you help me?


 Thank you very much!

 Best Regards,
 Sendong Li
>>>
>>>
>>
>


Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread Sean Owen
+user

On Thu, Feb 26, 2015 at 2:26 PM, Sean Owen  wrote:

> I think I may have it backwards, and that you are correct to keep the 0
> elements in train() in order to try to reproduce the same result.
>
> The second formulation is called 'weighted regularization' and is used for
> both implicit and explicit feedback, as far as I can see in the code.
>
> Hm, I'm actually not clear why these would produce different results.
> Different code paths are used to be sure, but I'm not yet sure why they
> would give different results.
>
> In general you wouldn't use train() for data like this though, and would
> never set alpha=0.
>
> On Thu, Feb 26, 2015 at 2:15 PM, lisendong  wrote:
>
>> I want to confirm the loss function you use (sorry I’m not so familiar
>> with scala code so I did not understand the source code of mllib)
>>
>> According to the papers :
>>
>>
>> in your implicit feedback ALS, the loss function is (ICDM 2008):
>>
>> in the explicit feedback ALS, the loss function is (Netflix 2008):
>>
>> note that besides the difference of confidence parameter Cui, the 
>> regularization
>> is also different.  does your code also has this difference?
>>
>> Best Regards,
>> Sendong Li
>>
>>
>> On Feb 26, 2015, at 9:42 PM, lisendong wrote:
>>
>> Hi meng, fotero, sowen:
>>
>> I’m using ALS with spark 1.0.0, the code should be:
>>
>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>>
>> I think the following two method should produce the same (or near) result:
>>
>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1, 
>> 1);
>>
>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30, 
>> 0.01, -1, 0, 1);
>>
>> the data I used is display log, the format of log is as following:
>>
>> user  item  if-click
>>
>>
>>
>>
>>
>>
>> I use 1.0 as score for click pair, and 0 as score for non-click pair.
>>
>>  in the second method, the alpha is set to zero, so the confidence for
>> positive and negative are both 1.0 (right?)
>>
>> I think the two method should produce similar result, but the result is :
>>  the second method’s result is very bad (the AUC of the first result is
>> 0.7, but the AUC of the second result is only 0.61)
>>
>>
>> I could not understand why, could you help me?
>>
>>
>> Thank you very much!
>>
>> Best Regards,
>> Sendong Li
>>
>>
>>
>


different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong

I’m using ALS with spark 1.0.0, the code should be:
https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala

I think the following two methods should produce the same (or nearly the same) result:

MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1,
1);
MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30,
0.01, -1, 0, 1);


The data I used is a display log; the format of the log is as follows:

user  item  if-click


I use 1.0 as the score for click pairs, and 0 as the score for non-click pairs.

In the second method, alpha is set to zero, so the confidences for positive
and negative pairs are both 1.0 (right?)
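(Since the confidence is c_ui = 1 + alpha * r_ui, setting alpha = 0 gives
c_ui = 1 + 0 * r_ui = 1 for every pair, clicked or not.)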

I think the two methods should produce similar results, but they don't: 
the second method’s result is very bad (the AUC of the first result is 0.7,
but the AUC of the second result is only 0.61).


I could not understand why, could you help me?


Thank you very much!








Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
Okay, I have brought this to the user@ list.

I don’t think the negative pairs should be omitted…


If the scores of all of the pairs are 1.0, the result will be worse… I have tried…


Best Regards, 
Sendong Li
> On Feb 26, 2015, at 10:07 PM, Sean Owen wrote:
> 
> Yes, I mean, do not generate a Rating for these data points. What then?
> 
> Also would you care to bring this to the user@ list? it's kind of interesting.
> 
> On Thu, Feb 26, 2015 at 2:02 PM, lisendong  wrote:
>> I set the score of a ‘0’-interaction user-item pair to 0.0;
>> the code is as follows:
>> 
>> if (ifclick > 0) {
>>     score = 1.0;
>> } else {
>>     score = 0.0;
>> }
>> return new Rating(user_id, photo_id, score);
>> 
>> both method use the same ratings rdd
>> 
>> because of the same random seed(1 in my case), the result is stable.
>> 
>> 
>> Best Regards,
>> Sendong Li
>> 
>> 
>> On Feb 26, 2015, at 9:53 PM, Sean Owen wrote:
>> 
>> 
>> I see why you say that, yes.
>> 
>> Are you actually encoding the '0' interactions, or just omitting them?
>> I think you should do the latter.
>> 
>> Is the AUC stable over many runs or did you just run once?
>> 
>> On Thu, Feb 26, 2015 at 1:42 PM, lisendong  wrote:
>> 
>> Hi meng, fotero, sowen:
>> 
>> I’m using ALS with spark 1.0.0, the code should be:
>> https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
>> 
>> I think the following two method should produce the same (or near) result:
>> 
>> MatrixFactorizationModel model = ALS.train(ratings.rdd(), 30, 30, 0.01, -1,
>> 1);
>> 
>> MatrixFactorizationModel model = ALS.trainImplicit(ratings.rdd(), 30, 30,
>> 0.01, -1, 0, 1);
>> 
>> the data I used is display log, the format of log is as following:
>> 
>> user  item  if-click
>> 
>> 
>> 
>> 
>> 
>> 
>> I use 1.0 as score for click pair, and 0 as score for non-click pair.
>> 
>> in the second method, the alpha is set to zero, so the confidence for
>> positive and negative are both 1.0 (right?)
>> 
>> I think the two method should produce similar result, but the result is :
>> the second method’s result is very bad (the AUC of the first result is 0.7,
>> but the AUC of the second result is only 0.61)
>> 
>> 
>> I could not understand why, could you help me?
>> 
>> 
>> Thank you very much!
>> 
>> Best Regards,
>> Sendong Li
>> 
>> 


