[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
This kind of behavior can often happen, and LIBFFM's early stopping 
strategy is too aggressive.

```
iter   tr_logloss   va_logloss
   7  0.43239  0.46952
   8  0.42362  0.46999
   9  0.41394  0.45088
```


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-28 Thread takuti
Github user takuti commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
Makes sense as a compromise in terms of memory consumption. 

I'll add a note to the documentation clarifying that our `-early_stopping` 
option does not necessarily return the best model; users would expect the 
option to return the best model, achieved at the 7th iteration, but our UDF 
does not behave that way, as discussed above.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-28 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
```
iter   tr_logloss   va_logloss
   1  0.49738  0.48776
   2  0.47383  0.47995
   3  0.46366  0.47480
   4  0.45561  0.47231
   5  0.44810  0.47034
   6  0.44037  0.47003
   7  0.43239  0.46952
   8  0.42362  0.46999 <- LIBFFM stops once va_logloss increases,
                          but va_logloss might decrease in the next iteration
   9  0.41394  0.47088
```

In the 8th iteration, the trainer becomes `ready to stop` because va_logloss 
increased. 
If va_logloss decreases in the 9th iteration, then iteration continues (the 
ready-to-stop flag is cleared).
If va_logloss increases again in the 9th iteration, then the current model of 
the 9th iteration is emitted.
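
A minimal sketch of this flag-based logic (hypothetical class and method 
names, not the actual Hivemall implementation):

```java
// Grace-period early stopping as described above: stop only after two
// consecutive increases of the validation loss.
public final class GracePeriodEarlyStopping {

    private double prevLoss = Double.POSITIVE_INFINITY;
    private boolean readyToStop = false;

    /** Returns true when training should terminate after this iteration. */
    public boolean shouldStop(final double vaLogloss) {
        if (vaLogloss >= prevLoss) {
            if (readyToStop) {
                return true; // second consecutive increase -> emit the current model
            }
            readyToStop = true; // first increase -> arm the flag, keep iterating
        } else {
            readyToStop = false; // loss decreased again -> disarm the flag
        }
        prevLoss = vaLogloss;
        return false;
    }
}
```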


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-23 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
It might be better to reconsider `eta0` when enabling `l2norm` by default and 
enlarging `max_init_value`. In my experience with FM, the random init size 
should be small when the average feature dimension is large (gradients will be 
large).

I think `1.0` is too aggressive as the default, though. `0.2` or `0.5`? 
Better to research other implementations.
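
For illustration, a sketch of a capped uniform random init (assuming 
`-max_init_value` works roughly this way; the helper below is hypothetical, 
not Hivemall's actual initializer):

```java
import java.util.Random;

// Hypothetical initializer: latent factors drawn uniformly from [0, maxInitValue].
// With many active features per row, the pairwise term accumulates many factor
// products, so a large maxInitValue inflates the early gradients; a smaller range
// (e.g., 0.2 to 0.5) may be safer for high-dimensional instances.
public final class FactorInit {

    public static float[] randomFactors(final int factors, final float maxInitValue,
            final Random rnd) {
        final float[] v = new float[factors];
        for (int f = 0; f < factors; f++) {
            v[f] = rnd.nextFloat() * maxInitValue;
        }
        return v;
    }
}
```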


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti So then, it's better to enable L2 normalization by default and provide 
`-disable_l2norm` to disable it. My concern is that L2 normalization performed 
worse for small datasets with an adequate learning rate in `[0.1, 1.0]`. 

FieldAwareFactorizationMachineUDTFTest contains several tests. It's better to 
verify that accuracy does not degrade with the new default options, i.e., with 
L2 normalization enabled.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread takuti
Github user takuti commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
I'll change the default options and consider implementing an early stopping 
option as you suggested.

> What happens without `-l2norm` ?

Once we drop instance-wise L2 normalization, the model easily overfits to the 
training samples, and prediction accuracy becomes dramatically worse.
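
For reference, a sketch of instance-wise L2 normalization as I understand the 
`-l2norm` semantics (the helper below is illustrative, not the actual 
implementation):

```java
// Instance-wise L2 normalization: rescale each feature vector to unit L2 norm
// before the update, which bounds per-instance gradient magnitudes.
public final class L2Norm {

    public static void normalizeInPlace(final float[] values) {
        double sumSq = 0.0;
        for (final float v : values) {
            sumSq += v * v;
        }
        if (sumSq == 0.0) {
            return; // all-zero instance; nothing to scale
        }
        final float invNorm = (float) (1.0 / Math.sqrt(sumSq));
        for (int i = 0; i < values.length; i++) {
            values[i] *= invNorm;
        }
    }
}
```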

**LIBFFM**:

```
$ ./ffm-train -k 4 -t 15 -l 0.2 -r 0.2 -s 1 --no-norm ../tr.sp model
First check if the text file has already been converted to binary format 
(0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.0 seconds)
iter   tr_logloss  tr_time
   1  4.24374  0.0
   2  0.53960  0.1
   3  0.09525  0.2
   4  0.01288  0.2
   5  0.00215  0.3
   6  0.00133  0.3
   7  0.00112  0.3
   8  0.00098  0.4
   9  0.00089  0.4
  10  0.00082  0.5
  11  0.00076  0.5
  12  0.00072  0.6
  13  0.00068  0.6
  14  0.00064  0.6
  15  0.00061  0.7
$ ./ffm-predict ../va.sp model submission.csv
logloss = 1.75623
```

**Hivemall**:

```
Iteration #2 | average loss=0.5186307939402891, current cumulative 
loss=823.0670699832388, previous cumulative loss=6640.3299608989755, change 
rate=0.876050275388452, #trainingExamples=1587
Iteration #3 | average loss=0.06870252595245425, current cumulative 
loss=109.0309086865449, previous cumulative loss=823.0670699832388, change 
rate=0.8675309550547743, #trainingExamples=1587
Iteration #4 | average loss=0.01701292407900819, current cumulative 
loss=26.999510513386, previous cumulative loss=109.0309086865449, change 
rate=0.7523682886014696, #trainingExamples=1587
Iteration #5 | average loss=0.003132377872105223, current cumulative 
loss=4.971083683030989, previous cumulative loss=26.999510513386, change 
rate=0.8158824516256917, #trainingExamples=1587
Iteration #6 | average loss=0.001693780516846469, current cumulative 
loss=2.6880296802353465, previous cumulative loss=4.971083683030989, change 
rate=0.4592668617888987, #trainingExamples=1587
Iteration #7 | average loss=0.0013357168592237345, current cumulative 
loss=2.1197826555880668, previous cumulative loss=2.6880296802353465, change 
rate=0.21139908864307172, #trainingExamples=1587
Iteration #8 | average loss=0.0011459013923848537, current cumulative 
loss=1.8185455097147627, previous cumulative loss=2.1197826555880668, change 
rate=0.1421075623386188, #trainingExamples=1587
Iteration #9 | average loss=0.001017751388111345, current cumulative 
loss=1.6151714529327046, previous cumulative loss=1.8185455097147627, change 
rate=0.11183336116452601, #trainingExamples=1587
Iteration #10 | average loss=9.230266490923267E-4, current cumulative 
loss=1.4648432921095225, previous cumulative loss=1.6151714529327046, change 
rate=0.0930725716766649, #trainingExamples=1587
Iteration #11 | average loss=8.493080071393429E-4, current cumulative 
loss=1.3478518073301373, previous cumulative loss=1.4648432921095225, change 
rate=0.07986621190783184, #trainingExamples=1587
Iteration #12 | average loss=7.898623710141035E-4, current cumulative 
loss=1.2535115827993821, previous cumulative loss=1.3478518073301373, change 
rate=0.0699930244687856, #trainingExamples=1587
Iteration #13 | average loss=7.406521210973545E-4, current cumulative 
loss=1.1754149161815017, previous cumulative loss=1.2535115827993821, change 
rate=0.06230230951952787, #trainingExamples=1587
Iteration #14 | average loss=6.990685420175246E-4, current cumulative 
loss=1.1094217761818115, previous cumulative loss=1.1754149161815017, change 
rate=0.056144548696113294, #trainingExamples=1587
Iteration #15 | average loss=6.633493164996776E-4, current cumulative 
loss=1.0527353652849885, previous cumulative loss=1.1094217761818115, change 
rate=0.051095455410939475, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 
training updates in total)
```

```
LogLoss: 1.8970086009757248
```


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
Also, it's better to revise the default `-iters` from 1 to 10 (at least 10 
iterations when early stopping is enabled).


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
BTW, it might be better to implement `early stopping` using validation data.
https://github.com/guestwalk/libffm

We can use an approach similar to the `_validationRatio` scheme used in 
`FactorizationMachineUDTF` instead of preparing a separate validation dataset.
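
A rough sketch of that idea (all names below are illustrative assumptions, not 
actual Hivemall code): route a fraction of the incoming rows to a held-out 
buffer and compute va_logloss on it after each iteration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical validation-ratio splitter: each incoming row is held out with
// probability validationRatio, so no separate validation dataset is required.
public final class ValidationSplitter<T> {

    private final double validationRatio;
    private final Random rnd;
    private final List<T> heldOut = new ArrayList<>();

    public ValidationSplitter(final double validationRatio, final long seed) {
        this.validationRatio = validationRatio;
        this.rnd = new Random(seed);
    }

    /** Returns true if the row goes to training; otherwise it is buffered. */
    public boolean acceptForTraining(final T row) {
        if (rnd.nextDouble() < validationRatio) {
            heldOut.add(row);
            return false;
        }
        return true;
    }

    /** Held-out rows used to compute va_logloss after each iteration. */
    public List<T> getValidationRows() {
        return heldOut;
    }
}
```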


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti Thank you for the detailed verification. Let's disable the linear term 
by default.

Remove `-disable_wi`, and add `-enable_wi` (alias `-linear_term`) to enable 
the linear term.

I'm not sure whether `-l2norm` should be enabled by default. What happens 
without `-l2norm`?


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-22 Thread takuti
Github user takuti commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
### With linear terms

#### Hivemall

```sql
INSERT OVERWRITE TABLE criteo.ffm_model
SELECT
  train_ffm(features, label, '-init_v random -max_init_value 0.5 
-classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer adagrad 
-lambda 0.2 -cv_rate 0.0')
FROM (
  SELECT
features, label
  FROM
criteo.train_vectorized
  CLUSTER BY rand(1)
) t
;
```

```
Iteration #2 | average loss=0.474651712453725, current cumulative 
loss=753.2722676640616, previous cumulative loss=990.2550021169766, change 
rate=0.23931485722999737, #trainingExamples=1587
Iteration #3 | average loss=0.4499051385165006, current cumulative 
loss=713.9994548256865, previous cumulative loss=753.2722676640616, change 
rate=0.05213627863954456, #trainingExamples=1587
Iteration #4 | average loss=0.4342257595710771, current cumulative 
loss=689.1162804392994, previous cumulative loss=713.9994548256865, change 
rate=0.03485041090467212, #trainingExamples=1587
Iteration #5 | average loss=0.4225120903723549, current cumulative 
loss=670.5266874209271, previous cumulative loss=689.1162804392994, change 
rate=0.026975988735198287, #trainingExamples=1587
Iteration #6 | average loss=0.41300825971798527, current cumulative 
loss=655.4441081724426, previous cumulative loss=670.5266874209271, change 
rate=0.022493630054453533, #trainingExamples=1587
Iteration #7 | average loss=0.40491514701335013, current cumulative 
loss=642.6003383101867, previous cumulative loss=655.4441081724426, change 
rate=0.019595522641995967, #trainingExamples=1587
Iteration #8 | average loss=0.3978014571916465, current cumulative 
loss=631.310912563143, previous cumulative loss=642.6003383101867, change 
rate=0.017568347033135524, #trainingExamples=1587
Iteration #9 | average loss=0.3914067263636397, current cumulative 
loss=621.1624747390962, previous cumulative loss=631.310912563143, change 
rate=0.016075182009517044, #trainingExamples=1587
Iteration #10 | average loss=0.3855609819906249, current cumulative 
loss=611.8852784191217, previous cumulative loss=621.1624747390962, change 
rate=0.014935216947661086, #trainingExamples=1587
Iteration #11 | average loss=0.3801467153362753, current cumulative 
loss=603.2928372386689, previous cumulative loss=611.8852784191217, change 
rate=0.01404256889894858, #trainingExamples=1587
Iteration #12 | average loss=0.3750791243746283, current cumulative 
loss=595.2505703825351, previous cumulative loss=603.2928372386689, change 
rate=0.01333061883005943, #trainingExamples=1587
Iteration #13 | average loss=0.37029474458756273, current cumulative 
loss=587.657759660462, previous cumulative loss=595.2505703825351, change 
rate=0.012755654676976761, #trainingExamples=1587
Iteration #14 | average loss=0.36574472099268607, current cumulative 
loss=580.4368722153928, previous cumulative loss=587.657759660462, change 
rate=0.012287572700888608, #trainingExamples=1587
Iteration #15 | average loss=0.3613904840032808, current cumulative 
loss=573.5266981132066, previous cumulative loss=580.4368722153928, change 
rate=0.011905126005885216, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 
training updates in total)
```
> LogLoss: 0.4771035166468042

#### LIBFFM

```
$ ./ffm-train -k 4 -t 15 -l 0.2 -r 0.2 -s 1 ../tr.sp model
First check if the text file has already been converted to binary format 
(0.0 seconds)
Binary file NOT found. Convert text file to binary file (0.0 seconds)
iter   tr_logloss  tr_time
   1  0.62043  0.0
   2  0.47533  0.1
   3  0.44968  0.1
   4  0.43548  0.2
   5  0.42261  0.2
   6  0.41322  0.3
   7  0.40489  0.3
   8  0.39687  0.4
   9  0.39085  0.4
  10  0.38530  0.4
  11  0.37965  0.5
  12  0.37450  0.5
  13  0.36937  0.6
  14  0.36444  0.6
  15  0.36031  0.7
$ ./ffm-predict ../va.sp model submission.csv
logloss = 0.47818
```

### Without linear terms (i.e., adding `-disable_wi` option)

#### Hivemall

```
Iteration #2 | average loss=0.539961924393562, current cumulative 
loss=856.919574012583, previous cumulative loss=1651.6985545424677, change 
rate=0.4811179934516, #trainingExamples=1587
Iteration #3 | average loss=0.5106114115327627, current cumulative 
loss=810.3403101024943, previous cumulative loss=856.919574012583, change 
rate=0.05435663430113771, #trainingExamples=1587
Iteration #4 | average loss=0.4906722901321148, current cumulative 
loss=778.6969244396662, previous cumulative 
```


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-17 Thread myui
Github user myui commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
@takuti I advise checking 2-3 updates to investigate how the gradient updates 
differ.


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-17 Thread takuti
Github user takuti commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
Note: I've extended the LIBFFM code so that it uses linear terms: 
https://github.com/takuti/criteo-ffm/commit/9aca61d93ed8f583025729206ed0dbfd54806a44
However, I cannot observe a significant difference between the LogLoss 
achieved with and without linear terms.
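
For context, the FFM prediction with FM-style linear terms added looks like 
this (a general form; the exact formulation in the linked commit may differ):

```
\hat{y}(\mathbf{x}) = w_0 + \sum_{i} w_i x_i
    + \sum_{i} \sum_{j > i} \langle \mathbf{v}_{i, f_j}, \mathbf{v}_{j, f_i} \rangle x_i x_j
```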


---


[GitHub] incubator-hivemall issue #149: [WIP][HIVEMALL-201] Evaluate, fix and documen...

2018-05-17 Thread takuti
Github user takuti commented on the issue:

https://github.com/apache/incubator-hivemall/pull/149
  
Evaluation has been conducted at 
[takuti/criteo-ffm](https://github.com/takuti/criteo-ffm). See the repository 
for details.

As an example, I have used the tiny dataset provided at 
[guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo), 
which is already preprocessed and converted into the LIBFFM format:

- Split the 2,000 samples in `train.tiny.csv` into:
  - 1,587 training samples `tr.sp`
  - 412 validation samples `va.sp`

As a result, the FFM models created by LIBFFM and Hivemall with the following 
(nearly identical) configurations showed very similar training loss and 
accuracy:

**LIBFFM**:

```
$ ./ffm-train -k 4 -t 15 -l 0.2 -r 0.2 -s 10 ../tr.sp model
iter   tr_logloss  tr_time
   1  1.04980  0.0
   2  0.53771  0.0
   3  0.50963  0.0
   4  0.48980  0.1
   5  0.47469  0.1
   6  0.46304  0.1
   7  0.45289  0.1
   8  0.44400  0.1
   9  0.43653  0.1
  10  0.42947  0.1
  11  0.42330  0.1
  12  0.41727  0.1
  13  0.41130  0.1
  14  0.40558  0.1
  15  0.40036  0.1
```

> LogLoss on validation set `va.sp`: 0.47237

**Hivemall**:

```
$ hive --hiveconf hive.root.logger=INFO,console
hive> INSERT OVERWRITE TABLE criteo.ffm_model
> SELECT
>   train_ffm(features, label, '-init_v random -max_init_value 1.0 
-classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd 
-lambda 0.2 -cv_rate 0.0 -disable_wi')
> FROM (
>   SELECT
> features, label
>   FROM
> criteo.train_vectorized
>   CLUSTER BY rand(1)
> ) t
> ;
Record training examples to a file: 
/var/folders/rg/6mhvj7h567x_ys7brmf2bb6wgn/T/hivemall_fm6211397472147242886.sgmt
Iteration #2 | average loss=0.5316043797079182, current cumulative 
loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change 
rate=0.30539895232450376, #trainingExamples=1587
Iteration #3 | average loss=0.5065999656968238, current cumulative 
loss=803.9741455608594, previous cumulative loss=843.6561505964662, change 
rate=0.04703575622313853, #trainingExamples=1587
Iteration #4 | average loss=0.49634490612175397, current cumulative 
loss=787.6993660152235, previous cumulative loss=803.9741455608594, change 
rate=0.0202429140731664, #trainingExamples=1587
Iteration #5 | average loss=0.48804954980765963, current cumulative 
loss=774.5346355447558, previous cumulative loss=787.6993660152235, change 
rate=0.0167128869698916, #trainingExamples=1587
Iteration #6 | average loss=0.48072518575956447, current cumulative 
loss=762.9108698004288, previous cumulative loss=774.5346355447558, change 
rate=0.015007418920848658, #trainingExamples=1587
Iteration #7 | average loss=0.47402279755334875, current cumulative 
loss=752.2741797171644, previous cumulative loss=762.9108698004288, change 
rate=0.01394224476803, #trainingExamples=1587
Iteration #8 | average loss=0.4677507471836629, current cumulative 
loss=742.320435780473, previous cumulative loss=752.2741797171644, change 
rate=0.013231537390308698, #trainingExamples=1587
Iteration #9 | average loss=0.4618142861358177, current cumulative 
loss=732.8992720975427, previous cumulative loss=742.320435780473, change 
rate=0.012691505216375798, #trainingExamples=1587
Iteration #10 | average loss=0.4561878517855827, current cumulative 
loss=723.9701207837197, previous cumulative loss=732.8992720975427, change 
rate=0.012183326759580433, #trainingExamples=1587
Iteration #11 | average loss=0.45087834343992406, current cumulative 
loss=715.5439310391595, previous cumulative loss=723.9701207837197, change 
rate=0.01163886395675921, #trainingExamples=1587
Iteration #12 | average loss=0.4458864402438874, current cumulative 
loss=707.6217806670493, previous cumulative loss=715.5439310391595, change 
rate=0.011071508021324606, #trainingExamples=1587
Iteration #13 | average loss=0.44118468270053807, current cumulative 
loss=700.1600914457539, previous cumulative loss=707.6217806670493, change 
rate=0.010544742156271002, #trainingExamples=1587
Iteration #14 | average loss=0.4367191822212713, current cumulative 
loss=693.0733421851576, previous cumulative loss=700.1600914457539, change 
rate=0.01012161268141256, #trainingExamples=1587
Iteration #15 | average loss=0.4324248854220929, current cumulative 
loss=686.2582931648615, previous cumulative loss=693.0733421851576, change 
rate=0.009833084906727563, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 
training updates in total)
```