Hi All,


I’m Lijie, and I am currently running some experiments on MADlib. I found
that MADlib runs very slowly on some datasets, so I would like to verify my
settings. Could you help me check the following settings and code? Sorry
for the long email. I am using the latest MADlib 1.18 on PostgreSQL 12.



*(1) Could you help check whether the data format and scripts I used are
right for an n-dimensional dataset?*



I have several training datasets, each consisting of a dense feature array
(like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the
'forest' dataset (581K tuples) with a 54-dimensional feature array and a
class label, I first loaded it into PostgreSQL using:



<code>

     CREATE TABLE forest (
          did serial,
          vec double precision[],
          labeli integer);

     -- csv format, since the array literal itself contains commas
     COPY forest (vec, labeli) FROM STDIN WITH (FORMAT csv);
"{0.1, 0.2, …, 1.0}",-1
"{0.3, 0.1, …, 0.9}",1
…
\.

</code>
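
After loading, I ran a few quick sanity checks of my own (plain SQL, not
from the MADlib docs) to confirm the data came in as expected:

<code>

SELECT count(*) FROM forest;                      -- expect ~581K rows
SELECT array_length(vec, 1) FROM forest LIMIT 1;  -- expect 54 dimensions
SELECT DISTINCT labeli FROM forest;               -- expect -1 and 1

</code>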





Then, to run Logistic Regression on this dataset, I use the following code:



<code>

mldb=# \d forest
                               Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',                                 -- source table
mldb(#     'forest_logregr_out',                     -- output table
mldb(#     'labeli',                                 -- labels
mldb(#     'vec',                                    -- features
mldb(#     NULL,                                     -- grouping columns
mldb(#     20,                                       -- max number of iterations
mldb(#     'igd'                                     -- optimizer
mldb(#     );

Time: 198911.350 ms

</code>



After about 199 seconds, I got the following output table:

<code>

mldb=# \d forest_logregr_out
             Table "public.forest_logregr_out"
          Column          |        Type        | Modifiers
--------------------------+--------------------+-----------
 coef                     | double precision[] |
 log_likelihood           | double precision   |
 std_err                  | double precision[] |
 z_stats                  | double precision[] |
 p_values                 | double precision[] |
 odds_ratios              | double precision[] |
 condition_no             | double precision   |
 num_rows_processed       | bigint             |
 num_missing_rows_skipped | bigint             |
 num_iterations           | integer            |
 variance_covariance      | double precision[] |

mldb=# select log_likelihood from forest_logregr_out;
  log_likelihood
------------------
 -426986.83683879
(1 row)

</code>



Is this procedure correct?
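
Related to question (4) below: if I read the docs correctly,
madlib.logregr_predict(coef, vec) returns the predicted boolean class, so I
assume training accuracy can be computed with a query like this (my own
sketch, assuming labeli = 1 is the positive class):

<code>

SELECT avg((madlib.logregr_predict(m.coef, f.vec) = (f.labeli = 1))::int)
       AS training_accuracy
FROM forest f, forest_logregr_out m;

</code>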



*(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow:*



While training on a 2,000-dimensional dense dataset (epsilon_sample_10)
with only *10 tuples* as follows, MADlib does not finish within 5 hours,
*even for a single iteration*. CPU usage stays at 100% throughout the
execution. The dataset is available at
https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.



<code>

mldb=# \d epsilon_sample_10
                        Table "public.epsilon_sample_10"
 Column |        Type        |                           Modifiers
--------+--------------------+-----------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT count(*) from epsilon_sample_10;
 count
-------
    10
(1 row)

Time: 1.456 ms

mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');

</code>



*In this case, it is not possible to train the whole epsilon dataset (with
400,000 tuples) in a reasonable time. My guess is that the problem is
related to TOAST, since epsilon's vectors are high-dimensional and are
therefore stored compressed by TOAST. However, are there any other reasons
for such slow execution?*
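
To test the TOAST guess, I tried the following with standard PostgreSQL
features only (the table name epsilon_plain is my own; SET STORAGE EXTERNAL
keeps the arrays out-of-line but uncompressed, which should isolate any
decompression overhead):

<code>

-- on-disk (possibly compressed) size of one stored vector
SELECT pg_column_size(vec) FROM epsilon_sample_10 LIMIT 1;

-- copy the data into a table whose vec column is never compressed
CREATE TABLE epsilon_plain (LIKE epsilon_sample_10 INCLUDING ALL);
ALTER TABLE epsilon_plain ALTER COLUMN vec SET STORAGE EXTERNAL;
INSERT INTO epsilon_plain SELECT * FROM epsilon_sample_10;

</code>

If training on epsilon_plain is much faster, decompression is at least part
of the story.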



*(3) For MADlib, is the dataset table scanned once or twice in each
iteration?*

I know that, in each iteration, MADlib needs to scan the dataset table once
to perform IGD/SGD over the whole dataset. My question is: *at the end of
each iteration*, does MADlib scan the table again to compute the training
loss/accuracy?
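
For what it's worth, I tried to check this myself with the standard
statistics view pg_stat_user_tables (the counters are updated
asynchronously, so I waited a moment between queries; forest_scan_test is a
throwaway output table name of my own):

<code>

SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'forest';

SELECT madlib.logregr_train('forest', 'forest_scan_test',
                            'labeli', 'vec', NULL, 1, 'igd');

-- the delta in seq_scan should show how many scans one iteration costs
SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'forest';

</code>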



*(4) Is it possible to output the training metrics, such as training loss
and accuracy, after each iteration?*

Currently, it seems that MADlib only outputs the log-likelihood at the end
of the SQL execution.
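
As a workaround, I considered simply retraining with max_iter = 1, 2, …, k
and recording the log-likelihood each time. A sketch of what I mean
(forest_iter_out is a placeholder name; this costs O(k^2) scans, so it is
only usable as a diagnostic, and it assumes each rerun deterministically
repeats the earlier iterations):

<code>

DO $$
DECLARE k integer;
BEGIN
  FOR k IN 1..20 LOOP
    -- logregr_train also creates an <output>_summary table
    DROP TABLE IF EXISTS forest_iter_out, forest_iter_out_summary;
    PERFORM madlib.logregr_train('forest', 'forest_iter_out',
                                 'labeli', 'vec', NULL, k, 'igd');
    RAISE NOTICE 'iter %: log_likelihood = %',
      k, (SELECT log_likelihood FROM forest_iter_out);
  END LOOP;
END $$;

</code>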



*(5) Do MADlib's Logistic Regression and SVM support sparse datasets?*

I also have some sparse datasets denoted as 'feature_index_vec_array,
feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can
I train LR and SVM on these sparse datasets with MADlib?
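
If not directly, I assume I can densify on the fly with plain SQL before
calling the training functions, something like this sketch (to_dense is a
hypothetical helper of my own; dim is the full feature dimension):

<code>

CREATE OR REPLACE FUNCTION to_dense(idx integer[], val double precision[],
                                    dim integer)
RETURNS double precision[] AS $$
  -- position i gets its value if i appears in idx, otherwise 0.0
  SELECT array_agg(coalesce(v.val, 0.0) ORDER BY d.i)
  FROM generate_series(1, dim) AS d(i)
  LEFT JOIN unnest(idx, val) AS v(idx, val) ON v.idx = d.i
$$ LANGUAGE sql IMMUTABLE;

-- e.g. to_dense(ARRAY[1,3,5], ARRAY[0.1,0.2,0.3], 6)
--      => {0.1,0,0.2,0,0.3,0}

</code>

But I would prefer a native sparse representation if one exists.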



Many thanks for reviewing my questions.





Best regards,



Lijie
