Hi All,
I’m Lijie, and I’m currently running some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to verify my settings. Could you help me check the following settings and code? Sorry for the long email. I used the latest MADlib 1.18 on PostgreSQL 12.

*(1) Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?*

I have some training datasets, each of which has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the ‘forest’ dataset (581K tuples) with a 54-dimensional feature array and a class label, I first loaded it into PostgreSQL using:

<code>
CREATE TABLE forest (
    did    serial,
    vec    double precision[],
    labeli integer
);

COPY forest (vec, labeli) FROM STDIN;
{0.1,0.2,…,1.0}	-1
{0.3,0.1,…,0.9}	1
…
\.
</code>

Then, to run logistic regression on this dataset, I used the following:

<code>
mldb=# \d forest
              Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',             -- source table
mldb(#     'forest_logregr_out', -- output table
mldb(#     'labeli',             -- labels
mldb(#     'vec',                -- features
mldb(#     NULL,                 -- grouping columns
mldb(#     20,                   -- max number of iterations
mldb(#     'igd'                 -- optimizer
mldb(# );
Time: 198911.350 ms
</code>

After about 199 s, I got the output table:

<code>
mldb=# \d forest_logregr_out
       Table "public.forest_logregr_out"
          Column          |        Type        | Modifiers
--------------------------+--------------------+-----------
 coef                     | double precision[] |
 log_likelihood           | double precision   |
 std_err                  | double precision[] |
 z_stats                  | double precision[] |
 p_values                 | double precision[] |
 odds_ratios              | double precision[] |
 condition_no             | double precision   |
 num_rows_processed       | bigint             |
 num_missing_rows_skipped | bigint             |
 num_iterations           | integer            |
 variance_covariance      | double precision[] |

mldb=# SELECT log_likelihood FROM forest_logregr_out;
  log_likelihood
------------------
 -426986.83683879
(1 row)
</code>

Is this procedure correct? (A small sanity check I run after loading is in P.S. 1 at the end of this email.)

*(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow.*

When training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only *10 tuples*, as shown below, MADlib does not finish within 5 hours *for only 1 iteration*. CPU usage stays at 100% throughout the execution. The dataset is available at https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.

<code>
mldb=# \d epsilon_sample_10
               Table "public.epsilon_sample_10"
 Column |        Type        |                            Modifiers
--------+--------------------+------------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT count(*) FROM epsilon_sample_10;
 count
-------
    10
(1 row)
Time: 1.456 ms

mldb=# SELECT madlib.logregr_train('epsilon_sample_10', 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
</code>

At this rate, it is not feasible to train the full epsilon dataset (400,000 tuples) in a reasonable time. I suspect this problem is related to TOAST, since epsilon’s feature arrays are wide and are therefore compressed and stored out of line by TOAST. However, are there any other reasons for such slow execution?
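To test the TOAST guess on my side, I plan to run the checks below. This is only my own sketch against the PostgreSQL catalogs, not anything from the MADlib docs, so please tell me if it cannot actually rule TOAST in or out:

<code>
-- Total on-disk size of the 10-row table, including its TOAST relation
SELECT pg_size_pretty(pg_total_relation_size('epsilon_sample_10'));

-- Storage mode of the array column: 'x' (EXTENDED) allows compression,
-- 'e' (EXTERNAL) stores it out of line but uncompressed
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'epsilon_sample_10'::regclass
  AND attname = 'vec';

-- Disable compression for newly inserted rows; this only affects future
-- inserts, so I would reload the data afterwards and retry the training
ALTER TABLE epsilon_sample_10 ALTER COLUMN vec SET STORAGE EXTERNAL;
</code>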
*(3) For MADlib, is the dataset table scanned once or twice in each iteration?*

I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD on the whole dataset. My question is: *at the end of each iteration*, will MADlib scan the table again to compute the training loss/accuracy? (P.S. 2 shows how I tried to measure this; please correct me if that probe is flawed.)

*(4) Is it possible to output training metrics, such as training loss and accuracy, after each iteration?*

Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution. (My manual workaround is sketched in P.S. 3.)

*(5) Do MADlib’s logistic regression and SVM support sparse datasets?*

I also have some sparse datasets denoted as ‘feature_index_vec_array, feature_value_array, label’, such as ‘[1, 3, 5], [0.1, 0.2, 0.3], -1’. Can I train these sparse datasets on MADlib using LR and SVM? (P.S. 4 shows the representation I had in mind.)

Many thanks for reviewing my questions.

Best regards,
Lijie
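P.S. 1: The sanity check I mentioned under question (1). After the COPY, I verify that every row’s feature array has the expected dimension (54 for ‘forest’); this is just my own check, not something MADlib requires:

<code>
-- Rows whose feature array is not exactly 54-dimensional
-- (array_length returns NULL for NULL arrays, hence IS DISTINCT FROM)
SELECT count(*)
FROM forest
WHERE array_length(vec, 1) IS DISTINCT FROM 54;
</code>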
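P.S. 2: To count table scans per iteration empirically, I compare the sequential-scan counter in the statistics views before and after a one-iteration run. I am not sure this counter captures everything MADlib does internally, so please treat it as a rough probe (the output table name below is just an example):

<code>
-- Snapshot the seq-scan counter before training
SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'forest';

-- Run exactly one IGD iteration
SELECT madlib.logregr_train('forest', 'forest_logregr_out_1iter',
                            'labeli', 'vec', NULL, 1, 'igd');

-- Snapshot again; the difference should be the number of full scans
SELECT seq_scan FROM pg_stat_user_tables WHERE relname = 'forest';
</code>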
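P.S. 3: My current workaround for per-iteration metrics under question (4): retrain with the max number of iterations set to 1, 2, 3, … and recompute the training log-likelihood myself from the stored coefficients after each run. This assumes madlib.array_dot is a plain inner product and that, with +1/-1 labels, the per-row log-likelihood is -ln(1 + exp(-label * <vec, coef>)); please correct me if MADlib defines it differently:

<code>
-- Training log-likelihood recomputed by hand from the model table
SELECT sum(-ln(1.0 + exp(-f.labeli * madlib.array_dot(f.vec, m.coef))))
       AS my_log_likelihood
FROM forest AS f, forest_logregr_out AS m;
</code>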
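P.S. 4: The sparse representation I had in mind for question (5). If I read the svec module docs correctly, something like svec_cast_positions_float8arr should build a sparse vector from my index/value arrays; whether logregr_train and the SVM module then accept svec input is exactly what I am asking above:

<code>
-- Build a sparse vector from positions [1,3,5] and values [0.1,0.2,0.3];
-- 10 is a placeholder for the total feature dimension, 0.0 the default value
SELECT madlib.svec_cast_positions_float8arr(
           ARRAY[1,3,5]::bigint[],
           ARRAY[0.1,0.2,0.3]::float8[],
           10,
           0.0::float8
       );
</code>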