Hello,

Thank you for the questions.
(0) Not sure if you are using Postgres just for development or production, but keep in mind that MADlib is designed to run on a distributed MPP database (Greenplum) with large datasets. It runs fine on Postgres, but Postgres won't scale to very large datasets, or it will just be too slow. Also see the Jupyter notebooks at https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning for other examples, in case they are of use.

(1) There are two problems with your dataset for logistic regression:

(i) As per http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html, the dependent variable must be a Boolean value or an expression that evaluates to Boolean. Your data has a dependent variable of -1, and Postgres does not evaluate -1 to FALSE, so you should change the -1 to 0, i.e., use 0 for FALSE and 1 for TRUE in Postgres (https://www.postgresql.org/docs/12/datatype-boolean.html).

(ii) An intercept is not assumed, so it is common to provide an explicit intercept term by including a single constant 1 in the independent variable list; see the example at http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples

That is why the log_likelihood value is too big; that model is not right. A sketch of both fixes is below.
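For example, a minimal sketch of both fixes, using the 'forest' table from your question (the forest_prepped name is just a placeholder):

<code>
-- Sketch only: map the -1/+1 label to 0/1 and prepend a constant 1.0
-- to the feature array so the model has an explicit intercept term.
CREATE TABLE forest_prepped AS
SELECT
    did,
    array_prepend(1.0::double precision, vec) AS vec,  -- explicit intercept
    CASE WHEN labeli = 1 THEN 1 ELSE 0 END AS labeli   -- -1 becomes 0
FROM forest;

SELECT madlib.logregr_train(
    'forest_prepped',              -- source table
    'forest_prepped_logregr_out',  -- output table
    'labeli',                      -- dependent variable (0/1)
    'vec',                         -- features, including the intercept term
    NULL,                          -- grouping columns
    20,                            -- max number of iterations
    'irls'                         -- optimizer
);
</code>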
(2) If you make the fixes above in (1), it should run OK. Here are my results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with 10 tuples:

DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
CREATE TABLE epsilon_sample_10v2 (
    did serial,
    vec double precision[],
    labeli integer
);
COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
{1.0,-0.0108282,-0.0196004,0.0422148,...} 0
{1.0,0.00250835,0.0168447,-0.0102934,...} 1
etc.

SELECT madlib.logregr_train('epsilon_sample_10v2', 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
 logregr_train
---------------

(1 row)

Time: 317046.342 ms (05:17.046)

madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
  log_likelihood
-------------------
 -6.93147180559945
(1 row)

(3) The dataset is not scanned again at the end of every iteration to compute training loss/accuracy. It should only be scanned once per iteration, for the optimization.

(4) I thought the verbose parameter should do that, but it does not seem to be working for me. Will need to look into it more.

(5) Logistic regression and SVM do not currently support the sparse matrix format (http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html).
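If you still want to run LR or SVM on those sparse datasets, one workaround is to expand them into the dense array format first. Here is a minimal sketch in plain SQL, assuming a hypothetical table sparse_data(did, idx, val, labeli) holding the index array, value array and label from your format, with 1-based feature indices and 2,000 features; adjust the names and the dimension to your data:

<code>
-- Sketch only: densify (index array, value array) rows into the
-- double precision[] column that logregr_train and the SVM module expect.
-- sparse_data, idx, val and the dimension 2000 are placeholders.
CREATE TABLE dense_data AS
SELECT
    s.did,
    ARRAY(
        SELECT COALESCE(s.val[array_position(s.idx, d)], 0.0::double precision)
        FROM generate_series(1, 2000) AS d   -- 2000 = feature dimension
        ORDER BY d
    ) AS vec,
    CASE WHEN s.labeli = 1 THEN 1 ELSE 0 END AS labeli   -- -1 becomes 0
FROM sparse_data AS s;
</code>

Note that this stores the zeros explicitly, so for very sparse, high-dimensional data the dense table can get large.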
Frank

________________________________
From: Lijie Xu <csxuli...@gmail.com>
Sent: Saturday, July 3, 2021 1:21 PM
To: user@madlib.apache.org <user@madlib.apache.org>
Subject: Long execution time on MADlib

Hi All,

I'm Lijie and now performing some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to justify my settings. Could you help me check the following settings and code? Sorry for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.

(1) Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?

I have some training datasets, and each of them has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the 'forest' dataset (581K tuples) with a 54-dimensional feature array and a class label, I first stored it into PostgreSQL using

<code>
CREATE TABLE forest (
    did serial,
    vec double precision[],
    labeli integer);

COPY forest (vec, labeli) FROM STDIN;
'[0.1, 0.2, …, 1.0], -1'
'[0.3, 0.1, …, 0.9], 1'
…
</code>

Then, to run the Logistic Regression on this dataset, I use the following code:

<code>
mldb=# \d forest
                             Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',              -- source table
mldb(#     'forest_logregr_out',  -- output table
mldb(#     'labeli',              -- labels
mldb(#     'vec',                 -- features
mldb(#     NULL,                  -- grouping columns
mldb(#     20,                    -- max number of iterations
mldb(#     'igd'                  -- optimizer
mldb(# );
Time: 198911.350 ms
</code>

After about 199s, I got the output table as:

<code>
mldb=# \d forest_logregr_out
           Table "public.forest_logregr_out"
          Column          |        Type        | Modifiers
--------------------------+--------------------+-----------
 coef                     | double precision[] |
 log_likelihood           | double precision   |
 std_err                  | double precision[] |
 z_stats                  | double precision[] |
 p_values                 | double precision[] |
 odds_ratios              | double precision[] |
 condition_no             | double precision   |
 num_rows_processed       | bigint             |
 num_missing_rows_skipped | bigint             |
 num_iterations           | integer            |
 variance_covariance      | double precision[] |

mldb=# select log_likelihood from forest_logregr_out;
  log_likelihood
------------------
 -426986.83683879
(1 row)
</code>

Is this procedure correct?

(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow.

While training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only 10 tuples as follows, MADlib does not finish in 5 hours for only 1 iteration. The CPU usage is always 100% during the execution. The dataset is available at https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.

<code>
mldb=# \d epsilon_sample_10
                              Table "public.epsilon_sample_10"
 Column |        Type        |                            Modifiers
--------+--------------------+------------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT count(*) from epsilon_sample_10;
 count
-------
    10
(1 row)
Time: 1.456 ms

mldb=# SELECT madlib.logregr_train('epsilon_sample_10', 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
</code>

In this case, it is not possible to train the whole epsilon dataset (with 400,000 tuples) in a reasonable time. I guess that this problem is related to TOAST, since epsilon has a high dimension and is compressed by TOAST. However, are there any other reasons for this slow execution?

(3) For MADlib, is the dataset table scanned once or twice in each iteration? I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD on the whole dataset. My question is: at the end of each iteration, will MADlib scan the table again to compute the training loss/accuracy?

(4) Is it possible to output the training metrics, such as training loss and accuracy, after each iteration? Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution.

(5) Do MADlib's Logistic Regression and SVM support sparse datasets? I also have some sparse datasets denoted as 'feature_index_vec_array, feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can I train these sparse datasets on MADlib using LR and SVM?

Many thanks for reviewing my questions.

Best regards,
Lijie