Hello,

Thank you for the questions.
(0) Not sure if you are using Postgres just for development or production, but keep in mind that MADlib is designed to run on a distributed MPP database (Greenplum) with large datasets. It runs fine on Postgres, but Postgres won't scale to very large datasets, or it will just be too slow. Also see the Jupyter notebooks at https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning for other examples, in case they are of use.

(1) There are two problems with your dataset for logistic regression:

(i) As per http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html, the dependent variable must be a Boolean value or an expression that evaluates to Boolean. Your data has a dependent variable of -1, and Postgres does not evaluate -1 to FALSE, so you should change the -1 to 0, i.e., use 0 for FALSE and 1 for TRUE in Postgres (https://www.postgresql.org/docs/12/datatype-boolean.html).

(ii) An intercept is not assumed, so it is common to provide an explicit intercept term by including a single constant 1 in the independent variable list; see the example at http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples

That is why the log_likelihood value is too big; that model is not right. A sketch of both fixes is below.
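For example, a minimal sketch of both fixes, using the 'forest' table from your question (the forest_prepped name is just a placeholder):

<code>
-- Sketch only: map the -1/+1 label to 0/1 and prepend a constant 1.0
-- to the feature array so the model has an explicit intercept term.
CREATE TABLE forest_prepped AS
SELECT
    did,
    array_prepend(1.0::double precision, vec) AS vec,  -- explicit intercept
    CASE WHEN labeli = 1 THEN 1 ELSE 0 END AS labeli   -- -1 becomes 0
FROM forest;

SELECT madlib.logregr_train(
    'forest_prepped',              -- source table
    'forest_prepped_logregr_out',  -- output table
    'labeli',                      -- dependent variable (0/1)
    'vec',                         -- features, including the intercept term
    NULL,                          -- grouping columns
    20,                            -- max number of iterations
    'irls'                         -- optimizer
);
</code>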
(2) If you make the fixes above in (1), it should run OK. Here are my results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with 10 tuples:

DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
CREATE TABLE epsilon_sample_10v2 (
    did serial,
    vec double precision[],
    labeli integer
);
COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
{1.0,-0.0108282,-0.0196004,0.0422148,...} 0
{1.0,0.00250835,0.0168447,-0.0102934,...} 1
etc.

SELECT madlib.logregr_train('epsilon_sample_10v2', 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
 logregr_train
---------------

(1 row)

Time: 317046.342 ms (05:17.046)

madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
  log_likelihood
-------------------
 -6.93147180559945
(1 row)

(3) The dataset is not scanned again at the end of every iteration to compute training loss/accuracy. It should only be scanned once per iteration, for the optimization.

(4) I thought the verbose parameter should do that, but it does not seem to be working for me. Will need to look into it more.

(5) Logistic regression and SVM do not currently support the sparse matrix format (http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html).
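If you still want to run LR or SVM on those sparse datasets, one workaround is to expand them into the dense array format first. Here is a minimal sketch in plain SQL, assuming a hypothetical table sparse_data(did, idx, val, labeli) holding the index array, value array and label from your format, with 1-based feature indices and 2,000 features; adjust the names and the dimension to your data:

<code>
-- Sketch only: densify (index array, value array) rows into the
-- double precision[] column that logregr_train and the SVM module expect.
-- sparse_data, idx, val and the dimension 2000 are placeholders.
CREATE TABLE dense_data AS
SELECT
    s.did,
    ARRAY(
        SELECT COALESCE(s.val[array_position(s.idx, d)], 0.0::double precision)
        FROM generate_series(1, 2000) AS d   -- 2000 = feature dimension
        ORDER BY d
    ) AS vec,
    CASE WHEN s.labeli = 1 THEN 1 ELSE 0 END AS labeli   -- -1 becomes 0
FROM sparse_data AS s;
</code>

Note that this stores the zeros explicitly, so for very sparse, high-dimensional data the dense table can get large.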
Frank

________________________________
From: Lijie Xu <csxuli...@gmail.com>
Sent: Saturday, July 3, 2021 1:21 PM
To: user@madlib.apache.org <user@madlib.apache.org>
Subject: Long execution time on MADlib

Hi All,

I'm Lijie and now performing some experiments on MADlib. I found that MADlib runs very slowly on some datasets, so I would like to justify my settings. Could you help me check the following settings and code? Sorry for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.

(1) Could you help check whether the data format and scripts I used are right for an n-dimensional dataset?

I have some training datasets, and each of them has a dense feature array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the 'forest' dataset (581K tuples) with a 54-dimensional feature array and a class label, I first stored it into PostgreSQL using

<code>
CREATE TABLE forest (
    did serial,
    vec double precision[],
    labeli integer);

COPY forest (vec, labeli) FROM STDIN;
'[0.1, 0.2, …, 1.0], -1'
'[0.3, 0.1, …, 0.9], 1'
…
</code>

Then, to run the Logistic Regression on this dataset, I use the following code:

<code>
mldb=# \d forest
                             Table "public.forest"
 Column |        Type        |                      Modifiers
--------+--------------------+------------------------------------------------------
 did    | integer            | not null default nextval('forest_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT madlib.logregr_train(
mldb(#     'forest',              -- source table
mldb(#     'forest_logregr_out',  -- output table
mldb(#     'labeli',              -- labels
mldb(#     'vec',                 -- features
mldb(#     NULL,                  -- grouping columns
mldb(#     20,                    -- max number of iterations
mldb(#     'igd'                  -- optimizer
mldb(# );
Time: 198911.350 ms
</code>

After about 199s, I got the output table as:

<code>
mldb=# \d forest_logregr_out
           Table "public.forest_logregr_out"
          Column          |        Type        | Modifiers
--------------------------+--------------------+-----------
 coef                     | double precision[] |
 log_likelihood           | double precision   |
 std_err                  | double precision[] |
 z_stats                  | double precision[] |
 p_values                 | double precision[] |
 odds_ratios              | double precision[] |
 condition_no             | double precision   |
 num_rows_processed       | bigint             |
 num_missing_rows_skipped | bigint             |
 num_iterations           | integer            |
 variance_covariance      | double precision[] |

mldb=# select log_likelihood from forest_logregr_out;
  log_likelihood
------------------
 -426986.83683879
(1 row)
</code>

Is this procedure correct?

(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow.

While training on a 2,000-dimensional dense dataset (epsilon_sample_10) with only 10 tuples as follows, MADlib does not finish in 5 hours for only 1 iteration. The CPU usage is always 100% during the execution. The dataset is available at https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql.

<code>
mldb=# \d epsilon_sample_10
                              Table "public.epsilon_sample_10"
 Column |        Type        |                            Modifiers
--------+--------------------+------------------------------------------------------------------
 did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
 vec    | double precision[] |
 labeli | integer            |

mldb=# SELECT count(*) from epsilon_sample_10;
 count
-------
    10
(1 row)
Time: 1.456 ms

mldb=# SELECT madlib.logregr_train('epsilon_sample_10', 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
</code>

In this case, it is not possible to train the whole epsilon dataset (with 400,000 tuples) in a reasonable time. I guess that this problem is related to TOAST, since epsilon has a high dimension and is compressed by TOAST. However, are there any other reasons for this slow execution?

(3) For MADlib, is the dataset table scanned once or twice in each iteration? I know that, in each iteration, MADlib needs to scan the dataset table once to perform IGD/SGD on the whole dataset. My question is: at the end of each iteration, will MADlib scan the table again to compute the training loss/accuracy?

(4) Is it possible to output the training metrics, such as training loss and accuracy, after each iteration? Currently, it seems that MADlib only outputs the log-likelihood at the end of the SQL execution.

(5) Do MADlib's Logistic Regression and SVM support sparse datasets? I also have some sparse datasets denoted as 'feature_index_vec_array, feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can I train these sparse datasets on MADlib using LR and SVM?

Many thanks for reviewing my questions.

Best regards,
Lijie