Hi Lijie,

I implemented logregr with incremental gradient descent a few years ago.
Unfortunately, at that time we chose to hard-code a constant step size.
Luckily, you can edit the code as you need.
Here are the pointers:

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918

Good luck!
Aaron

On Thu, Jul 15, 2021 at 22:14, Lijie Xu <csxuli...@gmail.com> wrote:

> Dear Frank,
>
> Sorry for the late reply and thanks for your great help. I'm doing some
> research work on MADlib and will follow your advice to test MADlib again.
> Another question is whether MADlib LR supports tuning the learning_rate.
>
> In MADlib SVM, there is a 'params' argument in 'svm_classification' to
> tune 'init_stepsize' and 'decay_factor', as follows:
>
> svm_classification(
>     source_table,
>     model_table,
>     dependent_varname,
>     independent_varname,
>     kernel_func,
>     kernel_params,
>     grouping_col,
>     params,
>     verbose
> )
>
> However, I did not see this 'params' argument in LR:
>
> logregr_train(
>     source_table,
>     out_table,
>     dependent_varname,
>     independent_varname,
>     grouping_cols,
>     max_iter,
>     optimizer,
>     tolerance,
>     verbose
> )
>
> In addition, I checked the Generalized Linear Models module, and its
> 'optim_params' parameter seems to support tuning only 'tolerance',
> 'max_iter', and 'optimizer'.
> Is there a way to tune 'init_stepsize' and 'decay_factor' in LR? Thanks!
>
> Best,
> Lijie
>
> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquil...@vmware.com> wrote:
>
>> Hello,
>>
>> Thank you for the questions.
>>
>> (0)
>> Not sure if you are using Postgres just for development or production,
>> but keep in mind that MADlib is designed to run on a distributed MPP
>> database (Greenplum) with large datasets. It runs fine on Postgres, but
>> obviously Postgres won't scale to very large datasets, or it will just
>> be too slow.
>>
>> Also see the Jupyter notebooks here
>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>> for other examples, in case they are of use.
>>
>> (1)
>> There are 2 problems with your dataset for logistic regression:
>>
>> (i)
>> As per
>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>> the dependent variable is a boolean or an expression that evaluates to
>> boolean. Your data has a dependent variable of -1, but Postgres does not
>> evaluate -1 to FALSE, so you should change the -1 to 0, i.e., use 0 for
>> FALSE and 1 for TRUE in Postgres:
>> https://www.postgresql.org/docs/12/datatype-boolean.html
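>>
>> For example, on your 'forest' table either of these should do it (an
>> untested sketch):
>>
>> -- option 1: recode the labels in place so that 0 means FALSE and 1 means TRUE
>> UPDATE forest SET labeli = 0 WHERE labeli = -1;
>>
>> -- option 2: leave the data as is and pass a boolean expression as the
>> -- dependent variable in logregr_train(), e.g. 'labeli = 1' instead of 'labeli'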
>>
>> (ii)
>> An intercept variable is not assumed, so it is common to provide an
>> explicit intercept term by including a single constant 1 term in the
>> independent variable list. See the example here:
>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
>>
>> That is why the log_likelihood value is too big; that model is not right.
>>
>> (2)
>> If you make the fixes above in (1), it should run OK. Here are my results
>> on PostgreSQL 11.6 using MADlib 1.18.0, on the dataset with 10 tuples:
>>
>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>
>> CREATE TABLE epsilon_sample_10v2 (
>>     did serial,
>>     vec double precision[],
>>     labeli integer
>> );
>>
>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>> etc.
>>
>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>     'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>
>>  logregr_train
>> ---------------
>>
>> (1 row)
>>
>> Time: 317046.342 ms (05:17.046)
>>
>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>   log_likelihood
>> -------------------
>>  -6.93147180559945
>> (1 row)
>>
>> (3)
>> The dataset is not scanned again at the end of every iteration to compute
>> training loss/accuracy. It should only be scanned once per iteration, for
>> the optimization.
>>
>> (4)
>> I thought the verbose parameter should do that, but it does not seem to
>> be working for me. Will need to look into it more.
>>
>> (5)
>> Logistic regression and SVM do not currently support sparse matrix format:
>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
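>>
>> If your sparse data is stored as an index array plus a value array, one
>> workaround is to densify it in plain SQL before training. A rough sketch,
>> assuming a hypothetical table sparse_data(did, idx int[], vals float8[],
>> labeli int) with 1-based indices and a known dimension of 54 (adjust both
>> to your data):
>>
>> -- sketch: expand (idx, vals) pairs into a dense double precision[] of length 54
>> CREATE TABLE dense_data AS
>> SELECT s.did,
>>        (SELECT array_agg(CASE WHEN i = ANY (s.idx)
>>                               THEN s.vals[array_position(s.idx, i)]
>>                               ELSE 0.0 END
>>                          ORDER BY i)
>>           FROM generate_series(1, 54) AS i) AS vec,  -- zeros where no index is present
>>        s.labeli
>> FROM sparse_data s;
>>
>> The dense 'vec' column can then be passed to logregr_train() or
>> svm_classification() as usual.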
>>
>> Frank
>>
>> ------------------------------
>> *From:* Lijie Xu <csxuli...@gmail.com>
>> *Sent:* Saturday, July 3, 2021 1:21 PM
>> *To:* user@madlib.apache.org <user@madlib.apache.org>
>> *Subject:* Long execution time on MADlib
>>
>> Hi All,
>>
>> I'm Lijie, and I am now performing some experiments on MADlib. I found
>> that MADlib runs very slowly on some datasets, so I would like to verify
>> my settings. Could you help me check the following settings and code?
>> Sorry for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>
>> *(1) Could you help check whether the data format and scripts I used are
>> right for an n-dimensional dataset?*
>>
>> I have some training datasets, and each of them has a dense feature array
>> (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the
>> 'forest' dataset (581K tuples) with a 54-dimensional feature array and a
>> class label, I first stored it in PostgreSQL using:
>>
>> <code>
>> CREATE TABLE forest (
>>     did serial,
>>     vec double precision[],
>>     labeli integer);
>>
>> COPY forest (vec, labeli) FROM STDIN;
>> '[0.1, 0.2, …, 1.0], -1'
>> '[0.3, 0.1, …, 0.9], 1'
>> …
>> </code>
>>
>> Then, to run Logistic Regression on this dataset, I use the following code:
>>
>> <code>
>> mldb=# \d forest
>>                             Table "public.forest"
>>  Column |        Type        |                      Modifiers
>> --------+--------------------+------------------------------------------------------
>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>  vec    | double precision[] |
>>  labeli | integer            |
>>
>> mldb=# SELECT madlib.logregr_train(
>> mldb(#     'forest',               -- source table
>> mldb(#     'forest_logregr_out',   -- output table
>> mldb(#     'labeli',               -- labels
>> mldb(#     'vec',                  -- features
>> mldb(#     NULL,                   -- grouping columns
>> mldb(#     20,                     -- max number of iterations
>> mldb(#     'igd'                   -- optimizer
>> mldb(# );
>>
>> Time: 198911.350 ms
>> </code>
>>
>> After about 199 s, I got the following output table:
>>
>> <code>
>> mldb=# \d forest_logregr_out
>>            Table "public.forest_logregr_out"
>>           Column           |        Type        | Modifiers
>> --------------------------+--------------------+-----------
>>  coef                     | double precision[] |
>>  log_likelihood           | double precision   |
>>  std_err                  | double precision[] |
>>  z_stats                  | double precision[] |
>>  p_values                 | double precision[] |
>>  odds_ratios              | double precision[] |
>>  condition_no             | double precision   |
>>  num_rows_processed       | bigint             |
>>  num_missing_rows_skipped | bigint             |
>>  num_iterations           | integer            |
>>  variance_covariance      | double precision[] |
>>
>> mldb=# select log_likelihood from forest_logregr_out;
>>   log_likelihood
>> ------------------
>>  -426986.83683879
>> (1 row)
>> </code>
>>
>> Is this procedure correct?
>>
>> *(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow:*
>>
>> While training on a 2,000-dimensional dense dataset (epsilon_sample_10)
>> with only *10 tuples* as follows, MADlib does not finish within 5 hours
>> *for only 1 iteration*. The CPU usage is always 100% during the execution.
>> The dataset is available at
>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
>>
>> <code>
>> mldb=# \d epsilon_sample_10
>>                          Table "public.epsilon_sample_10"
>>  Column |        Type        |                            Modifiers
>> --------+--------------------+-----------------------------------------------------------------
>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>  vec    | double precision[] |
>>  labeli | integer            |
>>
>> mldb=# SELECT count(*) from epsilon_sample_10;
>>  count
>> -------
>>     10
>> (1 row)
>>
>> Time: 1.456 ms
>>
>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>     'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>> </code>
>>
>> *In this case, it is not possible to train the whole epsilon dataset
>> (with 400,000 tuples) in a reasonable time. I guess that this problem is
>> related to TOAST, since epsilon has a high dimension and is compressed by
>> TOAST. However, are there any other reasons for such slow execution?*
>>
>> *(3) For MADlib, is the dataset table scanned once or twice in each
>> iteration?*
>>
>> I know that, in each iteration, MADlib needs to scan the dataset table
>> once to perform IGD/SGD on the whole dataset. My question is whether, *at
>> the end of each iteration*, MADlib will scan the table again to compute
>> the training loss/accuracy.
>>
>> *(4) Is it possible to output training metrics, such as training loss
>> and accuracy, after each iteration?*
>>
>> Currently, it seems that MADlib only outputs the log-likelihood at the
>> end of the SQL execution.
>>
>> *(5) Do MADlib's Logistic Regression and SVM support sparse datasets?*
>>
>> I also have some sparse datasets denoted as 'feature_index_vec_array,
>> feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can
>> I train these sparse datasets on MADlib using LR and SVM?
>>
>> Many thanks for reviewing my questions.
>>
>> Best regards,
>>
>> Lijie