Dear Aaron, Thanks for your advice. I will try it.
In addition, after following Frank's guide, I found that MADlib LR and SVM work normally on some low-dimensional (e.g., 18-28 features) datasets, even with >1 million tuples. However, on a high-dimensional dataset such as the epsilon dataset with 400,000 tuples and 2,000 features (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), MADlib SVM can finish 20 iterations in a reasonable time, but MADlib LR (with IGD) cannot finish even 2 iterations in several hours. Any ideas about this problem? Thanks!

Best,
Lijie

On Thu, Jul 15, 2021 at 4:03 PM FENG, Xixuan (Aaron) <xixuan.f...@gmail.com> wrote:

> Hi Lijie,
>
> I implemented the logregr with incremental gradient descent a few years
> ago. Unfortunately, at that time we chose to hard-code the constant
> step size. But luckily you can edit the code as you need.
>
> Here are the pointers:
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
>
> https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918
>
> Good luck!
> Aaron
>
> On Thu, Jul 15, 2021 at 22:14 Lijie Xu <csxuli...@gmail.com> wrote:
>
>> Dear Frank,
>>
>> Sorry for the late reply and thanks for your great help. I'm doing some
>> research work on MADlib. I will follow your advice to test MADlib again.
>> Another question: does MADlib LR support tuning the learning rate?
>>
>> In MADlib SVM, there is a 'params' argument in 'svm_classification' for
>> tuning 'init_stepsize' and 'decay_factor', as follows.
>>
>> svm_classification(
>>     source_table,
>>     model_table,
>>     dependent_varname,
>>     independent_varname,
>>     kernel_func,
>>     kernel_params,
>>     grouping_col,
>>     params,
>>     verbose
>> )
>>
>> However, I did not see such a 'params' argument in LR:
>>
>> logregr_train(
>>     source_table,
>>     out_table,
>>     dependent_varname,
>>     independent_varname,
>>     grouping_cols,
>>     max_iter,
>>     optimizer,
>>     tolerance,
>>     verbose
>> )
>>
>> In addition, I checked the Generalized Linear Models module, and its
>> 'optim_params' parameter seems to support tuning only 'tolerance',
>> 'max_iter', and 'optimizer'.
>> Is there a way to tune the 'init_stepsize' and 'decay_factor' in LR?
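>> For reference, the kind of call I mean for SVM is sketched below, where
>> the step-size settings go into the 'params' string (just a sketch; the
>> table, column, and output names are placeholders from my own setup):
>>
>> SELECT madlib.svm_classification(
>>     'epsilon_sample_10',      -- source_table
>>     'epsilon_svm_out',        -- model_table
>>     'labeli',                 -- dependent_varname
>>     'vec',                    -- independent_varname
>>     'linear',                 -- kernel_func
>>     NULL,                     -- kernel_params
>>     NULL,                     -- grouping_col
>>     'init_stepsize=0.01, decay_factor=0.9, max_iter=20'  -- params
>> );
>>
>> I am looking for the equivalent knobs in logregr_train.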
>> Thanks!
>>
>> Best,
>> Lijie
>>
>> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquil...@vmware.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Thank you for the questions.
>>>
>>> (0)
>>> Not sure if you are using Postgres just for development or production,
>>> but keep in mind that MADlib is designed to run on a distributed MPP
>>> database (Greenplum) with large datasets. It runs fine on Postgres, but
>>> obviously Postgres won't scale to very large datasets or it will just be
>>> too slow.
>>>
>>> Also see the Jupyter notebooks here
>>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>>> for other examples, in case they are of use.
>>>
>>> (1)
>>> There are 2 problems with your dataset for logistic regression:
>>>
>>> (i)
>>> As per
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>>> the dependent variable is a boolean or an expression that evaluates to
>>> boolean.
>>> - Your data has a dependent variable of -1, but Postgres does not
>>> evaluate -1 to FALSE, so you should change the -1 to 0;
>>> i.e., use 0 for FALSE and 1 for TRUE in Postgres:
>>> https://www.postgresql.org/docs/12/datatype-boolean.html
>>>
>>> (ii)
>>> An intercept variable is not assumed, so it is common to provide an
>>> explicit intercept term by including a single constant 1 term in the
>>> independent variable list.
>>> - See the example here:
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
>>>
>>> That is why the log_likelihood value is too big; that model is not right.
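>>> As a sketch of applying (i) and (ii) to your 'forest' table (untested;
>>> 'forest_logregr_out2' is just a fresh output table name):
>>>
>>> UPDATE forest SET labeli = 0 WHERE labeli = -1;            -- (i) labels must be 0/1, not -1/+1
>>> UPDATE forest SET vec = array_prepend(1.0::float8, vec);   -- (ii) add an explicit intercept term
>>>
>>> SELECT madlib.logregr_train('forest', 'forest_logregr_out2',
>>>                             'labeli', 'vec', NULL, 20, 'igd');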
>>> (2)
>>> If you make the fixes above in (1), it should run OK. Here are my
>>> results on PostgreSQL 11.6 using MADlib 1.18.0 on the dataset with
>>> 10 tuples:
>>>
>>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>>
>>> CREATE TABLE epsilon_sample_10v2 (
>>>     did serial,
>>>     vec double precision[],
>>>     labeli integer
>>> );
>>>
>>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>>> etc.
>>>
>>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>> 'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>>
>>> logregr_train
>>> ---------------
>>>
>>> (1 row)
>>>
>>> Time: 317046.342 ms (05:17.046)
>>>
>>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>> log_likelihood
>>> -------------------
>>> -6.93147180559945
>>> (1 row)
>>>
>>> (3)
>>> - The dataset is not scanned again at the end of every iteration to
>>> compute training loss/accuracy. It should only be scanned once per
>>> iteration, for the optimization itself.
>>>
>>> (4)
>>> - I thought the verbose parameter should do that, but it does not seem
>>> to be working for me. Will need to look into it more.
>>>
>>> (5)
>>> - Logistic regression and SVM do not currently support sparse matrix
>>> format:
>>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
>>>
>>> Frank
>>>
>>> ------------------------------
>>> From: Lijie Xu <csxuli...@gmail.com>
>>> Sent: Saturday, July 3, 2021 1:21 PM
>>> To: user@madlib.apache.org <user@madlib.apache.org>
>>> Subject: Long execution time on MADlib
>>>
>>> Hi All,
>>>
>>> I'm Lijie, and I am performing some experiments on MADlib. I found that
>>> MADlib runs very slowly on some datasets, so I would like to verify my
>>> settings. Could you help me check the following settings and code? Sorry
>>> for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>>
>>> (1) Could you help check whether the data format and scripts I used are
>>> right for an n-dimensional dataset?
>>>
>>> I have some training datasets, and each of them has a dense feature
>>> array (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example,
>>> for the 'forest' dataset (581K tuples) with a 54-dimensional feature
>>> array and a class label, I first stored it into PostgreSQL using
>>>
>>> <code>
>>> CREATE TABLE forest (
>>>     did serial,
>>>     vec double precision[],
>>>     labeli integer);
>>>
>>> COPY forest (vec, labeli) FROM STDIN;
>>> '[0.1, 0.2, …, 1.0], -1'
>>> '[0.3, 0.1, …, 0.9], 1'
>>> …
>>> </code>
>>>
>>> Then, to run Logistic Regression on this dataset, I use the following
>>> code:
>>>
>>> <code>
>>> mldb=# \d forest
>>>                       Table "public.forest"
>>>  Column |        Type        |                      Modifiers
>>> --------+--------------------+------------------------------------------------------
>>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>>  vec    | double precision[] |
>>>  labeli | integer            |
>>>
>>> mldb=# SELECT madlib.logregr_train(
>>> mldb(#     'forest',              -- source table
>>> mldb(#     'forest_logregr_out',  -- output table
>>> mldb(#     'labeli',              -- labels
>>> mldb(#     'vec',                 -- features
>>> mldb(#     NULL,                  -- grouping columns
>>> mldb(#     20,                    -- max number of iterations
>>> mldb(#     'igd'                  -- optimizer
>>> mldb(# );
>>>
>>> Time: 198911.350 ms
>>> </code>
>>>
>>> After about 199 s, I got the output table as:
>>>
>>> <code>
>>> mldb=# \d forest_logregr_out
>>>           Table "public.forest_logregr_out"
>>>           Column           |        Type        | Modifiers
>>> --------------------------+--------------------+-----------
>>>  coef                     | double precision[] |
>>>  log_likelihood           | double precision   |
>>>  std_err                  | double precision[] |
>>>  z_stats                  | double precision[] |
>>>  p_values                 | double precision[] |
>>>  odds_ratios              | double precision[] |
>>>  condition_no             | double precision   |
>>>  num_rows_processed       | bigint             |
>>>  num_missing_rows_skipped | bigint             |
>>>  num_iterations           | integer            |
>>>  variance_covariance      | double precision[] |
>>>
>>> mldb=# select log_likelihood from forest_logregr_out;
>>>   log_likelihood
>>> ------------------
>>>  -426986.83683879
>>> (1 row)
>>> </code>
>>>
>>> Is this procedure correct?
>>>
>>> (2) Training on a 2,000-dimensional dense dataset (epsilon) is very
>>> slow:
>>>
>>> While training on a 2,000-dimensional dense dataset (epsilon_sample_10)
>>> with only 10 tuples as follows, MADlib does not finish in 5 hours for
>>> even 1 iteration. The CPU usage is always 100% during the execution.
>>> The dataset is available at
>>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
>>>
>>> <code>
>>> mldb=# \d epsilon_sample_10
>>>                  Table "public.epsilon_sample_10"
>>>  Column |        Type        |                            Modifiers
>>> --------+--------------------+-----------------------------------------------------------------
>>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>>  vec    | double precision[] |
>>>  labeli | integer            |
>>>
>>> mldb=# SELECT count(*) from epsilon_sample_10;
>>>  count
>>> -------
>>>     10
>>> (1 row)
>>>
>>> Time: 1.456 ms
>>>
>>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>> 'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>>> </code>
>>>
>>> In this case, it is not possible to train the whole epsilon dataset
>>> (with 400,000 tuples) in a reasonable time. I guess that this problem is
>>> related to TOAST, since epsilon has a high dimension and is compressed
>>> by TOAST. However, are there any other reasons for such slow execution?
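>>> To check whether TOAST compression is part of the problem, I could try
>>> an uncompressed copy of the table, roughly as sketched below
>>> ('epsilon_uncompressed' is just a placeholder name; EXTERNAL keeps the
>>> array out-of-line but disables compression):
>>>
>>> <code>
>>> CREATE TABLE epsilon_uncompressed (
>>>     did integer,
>>>     vec double precision[],
>>>     labeli integer);
>>> ALTER TABLE epsilon_uncompressed ALTER COLUMN vec SET STORAGE EXTERNAL;
>>> INSERT INTO epsilon_uncompressed SELECT did, vec, labeli FROM epsilon_sample_10;
>>> </code>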
>>> (3) For MADlib, is the dataset table scanned once or twice in each
>>> iteration?
>>>
>>> I know that, in each iteration, MADlib needs to scan the dataset table
>>> once to perform IGD/SGD on the whole dataset. My question is: at the end
>>> of each iteration, will MADlib scan the table again to compute the
>>> training loss/accuracy?
>>>
>>> (4) Is it possible to output training metrics, such as training loss
>>> and accuracy, after each iteration?
>>>
>>> Currently, it seems that MADlib only outputs the log-likelihood at the
>>> end of the SQL execution.
>>>
>>> (5) Do MADlib's Logistic Regression and SVM support sparse datasets?
>>>
>>> I also have some sparse datasets denoted as 'feature_index_vec_array,
>>> feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'.
>>> Can I train these sparse datasets on MADlib using LR and SVM?
>>>
>>> Many thanks for reviewing my questions.
>>>
>>> Best regards,
>>>
>>> Lijie
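>>> P.S. For (5): in case only the dense array format is supported, here is
>>> roughly how I could expand such sparse rows into a dense double
>>> precision[] column myself before calling LR/SVM (a sketch; it assumes a
>>> table named 'sparse_forest', 1-based feature indices, and a known total
>>> dimension of 2,000):
>>>
>>> <code>
>>> SELECT s.did,
>>>        array_agg(coalesce(v.val, 0.0::float8) ORDER BY g.i) AS vec,
>>>        s.labeli
>>> FROM   sparse_forest s
>>> CROSS  JOIN generate_series(1, 2000) AS g(i)        -- one slot per feature
>>> LEFT   JOIN LATERAL unnest(s.feature_index_vec_array,
>>>                            s.feature_value_array) AS v(idx, val)
>>>        ON v.idx = g.i                                -- fill present features, 0 elsewhere
>>> GROUP  BY s.did, s.labeli;
>>> </code>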