Hi Lijie,

I implemented logregr with incremental gradient descent a few years ago.
Unfortunately, at that time we chose to hard-code a constant step size.
Luckily, you can edit the code as you need.
Here are the pointers:

https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L818
https://github.com/apache/madlib/blob/2e34c0f45a6e0f3be224ef58a6f4a576eb8eb89a/src/modules/regress/logistic.cpp#L918

Good luck!
Aaron

On Thu, Jul 15, 2021 at 22:14, Lijie Xu <csxuli...@gmail.com> wrote:

> Dear Frank,
>
> Sorry for the late reply and thanks for your great help. I'm doing some
> research work on MADlib and will follow your advice to test MADlib again.
> Another question is whether MADlib LR supports tuning the learning_rate.
>
> In MADlib SVM, there is a 'params' argument in 'svm_classification' to
> tune 'init_stepsize' and 'decay_factor', as follows:
>
> svm_classification(
>     source_table,
>     model_table,
>     dependent_varname,
>     independent_varname,
>     kernel_func,
>     kernel_params,
>     grouping_col,
>     params,
>     verbose
> )
>
> However, I did not see this 'params' argument in LR:
>
> logregr_train(
>     source_table,
>     out_table,
>     dependent_varname,
>     independent_varname,
>     grouping_cols,
>     max_iter,
>     optimizer,
>     tolerance,
>     verbose
> )
>
> In addition, I checked the Generalized Linear Models module, and its
> 'optim_params' parameter seems to support tuning only 'tolerance',
> 'max_iter', and 'optimizer'.
> Is there a way to tune 'init_stepsize' and 'decay_factor' in LR? Thanks!
>
> Best,
> Lijie
>
> On Tue, Jul 6, 2021 at 9:04 PM Frank McQuillan <fmcquil...@vmware.com> wrote:
>
>> Hello,
>>
>> Thank you for the questions.
>>
>> (0)
>> Not sure if you are using Postgres just for development or production,
>> but keep in mind that MADlib is designed to run on a distributed MPP
>> database (Greenplum) with large datasets. It runs fine on Postgres, but
>> obviously Postgres won't scale to very large datasets, or it will just
>> be too slow.
>>
>> Also see the Jupyter notebooks here
>> https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Supervised-learning
>> for other examples, in case they are of use.
>>
>> (1)
>> There are 2 problems with your dataset for logistic regression:
>>
>> (i)
>> As per
>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html
>> the dependent variable is a boolean or an expression that evaluates to
>> boolean. Your data has a dependent variable of -1, but Postgres does not
>> evaluate -1 to FALSE, so you should change the -1 to 0, i.e., use 0 for
>> FALSE and 1 for TRUE in Postgres:
>> https://www.postgresql.org/docs/12/datatype-boolean.html
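>>
>> For example, on your 'forest' table either of these should do it (an
>> untested sketch):
>>
>> -- option 1: recode the labels in place so that 0 means FALSE and 1 means TRUE
>> UPDATE forest SET labeli = 0 WHERE labeli = -1;
>>
>> -- option 2: leave the data as is and pass a boolean expression as the
>> -- dependent variable in logregr_train(), e.g. 'labeli = 1' instead of 'labeli'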
>>
>> (ii)
>> An intercept variable is not assumed, so it is common to provide an
>> explicit intercept term by including a single constant 1 term in the
>> independent variable list. See the example here:
>> http://madlib.incubator.apache.org/docs/latest/group__grp__logreg.html#examples
>>
>> That is why the log_likelihood value is too big; that model is not right.
>>
>> (2)
>> If you make the fixes above in (1), it should run OK. Here are my results
>> on PostgreSQL 11.6 using MADlib 1.18.0, on the dataset with 10 tuples:
>>
>> DROP TABLE IF EXISTS epsilon_sample_10v2 CASCADE;
>>
>> CREATE TABLE epsilon_sample_10v2 (
>>     did serial,
>>     vec double precision[],
>>     labeli integer
>> );
>>
>> COPY epsilon_sample_10v2 (vec, labeli) FROM STDIN;
>> {1.0,-0.0108282,-0.0196004,0.0422148,...} 0
>> {1.0,0.00250835,0.0168447,-0.0102934,...} 1
>> etc.
>>
>> SELECT madlib.logregr_train('epsilon_sample_10v2',
>>     'epsilon_sample_10v2_logregr_out', 'labeli', 'vec', NULL, 1, 'irls');
>>
>>  logregr_train
>> ---------------
>>
>> (1 row)
>>
>> Time: 317046.342 ms (05:17.046)
>>
>> madlib=# select log_likelihood from epsilon_sample_10v2_logregr_out;
>>   log_likelihood
>> -------------------
>>  -6.93147180559945
>> (1 row)
>>
>> (3)
>> The dataset is not scanned again at the end of every iteration to compute
>> training loss/accuracy. It should only be scanned once per iteration, for
>> the optimization.
>>
>> (4)
>> I thought the verbose parameter should do that, but it does not seem to
>> be working for me. Will need to look into it more.
>>
>> (5)
>> Logistic regression and SVM do not currently support sparse matrix format:
>> http://madlib.incubator.apache.org/docs/latest/group__grp__svec.html
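>>
>> If your sparse data is stored as an index array plus a value array, one
>> workaround is to densify it in plain SQL before training. A rough sketch,
>> assuming a hypothetical table sparse_data(did, idx int[], vals float8[],
>> labeli int) with 1-based indices and a known dimension of 54 (adjust both
>> to your data):
>>
>> -- sketch: expand (idx, vals) pairs into a dense double precision[] of length 54
>> CREATE TABLE dense_data AS
>> SELECT s.did,
>>        (SELECT array_agg(CASE WHEN i = ANY (s.idx)
>>                               THEN s.vals[array_position(s.idx, i)]
>>                               ELSE 0.0 END
>>                          ORDER BY i)
>>           FROM generate_series(1, 54) AS i) AS vec,  -- zeros where no index is present
>>        s.labeli
>> FROM sparse_data s;
>>
>> The dense 'vec' column can then be passed to logregr_train() or
>> svm_classification() as usual.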
>>
>> Frank
>>
>> ------------------------------
>> *From:* Lijie Xu <csxuli...@gmail.com>
>> *Sent:* Saturday, July 3, 2021 1:21 PM
>> *To:* user@madlib.apache.org <user@madlib.apache.org>
>> *Subject:* Long execution time on MADlib
>>
>> Hi All,
>>
>> I'm Lijie, and I am now performing some experiments on MADlib. I found
>> that MADlib runs very slowly on some datasets, so I would like to verify
>> my settings. Could you help me check the following settings and code?
>> Sorry for this long email. I used the latest MADlib 1.18 on PostgreSQL 12.
>>
>> *(1) Could you help check whether the data format and scripts I used are
>> right for an n-dimensional dataset?*
>>
>> I have some training datasets, and each of them has a dense feature array
>> (like [0.1, 0.2, …, 1.0]) and a class label (+1/-1). For example, for the
>> 'forest' dataset (581K tuples) with a 54-dimensional feature array and a
>> class label, I first stored it in PostgreSQL using:
>>
>> <code>
>> CREATE TABLE forest (
>>     did serial,
>>     vec double precision[],
>>     labeli integer);
>>
>> COPY forest (vec, labeli) FROM STDIN;
>> '[0.1, 0.2, …, 1.0], -1'
>> '[0.3, 0.1, …, 0.9], 1'
>> …
>> </code>
>>
>> Then, to run Logistic Regression on this dataset, I use the following code:
>>
>> <code>
>> mldb=# \d forest
>>                             Table "public.forest"
>>  Column |        Type        |                      Modifiers
>> --------+--------------------+------------------------------------------------------
>>  did    | integer            | not null default nextval('forest_did_seq'::regclass)
>>  vec    | double precision[] |
>>  labeli | integer            |
>>
>> mldb=# SELECT madlib.logregr_train(
>> mldb(#     'forest',               -- source table
>> mldb(#     'forest_logregr_out',   -- output table
>> mldb(#     'labeli',               -- labels
>> mldb(#     'vec',                  -- features
>> mldb(#     NULL,                   -- grouping columns
>> mldb(#     20,                     -- max number of iterations
>> mldb(#     'igd'                   -- optimizer
>> mldb(# );
>>
>> Time: 198911.350 ms
>> </code>
>>
>> After about 199 s, I got the following output table:
>>
>> <code>
>> mldb=# \d forest_logregr_out
>>            Table "public.forest_logregr_out"
>>           Column           |        Type        | Modifiers
>> --------------------------+--------------------+-----------
>>  coef                     | double precision[] |
>>  log_likelihood           | double precision   |
>>  std_err                  | double precision[] |
>>  z_stats                  | double precision[] |
>>  p_values                 | double precision[] |
>>  odds_ratios              | double precision[] |
>>  condition_no             | double precision   |
>>  num_rows_processed       | bigint             |
>>  num_missing_rows_skipped | bigint             |
>>  num_iterations           | integer            |
>>  variance_covariance      | double precision[] |
>>
>> mldb=# select log_likelihood from forest_logregr_out;
>>   log_likelihood
>> ------------------
>>  -426986.83683879
>> (1 row)
>> </code>
>>
>> Is this procedure correct?
>>
>> *(2) Training on a 2,000-dimensional dense dataset (epsilon) is very slow:*
>>
>> While training on a 2,000-dimensional dense dataset (epsilon_sample_10)
>> with only *10 tuples* as follows, MADlib does not finish within 5 hours
>> *for only 1 iteration*. The CPU usage is always 100% during the execution.
>> The dataset is available at
>> https://github.com/JerryLead/Misc/blob/master/MADlib/train.sql
>>
>> <code>
>> mldb=# \d epsilon_sample_10
>>                          Table "public.epsilon_sample_10"
>>  Column |        Type        |                            Modifiers
>> --------+--------------------+-----------------------------------------------------------------
>>  did    | integer            | not null default nextval('epsilon_sample_10_did_seq'::regclass)
>>  vec    | double precision[] |
>>  labeli | integer            |
>>
>> mldb=# SELECT count(*) from epsilon_sample_10;
>>  count
>> -------
>>     10
>> (1 row)
>>
>> Time: 1.456 ms
>>
>> mldb=# SELECT madlib.logregr_train('epsilon_sample_10',
>>     'epsilon_sample_10_logregr_out', 'labeli', 'vec', NULL, 1, 'igd');
>> </code>
>>
>> *In this case, it is not possible to train the whole epsilon dataset
>> (with 400,000 tuples) in a reasonable time. I guess that this problem is
>> related to TOAST, since epsilon has a high dimension and is compressed by
>> TOAST. However, are there any other reasons for such slow execution?*
>>
>> *(3) For MADlib, is the dataset table scanned once or twice in each
>> iteration?*
>>
>> I know that, in each iteration, MADlib needs to scan the dataset table
>> once to perform IGD/SGD on the whole dataset. My question is whether, *at
>> the end of each iteration*, MADlib will scan the table again to compute
>> the training loss/accuracy.
>>
>> *(4) Is it possible to output training metrics, such as training loss
>> and accuracy, after each iteration?*
>>
>> Currently, it seems that MADlib only outputs the log-likelihood at the
>> end of the SQL execution.
>>
>> *(5) Do MADlib's Logistic Regression and SVM support sparse datasets?*
>>
>> I also have some sparse datasets denoted as 'feature_index_vec_array,
>> feature_value_array, label', such as '[1, 3, 5], [0.1, 0.2, 0.3], -1'. Can
>> I train these sparse datasets on MADlib using LR and SVM?
>>
>> Many thanks for reviewing my questions.
>>
>> Best regards,
>>
>> Lijie