Hi, I am currently getting an error from the MADlib Random Forest function in MADlib 1.8.0. Below is the code I tried.
DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
SELECT madlib.forest_train('test_rf_data', -- input table name
'rf_output', -- output table name
'id', -- id column
'duration', -- dependent variable
'*', -- list of features
NULL, -- exclude columns
'linkid', -- grouping column
2::integer, -- # of trees
5::integer, -- # of random features
TRUE::boolean, -- importance
1, -- # of permutations
5, -- max_tree_depth
10, -- min_split
3, -- min_bucket
10 -- number of splits per continuous variable
);
When I ran this over all linkids (the grouping column has 362 distinct
linkids), I got the error shown in the attached "error_random_forest.txt".
The error message says there is an invalid array length, but it gives no
details about which features in the data cause the issue.
ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
I guessed this error concerns the bigint columns, but the only bigint
column is the "id" column. I once had an error caused by some features having
identical values in all records, but that is not the case this time because I
set the sample size for each linkid to 1000 or above.
The "0 given" in the DETAIL suggests that something is zero, but I have no
idea what in the data it refers to.
The schema of the input table is as follows:
CREATE TABLE input_table (
id bigint,
linkid varchar(32),
duration double precision,
sat_flg int,
sun_flg int,
holiday_flg int,
semi_holiday_flg int,
renkyu_flg int,
ave_temp numeric,
ave_wind numeric,
precip numeric,
radiation numeric,
ave_speed numeric,
travel_time numeric
);
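For reference, the per-linkid conditions mentioned above (row counts, and features with identical values in all records) can be checked with a query along the following lines. This is only a sketch: it assumes that test_rf_data has the schema shown above, the output column names are made up for illustration, and it is not known whether any of these conditions is related to the error.

```sql
-- Per-group sanity check (a sketch; assumes test_rf_data has the schema above).
-- For each linkid, report the row count, the number of NULL responses,
-- and whether each feature is constant within the group (min = max).
SELECT linkid,
       count(*)                                      AS n_rows,
       count(*) - count(duration)                    AS n_null_duration,
       min(sat_flg)          = max(sat_flg)          AS sat_flg_const,
       min(sun_flg)          = max(sun_flg)          AS sun_flg_const,
       min(holiday_flg)      = max(holiday_flg)      AS holiday_flg_const,
       min(semi_holiday_flg) = max(semi_holiday_flg) AS semi_holiday_flg_const,
       min(renkyu_flg)       = max(renkyu_flg)       AS renkyu_flg_const,
       min(ave_temp)         = max(ave_temp)         AS ave_temp_const,
       min(ave_wind)         = max(ave_wind)         AS ave_wind_const,
       min(precip)           = max(precip)           AS precip_const,
       min(radiation)        = max(radiation)        AS radiation_const,
       min(ave_speed)        = max(ave_speed)        AS ave_speed_const,
       min(travel_time)      = max(travel_time)      AS travel_time_const
FROM test_rf_data
GROUP BY linkid
ORDER BY n_rows;
```

A *_const column that is true marks a feature with a single value within that linkid; a NULL there means the feature is NULL in every row of the group.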
Can anybody please let me know what the possible cause of this error is? The
MADlib linear regression worked without any problems.
I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
Thank you,
Tetsuo
