thanks, I added this info to the jira

On Fri, Jul 27, 2018 at 7:23 AM, LUYAO CHEN <luyao_c...@hotmail.com> wrote:
> A similar problem happened in decision tree (with the same set of
> data).
>
> I got this error (dmesg):
> "[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"
>
> Regards,
> Luyao Chen
>
> ------------------------------
> *From:* Frank McQuillan <fmcquil...@pivotal.io>
> *Sent:* Tuesday, July 24, 2018 2:13 PM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Thank you, we created a JIRA to investigate this
> https://issues.apache.org/jira/browse/MADLIB-1257
>
> On Tue, Jul 24, 2018 at 10:31 AM, LUYAO CHEN <luyao_c...@hotmail.com>
> wrote:
>
> Another observation - It crashed with 84 groups and 73K instances. In this
> scenario, I should have plenty of memory and disk.
>
> Also, it seems that as the number of groups increases, it uses a lot of
> temporary disk space once the data exceeds a certain number of groups.
>
> Regards,
>
> ------------------------------
> *From:* LUYAO CHEN <luyao_c...@hotmail.com>
> *Sent:* Tuesday, July 24, 2018 9:15 AM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Hi Frank,
>
> You may refer to the enclosed dump data for the training table, and I used
> the below SQL for random forest.
>
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('train_data',    -- source table
>                            'train_output',  -- output model table
>                            'rowid',         -- id column
>                            'positive',      -- response
>                            'features',      -- features
>                            NULL,            -- exclude columns
>                            'caseid',        -- grouping columns
>                            30::integer,     -- number of trees
>                            30::integer,     -- number of random features
>                            TRUE::boolean,   -- variable importance
>                            1::integer,      -- num_permutations
>                            10::integer,     -- max depth
>                            3::integer,      -- min split
>                            1::integer,      -- min bucket
>                            10::integer,     -- number of splits per continuous variable
>                            NULL,            -- null handling parameter
>                            TRUE             -- verbose
>                            );
>
> Regards,
> Luyao Chen
>
> ------------------------------
> *From:* Frank McQuillan <fmcquil...@pivotal.io>
> *Sent:* Monday, July 23, 2018 4:59 PM
> *To:* user@madlib.apache.org
> *Subject:* Re: PostgreSQL crashed during random forest training
>
> Hi Luyao Chen
>
> It's hard to debug just looking at that trace.
>
> 1) If you increase your data size to more than 56K instances in 56
> groups, does it work? e.g., double it to approx 112K instances and 112
> groups.
>
> 2) Is it possible for you to share a sample of your data so that we
> could try? If not, perhaps anonymize a sample of the data so that we can
> multiply it out to make it bigger? Then we could take a closer look.
>
> Frank
>
> On Mon, Jul 23, 2018 at 12:34 PM, LUYAO CHEN <luyao_c...@hotmail.com>
> wrote:
>
> Dear user group,
>
> I got a problem when training the grouped data with random forest (300
> features). Small data was fine (e.g., 56K instances in 56 groups), but it
> failed for 240K instances in 250 groups.
> Postgres was forced to disconnect the session after showing the below
> message in verbose mode:
>
> NOTICE: view "__madlib_temp_60124179_1532371657_7130296__" will be a temporary view
> NOTICE: sql_create_empty_result_table:
>
> CREATE TABLE analysis.dx_rf_train_output_1 (
>     gid integer,
>     sample_id integer,
>     tree madlib.bytea8);
>
> NOTICE: sql_refresh_training_pois_cnt:
>
> TRUNCATE TABLE
>     __madlib_temp_91155016_1532371657_5660955__
> CASCADE;
> INSERT INTO __madlib_temp_91155016_1532371657_5660955__
> SELECT
>     *,
>     madlib.poisson_random(1) AS poisson_count
> FROM
> (
>     SELECT
>         *,
>         0.::double precision AS __madlib_temp_14328459_1532371657_7318497__
>     FROM analysis.dxpredict_svec
> ) subq
> WHERE __madlib_temp_14328459_1532371657_7318497__ < 1
>
> NOTICE:
> src_cnt: 158360,
> oob_cnt: 92418,
> dup_cnt: 250617.
>
> NOTICE: Started tree building for all groups
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> PostgreSQL did not capture a detailed log even though I increased
> log_statement to "all":
>
> 2018-07-23 14:47:50.229 EDT [1090] LOG: server process (PID 1980) was terminated by signal 11: Segmentation fault
> 2018-07-23 14:47:50.229 EDT [1090] DETAIL: Failed process was running:
> SELECT madlib.forest_train('analysis.dxpredict_svec',
>                            'analysis.dx_rf_train_output_1',
>                            'rowid',
>                            'positive',
>                            '*',
>                            'rowid,positive,case_icd',
>                            'case_icd',
>                            30::integer,
>                            30::integer,
>                            TRUE::boolean,
>                            1::integer,
>                            10::integer,
>                            3::integer,
>                            1::integer,
>                            10::integer,
>                            NULL,
>                            TRUE
>                            );
> 2018-07-23 14:47:50.229 EDT [1090] LOG: terminating any other active server processes
> 2018-07-23 14:47:50.229 EDT [1401] WARNING: terminating connection because of crash of another server process