[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117907#comment-16117907 ]

Yi Jin commented on HAWQ-1498:
------------------------------

The idea behind the fix is to explicitly release the connections that are used to drop objects in HDFS. This logic is triggered when the current transaction ends, whether it commits or rolls back. Cached connections that are used only for reading or updating are not flushed. I have a rough fix working in my environment, and I will propose a pull request for review. Thanks.

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Yi Jin
>             Fix For: 2.3.0.0-incubating
>
> I have been running some large computations in HAWQ using psql on the master.
> These computations created temporary tables and dropped them again.
> Nevertheless free disk space in HDFS decreased by much more than it should.
> While the psql session on the master was still open I investigated on one of
> the slave machines.
> HDFS is stored on /mds:
> {noformat}
> [root@mds-hdp-04 ~]# ls -l /mds
> total 36
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:23 falcon
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:42 hdfs
> drwx------. 2 root      root   16384 Jun  8 02:48 lost+found
> drwxr-xr-x. 5 storm     hadoop  4096 Jun 14 04:45 storm
> drwxr-xr-x. 4 root      root    4096 Jun 14 04:43 yarn
> drwxr-xr-x. 2 zookeeper hadoop  4096 Jun 14 04:39 zookeeper
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks      Used Available Use% Mounted on
> /dev/vdc       515928320 314560220 175137316  65% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> {noformat}
> Note that there is a difference of more than 200 GB between the disk space used
> according to df and the sum of all files on that file system according to du.
> I have found the culprit to be several postgres processes running as gpadmin
> and holding open file descriptors to deleted files. Here are the first few:
> {noformat}
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> postgres 665334 gpadmin 18r REG 253,32 134217728 0 9438234 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482 (deleted)
> postgres 665334 gpadmin 34r REG 253,32     24488 0 9438114 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398 (deleted)
> postgres 665334 gpadmin 35r REG 253,32       199 0 9438115 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398_187044.meta (deleted)
> postgres 665334 gpadmin 37r REG 253,32 134217728 0 9438208 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446 (deleted)
> postgres 665334 gpadmin 38r REG 253,32   1048583 0 9438209 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446_187092.meta (deleted)
> postgres 665334 gpadmin 39r REG 253,32   1048583 0 9438235 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482_187128.meta (deleted)
> postgres 665334 gpadmin 40r REG 253,32 134217728 0 9438262 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555 (deleted)
> postgres 665334 gpadmin 41r REG 253,32   1048583 0 9438263 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555_187201.meta (deleted)
> postgres 665334 gpadmin 42r REG 253,32 134217728 0 9438285 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602 (deleted)
> postgres 665334 gpadmin 43r REG 253,32   1048583 0 9438286 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602_187248.meta (deleted)
> {noformat}
> As soon as I close the psql session on the master the disk space is freed on the
> slaves:
> {noformat}
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/vdc       515928320 89992720 399704816  19% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> {noformat}
> I believe this to be a bug. At least to me it looks like very undesirable
> behavior.

--
This message was sent by Atlassian
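The fix idea in the comment above (release only the cached connections that were used to drop objects, and do so at the end of every transaction) can be sketched as follows. This is a minimal illustration in Python and is not taken from the HAWQ source: the names `ConnectionCache`, `FakeConnection`, `get`, `for_drop` and `end_transaction` are all hypothetical.

```python
class FakeConnection:
    """Stand-in for a cached file system connection (illustration only)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ConnectionCache:
    """Hypothetical per-session connection cache, keyed by an arbitrary name."""

    def __init__(self):
        self._conns = {}             # name -> object with a close() method
        self._used_for_drop = set()  # names of connections used to drop objects

    def get(self, name, factory, for_drop=False):
        """Return the cached connection for `name`, creating it on first use."""
        conn = self._conns.get(name)
        if conn is None:
            conn = self._conns[name] = factory()
        if for_drop:
            self._used_for_drop.add(name)
        return conn

    def end_transaction(self):
        """Run on commit and on rollback alike.

        Closes and evicts only the connections that were used for drops;
        connections used for reading or updating stay cached.
        """
        released = []
        for name in sorted(self._used_for_drop):
            conn = self._conns.pop(name, None)
            if conn is not None:
                conn.close()
                released.append(name)
        self._used_for_drop.clear()
        return released


cache = ConnectionCache()
reader = cache.get("read-path", FakeConnection)
dropper = cache.get("drop-path", FakeConnection, for_drop=True)
released = cache.end_transaction()  # same call for commit and rollback
print(released, dropper.closed, reader.closed)
```

Tracking the "used for drop" flag separately keeps the cache useful for read-heavy sessions while bounding how long a descriptor to a dropped object can stay open to one transaction.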
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115801#comment-16115801 ]

Yi Jin commented on HAWQ-1498:
------------------------------

I will fix this issue soon, in version 2.3 incubating.

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Lin Wen
>             Fix For: 2.3.0.0-incubating

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078188#comment-16078188 ]

Harald Bögeholz commented on HAWQ-1498:
---------------------------------------

The query ran successfully. I just copied an arbitrary table that was lying around, like this:
{noformat}
graphs=# create table junk as select * from bara10m;
SELECT 1997
{noformat}
You should be able to reproduce it by doing the same with any table of reasonable size.

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
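When reproducing this, it helps to quantify how much space each process is pinning. The sketch below sums the SIZE/OFF column of `lsof +L1` output per PID for deleted files under a path prefix. It assumes the usual whitespace-separated column layout shown in the issue (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME); the function name `held_bytes_by_pid` and its parameters are made up for this illustration, and the two sample lines are taken from the issue description.

```python
from collections import defaultdict


def held_bytes_by_pid(lsof_lines, path_prefix="/mds/hdfs"):
    """Sum the sizes of unlinked (NLINK == 0) files under path_prefix, per PID.

    A file held through several descriptors is counted once per descriptor,
    so the totals are an upper bound on the space actually pinned.
    """
    totals = defaultdict(int)
    for line in lsof_lines:
        # Split into at most 10 fields so NAME keeps its " (deleted)" suffix.
        fields = line.split(None, 9)
        if len(fields) < 10 or fields[0] == "COMMAND":
            continue  # skip the header and malformed lines
        pid, size, nlink, name = fields[1], fields[6], fields[7], fields[9]
        if nlink != "0" or not name.startswith(path_prefix):
            continue
        if not (pid.isdigit() and size.isdigit()):
            continue  # e.g. offsets printed as 0t0 instead of a size
        totals[int(pid)] += int(size)
    return dict(totals)


sample = [
    "COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME",
    "postgres 665334 gpadmin 18r REG 253,32 134217728 0 9438234 "
    "/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/"
    "current/finalized/subdir2/subdir193/blk_1073922482 (deleted)",
    "postgres 665334 gpadmin 34r REG 253,32 24488 0 9438114 "
    "/mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/"
    "current/finalized/subdir2/subdir193/blk_1073922398 (deleted)",
]
print(held_bytes_by_pid(sample))
```

For an exact figure of pinned disk space across processes, one would additionally deduplicate on the NODE (inode) column, since several backends can hold the same block file open.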
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078174#comment-16078174 ]

Lin Wen commented on HAWQ-1498:
-------------------------------

Yes, I agree with you. Did your query run successfully or not? How many rows are in table junk? And what is the rough size of one row? I may try to reproduce it in my environment.

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078138#comment-16078138 ]

Harald Bögeholz commented on HAWQ-1498:
---------------------------------------

I don't know too much about the inner workings of HAWQ, but I have one more thought: even if QEs don't exit, why do they hold on to a file descriptor for a deleted file? If they closed that descriptor, I wouldn't mind them hanging around for some time ...

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
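The df/du discrepancy in this issue follows from ordinary POSIX file semantics: removing a file only removes its name, and the blocks are reclaimed once the last open descriptor is closed. The short Python sketch below demonstrates this on a temporary file. It assumes a POSIX system such as Linux; the function name `unlinked_file_stats` is invented for the example and has nothing to do with HAWQ itself.

```python
import os
import tempfile


def unlinked_file_stats(size_bytes=1024 * 1024):
    """Create a file, delete it while it is still open, and return
    (link count, size) as seen through the still-open descriptor."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size_bytes)
        f.flush()
        os.unlink(path)             # the name is gone, so du no longer counts the file
        st = os.fstat(f.fileno())   # the inode is still reachable via the descriptor
        return st.st_nlink, st.st_size
    # Leaving the with-block closes the descriptor; only then can the
    # file system free the blocks, which is when df catches up with du.


nlink, size = unlinked_file_stats()
print(nlink, size)
```

A link count of zero with a non-zero size is exactly what `lsof +L1` reports for the postgres processes in the issue description.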
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077649#comment-16077649 ]

Harald Bögeholz commented on HAWQ-1498:
---------------------------------------

I had to rerun the experiment, so the process IDs are different. The relevant lines from lsof on the segment node are now
{noformat}
postgres 99948 gpadmin 36r REG 253,32    212272 0 9438395 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201 (deleted)
postgres 99948 gpadmin 37r REG 253,32      1667 0 9438396 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta (deleted)
postgres 99976 gpadmin 15r REG 253,32 121425488 0 9437196 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199 (deleted)
postgres 99976 gpadmin 16r REG 253,32    948647 0 9438394 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949199_214317.meta (deleted)
postgres 99976 gpadmin 18r REG 253,32    212272 0 9438395 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201 (deleted)
postgres 99976 gpadmin 19r REG 253,32      1667 0 9438396 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir42/blk_1073949201_214319.meta (deleted)
{noformat}
The relevant lines from ps -ef are
{noformat}
gpadmin 99948 761943 0 05:40 ? 00:00:03 postgres: port 4, gpadmin graphs 118.138.237.116(12911) con3231 seg1 idle
gpadmin 99976 761943 0 05:40 ? 00:00:01 postgres: port 4, gpadmin graphs 118.138.237.116(12929) con3231 seg1 idle
{noformat}
pstack says:
{noformat}
[root@mds-hdp-04 ~]# pstack 99948
Thread 2 (Thread 0x7f8c059b6700 (LWP 99949)):
#0 0x7f8c0189ee2d in poll () from /lib64/libc.so.6
#1 0x009766c0 in rxThreadFunc (arg=) at ic_udp.c:6278
#2 0x7f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x7f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99948)):
#0 0x7f8c0233f82b in recv () from /lib64/libpthread.so.0
#1 0x006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620 , len=8192) at be-secure.c:307
#2 0x006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3 0x006c06b5 in pq_getbyte () at pqcomm.c:929
#4 0x007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5 ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6 0x007e9546 in PostgresMain (argc=, argv=, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7 0x0079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8 BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9 ServerLoop () at postmaster.c:2163
#10 0x0079fd49 in PostmasterMain (argc=9, argv=) at postmaster.c:1454
#11 0x004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}
and
{noformat}
[root@mds-hdp-04 ~]# pstack 99976
Thread 2 (Thread 0x7f8c059b6700 (LWP 99979)):
#0 0x7f8c0189ee2d in poll () from /lib64/libc.so.6
#1 0x009766c0 in rxThreadFunc (arg=) at ic_udp.c:6278
#2 0x7f8c02338dc5 in start_thread () from /lib64/libpthread.so.0
#3 0x7f8c018a976d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8c05b1ea00 (LWP 99976)):
#0 0x7f8c0233f82b in recv () from /lib64/libpthread.so.0
#1 0x006b59dc in secure_read (port=0x2cbf910, ptr=0xf44620 , len=8192) at be-secure.c:307
#2 0x006bf9c6 in pq_recvbuf () at pqcomm.c:824
#3 0x006c06b5 in pq_getbyte () at pqcomm.c:929
#4 0x007e53e5 in SocketBackend (inBuf=0x7ffc8298b360) at postgres.c:434
#5 ReadCommand (inBuf=inBuf@entry=0x7ffc8298b360) at postgres.c:592
#6 0x007e9546 in PostgresMain (argc=, argv=, argv@entry=0x2d0a120, username=0x2d00fb0 "gpadmin") at postgres.c:4816
#7 0x0079cf68 in BackendRun (port=0x2cbf910) at postmaster.c:5915
#8 BackendStartup (port=0x2cbf910) at postmaster.c:5484
#9 ServerLoop () at postmaster.c:2163
#10 0x0079fd49 in PostmasterMain (argc=9, argv=) at postmaster.c:1454
#11 0x004a4b69 in main (argc=9, argv=0x2cba570) at main.c:226
{noformat}
Hope this helps

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077618#comment-16077618 ]

Lin Wen commented on HAWQ-1498:
-------------------------------

Thanks for the information! These processes are running on the segment node, right? Would you please run "ps -ef" for these processes, so that we can see in detail which processes hold these files? Please also run "pstack" on one of them, to check where the process currently is.

postgres 76698 gpadmin 15r REG 253,32 121425488 0 9437196 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949177 (deleted)
postgres 76698 gpadmin 16r REG 253,32    948647 0 9438394 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949177_214295.meta (deleted)
postgres 76698 gpadmin 18r REG 253,32    212024 0 9438395 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181 (deleted)
postgres 76698 gpadmin 19r REG 253,32      1667 0 9438396 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181_214299.meta (deleted)

> Segments keep open file descriptors for deleted files
> -----------------------------------------------------
>
>                 Key: HAWQ-1498
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1498
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Harald Bögeholz
>            Assignee: Radar Lei
>             Fix For: 2.2.0.0-incubating
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16077442#comment-16077442 ] Harald Bögeholz commented on HAWQ-1498: --- Here are concrete steps for reproducing the problem. Fist I check the state of things on the slave node mds-hdp-04: {noformat} [root@mds-hdp-04 ~]# df /mds Filesystem 1K-blocks Used Available Use% Mounted on /dev/vdc 515928320 98522628 391174908 21% /mds [root@mds-hdp-04 ~]# du -s /mds 98448860/mds [root@mds-hdp-04 ~]# lsof L1 > lsof-before {noformat} (note that there is a plus character in the lsof command that Jira is eating.) Then on the master mds-hdp-03 I start psql and create a table: {noformat} [gpadmin@mds-hdp-03 ~]$ psql -d graphs psql (8.2.15) Type "help" for help. graphs=# create table junk as select * from bara10m; SELECT 1997 {noformat} On the slave node I see that the table is taking ~120 MB of storage: {noformat} [root@mds-hdp-04 ~]# df /mds Filesystem 1K-blocks Used Available Use% Mounted on /dev/vdc 515928320 98642512 391055024 21% /mds [root@mds-hdp-04 ~]# du -s /mds 98568532/mds {noformat} I then drop the table: {noformat} graphs=# drop table junk; DROP TABLE {noformat} On the slave node I check disk space and open files: {noformat} [root@mds-hdp-04 ~]# df /mds Filesystem 1K-blocks Used Available Use% Mounted on /dev/vdc 515928320 98642512 391055024 21% /mds [root@mds-hdp-04 ~]# du -s /mds 98449024/mds [root@mds-hdp-04 ~]# lsof L1 >lsof-dropped {noformat} Note that the output of du is roughly back to where it was, but the amount of used space df indicates stays the same. Will show the lengthy output of lsof below. 
Next, on the master, close the connection to the database:
{noformat}
graphs=# \q
[gpadmin@mds-hdp-03 ~]$
{noformat}
Check things on the slave:
{noformat}
[root@mds-hdp-04 ~]# df /mds
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/vdc       515928320 98522664 391174872  21% /mds
[root@mds-hdp-04 ~]# du -s /mds
98448896 /mds
[root@mds-hdp-04 ~]# lsof L1 >lsof-closed
{noformat}
Disk space is freed, everything is (up to a few kilobytes) as it was before. The output of lsof before (lsof-before) and after the experiment (lsof-closed) is identical, as shown by diff.

Here's lsof-dropped, the complete output of lsof after the table was dropped but before the connection to the database was closed:
{noformat}
COMMAND     PID    USER   FD TYPE DEVICE  SIZE/OFF NLINK     NODE NAME
tuned       827    root   7u  REG  253,1      4096     0   136067 /tmp/ffijUJke5 (deleted)
bash      28331    yarn  cwd  DIR 253,32         0     0 25167570 /mds/yarn/local/usercache/gpadmin/appcache/application_1498453636069_0005/container_e03_1498453636069_0005_01_000949 (deleted)
bash      28331    yarn 255r  REG 253,32       710     0 25167857 /mds/yarn/local/usercache/gpadmin/appcache/application_1498453636069_0005/container_e03_1498453636069_0005_01_000949/default_container_executor.sh (deleted)
sleep     28334    yarn  cwd  DIR 253,32         0     0 25167570 /mds/yarn/local/usercache/gpadmin/appcache/application_1498453636069_0005/container_e03_1498453636069_0005_01_000949 (deleted)
postgres  76659 gpadmin  36r  REG 253,32    212024     0  9438395 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181 (deleted)
postgres  76659 gpadmin  37r  REG 253,32      1667     0  9438396 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181_214299.meta (deleted)
postgres  76698 gpadmin  15r  REG 253,32 121425488     0  9437196 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949177 (deleted)
postgres  76698 gpadmin  16r  REG 253,32    948647     0  9438394 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949177_214295.meta (deleted)
postgres  76698 gpadmin  18r  REG 253,32    212024     0  9438395 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181 (deleted)
postgres  76698 gpadmin  19r  REG 253,32      1667     0  9438396 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir3/subdir41/blk_1073949181_214299.meta (deleted)
bash     175610    yarn  cwd  DIR 253,32         0     0 25168110 /mds/yarn/local/usercache/gpadmin/appcache/application_1498453636069_0006/container_e03_1498453636069_0006_01_001196 (deleted)
bash     175610    yarn 255r  REG 253,32       710     0 25168161 /mds/yarn/local/usercache/gpadmin/appcache/application_1498453636069_0006/container_e03_1498453636069_0006_01_001196/default_container_executor.sh (deleted)
sleep    175612    yarn  cwd  DIR 253,32
[jira] [Commented] (HAWQ-1498) Segments keep open file descriptors for deleted files
[ https://issues.apache.org/jira/browse/HAWQ-1498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076231#comment-16076231 ]

Lin Wen commented on HAWQ-1498:
---

Hi Harald,

Thank you for reporting this! Could you provide more information? For example, list all the postgres processes on the segment while the disk space is not freed, or give concrete steps to reproduce the problem.

I am wondering whether the query executed successfully. Once a query has finished, the idle QEs on the segment should exit after a period of time (which can be controlled by a GUC property). If the QEs on the segment have exited, is the disk space still not freed?

> Segments keep open file descriptors for deleted files
> -
>
> Key: HAWQ-1498
> URL: https://issues.apache.org/jira/browse/HAWQ-1498
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Harald Bögeholz
> Assignee: Radar Lei
> Fix For: 2.2.0.0-incubating
>
>
> I have been running some large computations in HAWQ using psql on the master.
> These computations created temporary tables and dropped them again.
> Nevertheless free disk space in HDFS decreased by much more than it should.
> While the psql session on the master was still open I investigated on one of
> the slave machines.
> HDFS is stored on /mds:
> {noformat}
> [root@mds-hdp-04 ~]# ls -l /mds
> total 36
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:23 falcon
> drwxr-xr-x. 3 root      root    4096 Jun 14 04:42 hdfs
> drwx------. 2 root      root   16384 Jun  8 02:48 lost+found
> drwxr-xr-x. 5 storm     hadoop  4096 Jun 14 04:45 storm
> drwxr-xr-x. 4 root      root    4096 Jun 14 04:43 yarn
> drwxr-xr-x. 2 zookeeper hadoop  4096 Jun 14 04:39 zookeeper
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks      Used Available Use% Mounted on
> /dev/vdc       515928320 314560220 175137316  65% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> {noformat}
> Note that there is a more than 200 GB difference between the disk space used
> according to df and the sum of all files on that file system according to du.
> I have found the culprit to be several postgres processes running as gpadmin
> and holding open file descriptors to deleted files. Here are the first few:
> {noformat}
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> postgres 665334 gpadmin 18r REG 253,32 134217728 0 9438234 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482 (deleted)
> postgres 665334 gpadmin 34r REG 253,32 24488 0 9438114 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398 (deleted)
> postgres 665334 gpadmin 35r REG 253,32 199 0 9438115 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922398_187044.meta (deleted)
> postgres 665334 gpadmin 37r REG 253,32 134217728 0 9438208 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446 (deleted)
> postgres 665334 gpadmin 38r REG 253,32 1048583 0 9438209 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922446_187092.meta (deleted)
> postgres 665334 gpadmin 39r REG 253,32 1048583 0 9438235 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922482_187128.meta (deleted)
> postgres 665334 gpadmin 40r REG 253,32 134217728 0 9438262 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555 (deleted)
> postgres 665334 gpadmin 41r REG 253,32 1048583 0 9438263 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir193/blk_1073922555_187201.meta (deleted)
> postgres 665334 gpadmin 42r REG 253,32 134217728 0 9438285 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602 (deleted)
> postgres 665334 gpadmin 43r REG 253,32 1048583 0 9438286 /mds/hdfs/data/current/BP-23056860-118.138.237.114-1497415333069/current/finalized/subdir2/subdir194/blk_1073922602_187248.meta (deleted)
> {noformat}
> As soon as I close the psql session on the master the disk space is freed on the
> slaves:
> {noformat}
> [root@mds-hdp-04 ~]# df /mds
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/vdc       515928320 89992720 399704816  19% /mds
> [root@mds-hdp-04 ~]# du -s /mds
> 89918952 /mds
> [root@mds-hdp-04 ~]# lsof +L1 | grep /mds/hdfs | head -10
> {noformat}
> I believe this to be a bug. At least for me it looks like a very