HBase at high ingest rates
First, I want to thank everyone on this list who responded to my questions, and all the HBase developers as well.

After trying both hadoop-0.20-append and Cloudera's latest, along with various settings, we are still leaking both file and socket handles at high ingest rates. The handle usage increases slowly but linearly until a process or OS limit is reached. The larger problem is that HBase does not recover from that situation: once the file handle count had been driven high (though still below the quota), slackening the ingest significantly did not bring it back down; only stopping ingest completely and restarting it did.

By high ingest rate I mean about 1200 records/sec per node, 2K per record, 50 records per batch put.

I truly wish I had the time to dig in to find the answers, or at least indications, but my time is not my own. We will keep a close eye on HBase, and will likely revisit it before long.

Thanks again

Thomas Downing
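[Editor's note: for context, a minimal sketch of the kind of batched ingest described above, written against the HBase 0.20-era Java client API. The table name, column family, and row key scheme are hypothetical, not taken from the thread.]

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchIngestSketch {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "events"); // hypothetical table name
            table.setAutoFlush(false);                 // buffer puts client-side
            table.setWriteBufferSize(2 * 1024 * 1024); // flush roughly every 2 MB

            byte[] family = Bytes.toBytes("d");        // hypothetical column family
            byte[] qualifier = Bytes.toBytes("payload");
            byte[] payload = new byte[2048];           // ~2K per record, as in the test

            List<Put> batch = new ArrayList<Put>(50);  // 50 records per batch put
            for (long seq = 0; seq < 1200; seq++) {    // roughly one second's worth at 1200 records/sec
                Put put = new Put(Bytes.toBytes(System.currentTimeMillis() + "-" + seq));
                put.add(family, qualifier, payload);
                batch.add(put);
                if (batch.size() == 50) {
                    table.put(batch);                  // one batch put of 50 rows
                    batch.clear();
                }
            }
            table.flushCommits();                      // push any buffered edits to the region server
        }
    }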
How to delete rows in a FIFO manner
Hi,

Continuing with testing HBase suitability in a high ingest rate environment, I've come up against a new stumbling block, likely due to my inexperience with HBase.

We want to keep and purge records on a time basis: i.e., when a record is older than, say, 24 hours, we want to purge it from the database. The problem I am encountering is that the only way I've found to delete records keyed by a row id that is arbitrary but strongly ordered over time is to scan for rows from lower bound to upper bound, then build an array of Delete objects: for each Result in the ResultScanner, add new Delete( Result.getRow( ) ) to the Delete array. This method is far too slow to keep up with our ingest rate; the iteration over the Results in the ResultScanner is the bottleneck, even though the Scan is limited to a single small column in the column family.

The obvious but naive solution is to use a sequential row id where the lower and upper bounds can be known. This would allow building the array of Delete objects without a scan step. The problem with this approach is how to guarantee a sequential and non-colliding row id across more than one Put'ing process, and do it efficiently. As it happens, I can do this, but given the details of my operational requirements, it's not a simple thing to do.

So I was hoping that I had just missed something. The ideal would be a Delete object that would take row id bounds in the same way that Scan does, allowing the work to be done entirely on the server side. Does this exist somewhere? Or is there some other way to skin this cat?

Thanks

Thomas Downing
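[Editor's note: for reference, a minimal sketch of the scan-then-delete approach described above, assuming the 0.20-era client API; the table name, column names, and row key bounds are made up for illustration. The per-row scanner iteration is the bottleneck mentioned in the message.]

    import java.util.ArrayList;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PurgeByScanSketch {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "events"); // hypothetical table name

            // Hypothetical time-ordered row key bounds covering the expired window.
            byte[] startRow = Bytes.toBytes("20100716-000000");
            byte[] stopRow  = Bytes.toBytes("20100716-010000");

            // Scan a single small column to keep the scan as light as possible.
            Scan scan = new Scan(startRow, stopRow);
            scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("k"));

            ArrayList<Delete> deletes = new ArrayList<Delete>();
            ResultScanner scanner = table.getScanner(scan);
            try {
                // This per-row iteration is the part that cannot keep up with the ingest rate.
                for (Result result : scanner) {
                    deletes.add(new Delete(result.getRow()));
                }
            } finally {
                scanner.close();
            }

            table.delete(deletes); // batched delete of every row found in the window
        }
    }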
Re: How to delete rows in a FIFO manner
Our problem does not require significant map/reduce ops, and queries tend to be for sequential rows with the timeframe being the primary consideration. So time-bounded tables are not a big hurdle, as they might be if other columns were primary keys or considerations for query or map/reduce ops.

TTL timestamp - that may be just the magic I was looking for... thanks, I'll look at that.

td

On 8/6/2010 11:59 AM, Venkatesh wrote:
I wrestled with the idea of time-bounded tables. Would it make it harder to write code / run map reduce on multiple tables? Also, how do you decide when to do the cut over (start of a new day, week/month..), and if you do, how do you process data that crosses those time boundaries efficiently? Guess that is not your requirement. If it is a fixed-time cut over, isn't it enough to set the TTL timestamp? Interesting thread.. thanks

-----Original Message-----
From: Thomas Downing <tdown...@proteus-technologies.com>
To: user@hbase.apache.org
Sent: Fri, Aug 6, 2010 11:39 am
Subject: Re: How to delete rows in a FIFO manner

Thanks for the suggestions. The problem isn't generating the Delete objects, or the delete operation itself - both are fast enough. The problem is generating the list of row keys from which the Delete objects are created.

For now, the obvious work-around is to create and drop tables on the fly, using HBaseAdmin, with the tables being time-bounded. When the high end of a table passes the expiry time, just drop the table. When a table is written with the first record greater than the low bound, create a new table for the next time interval.

As I am having other problems related to high ingest rates, the fact may be that I am just using the wrong tool for the job.

Thanks

td

On 8/6/2010 10:24 AM, Jean-Daniel Cryans wrote:
If the inserts are coming from more than 1 client, and you are trying to delete from only 1 client, then likely it won't work. You could try using a pool of deleters (multiple threads that delete rows) that you feed from the scanner. Or you could run a MapReduce job that would parallelize that for you: one that takes your table as an input and that outputs Delete objects.

J-D

On Fri, Aug 6, 2010 at 5:50 AM, Thomas Downing <tdown...@proteus-technologies.com> wrote:
Hi,

Continuing with testing HBase suitability in a high ingest rate environment, I've come up against a new stumbling block, likely due to my inexperience with HBase.

We want to keep and purge records on a time basis: i.e., when a record is older than, say, 24 hours, we want to purge it from the database. The problem I am encountering is that the only way I've found to delete records keyed by a row id that is arbitrary but strongly ordered over time is to scan for rows from lower bound to upper bound, then build an array of Delete objects: for each Result in the ResultScanner, add new Delete( Result.getRow( ) ) to the Delete array. This method is far too slow to keep up with our ingest rate; the iteration over the Results in the ResultScanner is the bottleneck, even though the Scan is limited to a single small column in the column family.

The obvious but naive solution is to use a sequential row id where the lower and upper bounds can be known. This would allow building the array of Delete objects without a scan step. The problem with this approach is how to guarantee a sequential and non-colliding row id across more than one Put'ing process, and do it efficiently. As it happens, I can do this, but given the details of my operational requirements, it's not a simple thing to do.
So I was hoping that I had just missed something. The ideal would be a Delete object that would take row id bounds in the same way that Scan does, allowing the work to be done entirely on the server side. Does this exist somewhere? Or is there some other way to skin this cat?

Thanks

Thomas Downing
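[Editor's note: as a follow-up to the TTL suggestion in the thread above, a minimal sketch, assuming the 0.20-era admin API and hypothetical table and column family names, of creating a table whose column family expires cells after 24 hours, so that purging is handled server-side rather than by client scans and deletes.]

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TtlTableSketch {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Hypothetical table and family names; the TTL is given in seconds.
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setTimeToLive(24 * 60 * 60); // cells older than 24 hours expire

            HTableDescriptor table = new HTableDescriptor("events");
            table.addFamily(family);
            admin.createTable(table);

            // The alternative discussed above - rolling time-bounded tables - would
            // instead drop a whole table once its upper bound passes the expiry window:
            // admin.disableTable("events-20100716");
            // admin.deleteTable("events-20100716");
        }
    }

As far as I understand, cells past their TTL stop being returned on reads but are only physically removed when compactions run, so disk space is reclaimed lazily while the purge cost moves entirely to the server side.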
Re: High ingest rate and FIN_WAIT1 problems
Yes, hadoop 0.20.2 and hbase 0.20.5. I will get the branch you suggest and give it a whirl. I am leaving on vacation Thursday, so I may not have any results to report till I get back. When I do get back, I will catch up with versions/fixes and try some more. Meanwhile, thanks to all who have responded to my posts.

thomas downing

On 7/20/2010 1:06 PM, Stack wrote:
Hey Thomas:

You are using hadoop 0.20.2 or something? And hbase 0.20.5 or so? You might try http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/. In particular, it has HDFS-1118 "Fix socket leak on DFSClient".

St.Ack

On Tue, Jul 20, 2010 at 1:58 AM, Thomas Downing <tdown...@proteus-technologies.com> wrote:
Yes, I did try the timeout of 0. As expected, I did not see sockets in FIN_WAIT2 or TIME_WAIT for very long. I still leak sockets at the ingest rates I need - the FIN_WAIT1 problem.

Also, with the more careful observations this time around, I noted that even before the FIN_WAIT1 problem starts to crop up (at around 1600M inserts) there is already a slower socket leakage with timeout=0 and no FIN_WAIT1 problem. At 100M inserts sockets were hovering around 50-60, by 800M they were around 200, and at 1600M they were at 400. This is slower than without the timeout set to 0 (about half the rate), but it is still ultimately fatal. This socket increase is all between hbase and hadoop, none between the test client and hbase. While the FIN_WAIT1 problem is triggered by an hbase-side issue, I have no indication of which side causes this other leak.

thanks

thomas downing

On 7/19/2010 4:31 PM, Ryan Rawson wrote:
Did you try the setting I suggested? There is/was a known bug in HDFS which can cause issues which may include abandoned sockets such as you are describing.

-ryan

On Mon, Jul 19, 2010 at 2:13 AM, Thomas Downing <tdown...@proteus-technologies.com> wrote:
Thanks for the response, but my problem is not with FIN_WAIT2, it is with FIN_WAIT1. If it were FIN_WAIT2, the only concern would be socket leakage, and if setting the timeout solved the issue, that would be great.

The problem with FIN_WAIT1 is twofold. First, it is incumbent on the application to notice and handle this problem; from the TCP stack point of view, there is nothing wrong - it is just a special case of a slow consumer. The other problem is that it implies that something will be lost if the socket is abandoned: there is data in the send queue of a socket in FIN_WAIT1 that has not yet been delivered to the peer.

On 7/16/2010 3:56 PM, Ryan Rawson wrote:
I've been running with this setting on both the HDFS side and the HBase side for over a year now. It's a bit of voodoo, but you might be running into well-known suckage of HDFS. Try this one and restart your hbase/hdfs. The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so much for inserts.

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

-ryan

On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing <tdown...@proteus-technologies.com> wrote:
Thanks for the response. My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2; my problem is with FIN_WAIT1. While I do see some sockets in TIME_WAIT, they are only a few, and the number is not growing.

On 7/16/2010 12:07 PM, Hegner, Travis wrote:
Hi Thomas,

I ran into a very similar issue when running Slony-I on PostgreSQL to replicate 15-20 databases. Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to slow (or hopefully stop) the leaking sockets.
I found some notes about adjusting TCP parameters here: http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html

[snip]
Re: High ingest rate and FIN_WAIT1 problems
Thanks for the response, but my problem is not with FIN_WAIT2, it is with FIN_WAIT1. If it were FIN_WAIT2, the only concern would be socket leakage, and if setting the timeout solved the issue, that would be great.

The problem with FIN_WAIT1 is twofold. First, it is incumbent on the application to notice and handle this problem; from the TCP stack point of view, there is nothing wrong - it is just a special case of a slow consumer. The other problem is that it implies that something will be lost if the socket is abandoned: there is data in the send queue of a socket in FIN_WAIT1 that has not yet been delivered to the peer.

On 7/16/2010 3:56 PM, Ryan Rawson wrote:
I've been running with this setting on both the HDFS side and the HBase side for over a year now. It's a bit of voodoo, but you might be running into well-known suckage of HDFS. Try this one and restart your hbase/hdfs. The FIN_WAIT2/TIME_WAIT happens more on large concurrent gets, not so much for inserts.

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

-ryan

On Fri, Jul 16, 2010 at 9:33 AM, Thomas Downing <tdown...@proteus-technologies.com> wrote:
Thanks for the response. My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2; my problem is with FIN_WAIT1. While I do see some sockets in TIME_WAIT, they are only a few, and the number is not growing.

On 7/16/2010 12:07 PM, Hegner, Travis wrote:
Hi Thomas,

I ran into a very similar issue when running Slony-I on PostgreSQL to replicate 15-20 databases. Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to slow (or hopefully stop) the leaking sockets.

I found some notes about adjusting TCP parameters here: http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html

[snip]
High ingest rate and FIN_WAIT1 problems
Hi,

I am a complete HBase and HDFS newbie, so I apologize in advance for the inevitable bloopers.

We are doing feasibility testing on NoSQL data store options, with rather high ingest rate requirements. So far, HBase is looking good, with only one issue identified. Running at an ingest rate of ~30K rows per second on a 4-CPU 2.2 GHz, 8 GB RAM machine, I am slowly leaking sockets. This is a single node setup - no replication. The CPU load is only about 50%-60%, with the majority of that in userland; system and iowait are averaging less than 3%. There is no swapping going on.

The problem is that on the datanode there are a large number of sockets in FIN_WAIT1, with corresponding peers on the master in ESTABLISHED. These pairs hang around for quite some time, and at my ingest rate this means that the total number of sockets held by the datanode and master is slowly going up. If my understanding of TCP is correct, this indicates that the master peer has stopped reading incoming data from the datanode - i.e., it is advertising a window of zero - and that the datanode has called close(2) on its end.

There was a thread some time ago: http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg03329.html

There was no real conclusion. I have played with the config params as suggested on that thread, but no luck yet. Also, in that case the problem seemed to be between datanodes for replication operations - not the case with me. Changing timeouts to avoid the slow increase might not really solve the problem if the master peer has in fact ceased to read its socket: the data outstanding in the TCP stack buffer would be lost. Whether that would imply data loss is beyond me.

I am posting this here because, although the only logs with errors/exceptions are the datanode logs, netstat and wireshark seem to indicate that the problem is on the master side. The master, namenode, regionserver and zookeeper logs show no warnings or errors. The datanode log shows this, over and over:

2010-07-16 00:33:09,269 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1028643313-10.1.1.200-50010-1279026099917, infoPort=50075, ipcPort=50020):Got exception while serving blk_3684861726145519813_22386 to /127.0.0.1:
java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:54774]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
    at java.lang.Thread.run(Thread.java:619)

2010-07-16 00:33:09,269 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1028643313-10.1.1.200-50010-1279026099917, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:54774]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
    at java.lang.Thread.run(Thread.java:619)

If there is any other info that might help, or any steps you would like me to take, just let me know.

Thanks

Thomas Downing
Re: High ingest rate and FIN_WAIT1 problems
Thanks for the response. My understanding is that TCP_FIN_TIMEOUT affects only FIN_WAIT2; my problem is with FIN_WAIT1. While I do see some sockets in TIME_WAIT, they are only a few, and the number is not growing.

On 7/16/2010 12:07 PM, Hegner, Travis wrote:
Hi Thomas,

I ran into a very similar issue when running Slony-I on PostgreSQL to replicate 15-20 databases. Adjusting the TCP_FIN_TIMEOUT parameters for the kernel may help to slow (or hopefully stop) the leaking sockets.

I found some notes about adjusting TCP parameters here: http://www.hikaro.com/linux/tweaking-tcpip-syctl-conf.html

with the specific excerpt regarding TCP_FIN_TIMEOUT:

TCP_FIN_TIMEOUT
This setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. During this TIME_WAIT state, reopening the connection to the client costs less than establishing a new connection. By reducing the value of this entry, TCP/IP can release closed connections faster, making more resources available for new connections. Adjust this in the presence of many connections sitting in the TIME_WAIT state:

# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout

Try setting this lower on your master. You may also consider these on the same link:

TCP_TW_RECYCLE
It enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). The sysctl documentation incorrectly states the default as enabled. It can be changed to 1 (enabled) in many cases. Known to cause some issues with hoststated (load balancing and fail over) if enabled; should be used with caution.

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

TCP_TW_REUSE
This allows reusing sockets in TIME_WAIT state for new connections when it is safe from the protocol viewpoint. Default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle.

echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

The above commands will not persist across reboots, but the link explains how to make them persistent. The list experts may be able to give more insight on which, if any, of these settings are safe to manipulate, and what risks or issues you may encounter specifically with HBase while adjusting these settings.

Hope This Helps,

Travis Hegner

-----Original Message-----
From: Thomas Downing [mailto:tdown...@proteus-technologies.com]
Sent: Friday, July 16, 2010 10:33 AM
To: user@hbase.apache.org
Subject: High ingest rate and FIN_WAIT1 problems

Hi,

I am a complete HBase and HDFS newbie, so I apologize in advance for the inevitable bloopers.

We are doing feasibility testing on NoSQL data store options, with rather high ingest rate requirements. So far, HBase is looking good, with only one issue identified. Running at an ingest rate of ~30K rows per second on a 4-CPU 2.2 GHz, 8 GB RAM machine, I am slowly leaking sockets. This is a single node setup - no replication. The CPU load is only about 50%-60%, with the majority of that in userland; system and iowait are averaging less than 3%. There is no swapping going on.

The problem is that on the datanode there are a large number of sockets in FIN_WAIT1, with corresponding peers on the master in ESTABLISHED. These pairs hang around for quite some time, and at my ingest rate this means that the total number of sockets held by the datanode and master is slowly going up. If my understanding of TCP is correct, this indicates that the master peer has stopped reading incoming data from the datanode - i.e., it is advertising a window of zero - and that the datanode has called close(2) on its end.

There was a thread some time ago: http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg03329.html

There was no real conclusion. I have played with the config params as suggested on that thread, but no luck yet. Also, in that case the problem seemed to be between datanodes for replication operations - not the case with me. Changing timeouts to avoid the slow increase might not really solve the problem if the master peer has in fact ceased to read its socket: the data outstanding in the TCP stack buffer would be lost. Whether that would imply data loss is beyond me.

I am posting this here because, although the only logs with errors/exceptions are the datanode logs, netstat and wireshark seem to indicate that the problem is on the master side. The master, namenode, regionserver and zookeeper logs show no warnings or errors. The datanode log shows this, over and over:

2010-07-16 00:33:09,269 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-1028643313-10.1.1.200-50010-1279026099917, infoPort=50075, ipcPort=50020):Got exception while serving blk_3684861726145519813_22386 to /127.0.0.1:
java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:54774]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246