Re: HBase connection pool
Nick, I tried what you suggested, 1 HConnection and 1 Configuration for the entire app:

    this.config = HBaseConfiguration.create();
    this.connection = HConnectionManager.createConnection(config);

And thread-local pooled HTableInterfaces:

    final HConnection lconnection = this.connection;
    this.tlTable = new ThreadLocal<HTableInterface>() {
        @Override
        protected HTableInterface initialValue() {
            try {
                return lconnection.getTable("HBaseSerialWritesPOC");
                // return new HTable(tlConfig.get(), "HBaseSerialWritesPOC");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };

I started getting this error in my application:

2015-02-26 10:23:17,833 INFO [main-SendThread(xxx)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server xxx. Will not attempt to authenticate using SASL (unknown error)
2015-02-26 10:23:17,834 INFO [main-SendThread(xxx)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(849)) - Socket connection established to xxx, initiating session
2015-02-26 10:23:17,836 WARN [main-SendThread(xxx)] zookeeper.ClientCnxn (ClientCnxn.java:run(1089)) - Session 0x0 for server xxx, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

-Marcelo

From: ndimi...@gmail.com
Subject: Re: HBase connection pool

Okay, looks like you're using an implicitly managed connection. It should be fine to share a single config instance across all threads. The advantage of HTablePool over this approach is that the number of HTables would be managed independently from the number of threads. This may or may not be a concern for you, based on your memory requirements, etc. In your case, you're not specifying an ExecutorService per HTable, so the HTable instances will be relatively lightweight. Each table will manage its own write buffer, which can be shared by multiple threads when autoFlush is disabled and HTablePool is used. This may or may not be desirable, depending on your use case.

For what it's worth, HTablePool is marked deprecated in 1.0 and will likely be removed in 2.0. To future-proof this code, I would move to a single shared HConnection for the whole application, and a thread-local HTable created from/with that connection.

-n

On Wed, Feb 25, 2015 at 10:53 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

Hi Nick, I am using HBase version 0.96; I sent the link from version 0.94 because I haven't found the Java API docs for 0.96, sorry about that. I have created the HTable directly from the config object, as follows:

    this.tlConfig = new ThreadLocal<Configuration>() {
        @Override
        protected Configuration initialValue() {
            return HBaseConfiguration.create();
        }
    };
    this.tlTable = new ThreadLocal<HTable>() {
        @Override
        protected HTable initialValue() {
            try {
                return new HTable(tlConfig.get(), "HBaseSerialWritesPOC");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };

I am not sure if the Configuration object should be 1 per thread as well; maybe I could share this one?
So, just to clarify, would I get any advantage using an HTablePool object instead of a ThreadLocal<HTable> as I did?

-Marcelo

From: ndimi...@gmail.com
Subject: Re: HBase connection pool

Hi Marcelo,

First thing, to be clear, you're working with a 0.94 release? The reason I ask is we've been doing some work in this area to improve things, so semantics may be slightly different between 0.94, 0.98, and 1.0.

How are you managing the HConnection object (or are you)? How are you creating your HTable instances? These will determine how the connection is obtained and used in relation to HTables. In general, multiple HTable instances connected to tables in the same cluster should be sharing the same HConnection instance. This is handled explicitly when you manage your own HConnection and HTables (i.e., HConnection conn = ...; HTable t = new HTable(TABLE_NAME, conn);). It's handled implicitly when you construct via Configuration objects (HTable t = new HTable(conf, TABLE_NAME);). This implicit option is going away in future versions.

HTable is not safe for concurrent access because of how the write path is implemented (at least; there may be other portions that I'm not as familiar with). You should be perfectly fine to have an HTable per thread in a ThreadLocal.

-n

On Wed, Feb 25, 2015 at 9:41 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

In HBase API, does 1
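For readers following along, here is a minimal self-contained sketch of the pattern Nick recommends and Marcelo implements above (one shared HConnection for the application, with a thread-local table handle per worker). The class name is illustrative and error handling is simplified; the table name is the one used in the thread:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;

    public class SharedConnectionTables {
        private final HConnection connection;               // one per application
        private final ThreadLocal<HTableInterface> tlTable;

        public SharedConnectionTables(final String tableName) throws IOException {
            Configuration config = HBaseConfiguration.create();  // one shared config
            this.connection = HConnectionManager.createConnection(config);
            final HConnection conn = this.connection;
            this.tlTable = new ThreadLocal<HTableInterface>() {
                @Override
                protected HTableInterface initialValue() {
                    try {
                        // lightweight handle; every thread shares the one connection
                        return conn.getTable(tableName);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            };
        }

        public HTableInterface table() {
            return tlTable.get();
        }

        public void shutdown() throws IOException {
            // each thread should close its table first (flushing write buffers);
            // then close the shared connection once
            connection.close();
        }
    }

Only the HConnection holds the ZooKeeper session and region locations; the per-thread HTableInterface handles stay cheap, which is the point of the pattern. Usage would be `new SharedConnectionTables("HBaseSerialWritesPOC")` once at startup, then `table()` from any worker thread.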
Re: HBase scan time range, inconsistency
Ok… Silly question time… so just humor me for a second.

1) What do you mean by saying you have a partitioned HBase table? (Regions and partitions are not the same.)

2) There's a question of the isolation level during the scan. What happens when there is a compaction running or there's RLL (row-level locking) taking place? Does your scan get locked/blocked? Does it skip the row? (This should be documented.) Do you count the number of rows scanned when building the list of rows that need to be processed further?

On Feb 25, 2015, at 4:46 PM, Stephen Durfey sjdur...@gmail.com wrote:

"Are you writing any Deletes? Are you writing any duplicates?"

No physical deletes are occurring in my data, and there is a very real possibility of duplicates.

"How is the partitioning done?"

The key structure would be /partition_id/person_id. I'm dealing with clinical data, with a data source identified by the partition, and the person data is associated with that particular partition at load time.

"Are you doing the column filtering with a custom filter or one of the prepackaged ones?"

They appear to all be prepackaged filters: FamilyFilter, KeyOnlyFilter, QualifierFilter, and ColumnPrefixFilter are used under various conditions, depending upon what is requested on the Scan object.

On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey bus...@cloudera.com wrote:

Are you writing any Deletes? Are you writing any duplicates? How is the partitioning done? What does the entire key structure look like? Are you doing the column filtering with a custom filter or one of the prepackaged ones?

On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey sjdur...@gmail.com wrote:

"What's the TTL setting for your table? Which HBase release are you using? Was there compaction in between the scans? Thanks"

The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don't want to say compactions aren't a factor, but the jobs are short-lived (4-5 minutes), and I have run them frequently over the last couple of days trying to gather stats around what was being extracted, and trying to find the difference and intersection in row keys between job runs. These numbers have varied wildly, from being off by 2-3 between subsequent scans, to increases of 40 rows, followed by a drop of 70 rows.

"When you say there is a variation in the number of rows retrieved - the 40 rows that got increased - are those rows in the expected time range? Or is the system retrieving some rows which are not in the specified time range? And when the rows drop by 70, are you saying rows which needed to be retrieved got missed out?"

The best I can tell, if there is an increase in counts, those rows are not coming from outside of the time range. In the job, I am maintaining a list of rows that have a timestamp outside of my provided time range, and then writing those out to HDFS at the end of the map task. So far, nothing has been written out.

"Any filters in your scan? Regards, Ram"

There are some column filters. There is an API abstraction on top of HBase that I am using to allow users to easily extract data from columns that start with a provided column prefix. So, the column filters are in place to ensure I am only getting back data from columns that start with the provided prefix.

To add a little more detail, my row keys are separated out by partition. At periodic times (through Oozie), data is loaded from a source into the appropriate partition.
I ran some scans against a partition that hadn't been updated in almost a year (with a scan range around the times of the 2nd-to-last load into the table), and the row key counts were consistent across multiple scans. I chose another partition that is actively being updated once a day. I chose a scan time around the 4th most recent load, and the results were inconsistent from scan to scan (fluctuating up and down). Setting the begin time to 4 days in the past and the end time of the scan range to 'right now', using System.currentTimeMillis() (with the time being after the daily load), the results also fluctuated up and down. So, it kind of seems like there is some sort of temporal recency that is causing the counts to fluctuate.
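For reference, the setup described above reads like the following sketch (not the poster's actual code; the partition id and column prefix are illustrative, using the 0.94-era client API):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PartitionScan {
        // Time-range scan over one logical partition (/partition_id/person_id keys),
        // restricted to columns starting with a caller-supplied prefix.
        static Scan partitionScan(String partitionId, long beginMs, long endMs,
                                  String columnPrefix) throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("/" + partitionId + "/"));
            // stop row: the trailing '/' bumped to '0' ('/' + 1 in ASCII), i.e. the
            // smallest key sorting after every row in this partition
            scan.setStopRow(Bytes.toBytes("/" + partitionId + "0"));
            scan.setTimeRange(beginMs, endMs);   // half-open window in epoch millis
            scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes(columnPrefix)));
            return scan;
        }
    }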
HBase and YARN/MR
Hi all,

We're currently moving to Hadoop 2 (years behind, I know) and debating how to handle job resource management using YARN, where nearly 100% of our jobs are maps over HBase tables and a large portion also reduce to HBase. While YARN adequately handles the resources of the machine its tasks are running on, for TableMapper jobs the resources consumed are actually on the remote RegionServer, which YARN doesn't seem to be able to recognize. We've implemented things such as per-job concurrent task limits to help deal with this on Hadoop 1, but that seems hard to do in Hadoop 2. I'm wondering if anyone has best practices or any ideas on how to deal with an all-HBase, heavily I/O-bound and RegionServer memory/RPC-bound workload? Thanks in advance!

--Ian
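No replies to this thread are archived here, but as a hedged aside: the usual first lever for RegionServer-bound TableMapper jobs is the scan the job is built on, since caching and block-cache settings directly shape the RPC and memory load each task puts on the remote RegionServer. A generic sketch (table and class names are illustrative, not from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanJobExample {
        // trivial pass-through mapper, just to make the sketch complete
        static class PassThroughMapper
                extends TableMapper<ImmutableBytesWritable, Result> { }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "scan-example");
            job.setJarByClass(ScanJobExample.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // more rows per RPC: fewer round trips per task
            scan.setCacheBlocks(false);  // full scans shouldn't churn the RS block cache

            TableMapReduceUtil.initTableMapperJob(
                "my_table",              // hypothetical table name
                scan, PassThroughMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);  // sketch discards output
            job.waitForCompletion(true);
        }
    }

This doesn't answer the cluster-level concurrency question (YARN still can't see remote RegionServer load), but it bounds what each task asks of the RegionServers.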
Re: HBase scan time range, inconsistency
"1) What do you mean by saying you have a partitioned HBase table? (Regions and partitions are not the same.)"

By partitions, I just mean logical partitions, using the row key to keep data from separate data sources apart from each other.

I think the issue may be resolved now, but it isn't obvious to me why the change works. The table is set to save the max number of versions, but the number of versions was not specified in the Scan object. Once I changed the Scan to request the max number of versions, the counts remained the same across all subsequent job runs. Can anyone provide some insight as to why this is the case?
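On the open "why" question above: a hedged explanation, consistent with 0.94-era scan semantics, is that with the default of one version per column the scanner keeps only the newest version of each cell and then applies the time-range check, so a newer write falling outside the range can shadow an older in-range version, and whether it does can depend on flush/compaction timing; hence the "temporal recency" fluctuation. Requesting all versions lets older in-range cells be returned regardless of newer writes. A minimal sketch of the change described (names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Scan;

    public class TimeRangeScanFix {
        static Scan timeRangeScan(long beginMs, long endMs) throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(beginMs, endMs); // half-open window in epoch millis
            scan.setMaxVersions();             // all versions, so in-range cells aren't
                                               // shadowed by newer out-of-range writes
            return scan;
        }
    }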
Re: HBase connection pool
Can you tell when these WARN messages are produced? Is it related to the creation of the connection object or one of the HTable instances?
Re: oldWALs: what it is and how can I clean it?
Hi,

Replication is not turned on in HBase... Should this folder be cleaned regularly? Because I have data from December 2014...

2015-02-26 1:40 GMT+01:00 Liam Slusser lslus...@gmail.com:

I'm having this same problem. I had replication enabled, but it has since been disabled. However, oldWALs still grows. There are so many files in there that running hadoop fs -ls /hbase/oldWALs runs out of memory.

On Wed, Feb 25, 2015 at 9:27 AM, Nishanth S nishanth.2...@gmail.com wrote:

Do you have replication turned on in HBase, and if so, is your slave consuming the replicated data?

-Nishanth

On Wed, Feb 25, 2015 at 10:19 AM, Madeleine Piffaretti mpiffare...@powerspace.com wrote:

Hi all,

We are running out of space in our small Hadoop cluster, so I was checking disk usage on HDFS and I saw that most of the space was occupied by the /hbase/oldWALs folder. I have checked in the HBase Definitive Guide and other books and web sites, and I have also searched for my issue on Google, but I didn't find a proper response... So I would like to know what this folder is, what it is used for, and also how I can free space from this folder without breaking everything... If it's related to a specific version, our cluster is on 5.3.0-1.cdh5.3.0.p0.30 from Cloudera (HBase 0.98.6).

Thanks for your help!
Re: oldWALs: what it is and how can I clean it?
@Madeleine,

The folder gets cleaned regularly by a chore in the master. When a WAL file is not needed any more for recovery purposes (when HBase can guarantee it has flushed all the data in the WAL file), it is moved to the oldWALs folder for archival. The log stays there until all other references to the WAL file are finished.

There are currently two services which may keep the files in the archive dir. The first is a TTL process, which ensures that the WAL files are kept for at least 10 min. This is mainly for debugging. You can reduce this time by setting the hbase.master.logcleaner.ttl configuration property in the master. It defaults to 600000 (milliseconds, i.e. 10 minutes). The other one is replication. If you have replication set up, the replication processes will hang on to the WAL files until they are replicated. Even if you have disabled replication, the files are still referenced.

You can look at the master logs from the classes LogCleaner, TimeToLiveLogCleaner, and ReplicationLogCleaner to see whether the master is actually running this chore and whether it is getting any exceptions.

@Liam,

Disabled replication will still hold on to the WAL files, because replication guarantees no data loss between disable and enable. You can remove_peer, which frees up the WAL files to be eligible for deletion. When you re-add the replication peer again, replication will start from the current state, whereas if you re-enable a peer, it will continue from where it left off.
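For reference, here is a hedged sketch of the two remedies described above. First, lowering the archive TTL on the master (the 60000 value is only an example; the property name and millisecond unit are per the explanation above):

    <!-- hbase-site.xml on the master -->
    <property>
      <name>hbase.master.logcleaner.ttl</name>
      <value>60000</value>  <!-- keep archived WALs 1 minute instead of the default 10 -->
    </property>

And second, removing (rather than merely disabling) a replication peer from the HBase shell, which makes its pinned WALs eligible for deletion; the peer id '1' here is hypothetical, so check list_peers first:

    hbase> list_peers
    hbase> remove_peer '1'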
Re: oldWALs: what it is and how can I clean it?
I'm not able to actually look inside the folder, as Java runs out of memory trying to do a directory listing... I haven't had more time to look into the problem.
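Two hedged workarounds for the listing OOM, assuming (as is typical for directories with millions of entries) that it is the FsShell client JVM heap that is exhausted:

    # give the client JVM a larger heap for the one-off listing
    HADOOP_CLIENT_OPTS="-Xmx4g" hadoop fs -ls /hbase/oldWALs

    # or summarize server-side without materializing the listing in the client
    hadoop fs -count /hbase/oldWALs
    hadoop fs -du -s /hbase/oldWALs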
Re: oldWALs: what it is and how can I clean it?
Huge thanks, Enis, that was the information I was looking for. Cheers!

liam