Re: HBase connection pool

2015-02-26 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Nick,

I tried what you suggested, 1 HConnection and 1 Configuration for the entire 
app:

this.config = HBaseConfiguration.create();
this.connection = HConnectionManager.createConnection(config);

And Threaded pooled HTableInterfaces:

final HConnection lconnection = this.connection;
this.tlTable = new ThreadLocal<HTableInterface>() {
    @Override
    protected HTableInterface initialValue() {
        try {
            return lconnection.getTable(HBaseSerialWritesPOC);
            // return new HTable(tlConfig.get(),
            //         HBaseSerialWritesPOC);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
};

I started getting this error in my application:

2015-02-26 10:23:17,833 INFO  [main-SendThread(xxx)] zookeeper.ClientCnxn 
(ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server 
xxx. Will not attempt to authenticate using SASL (unknown error)
2015-02-26 10:23:17,834 INFO  [main-SendThread(xxx)] zookeeper.ClientCnxn 
(ClientCnxn.java:primeConnection(849)) - Socket connection established to xxx, 
initiating session
2015-02-26 10:23:17,836 WARN  [main-SendThread(xxx)] zookeeper.ClientCnxn 
(ClientCnxn.java:run(1089)) - Session 0x0 for server xxx, unexpected error, 
closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)


-Marcelo

From: ndimi...@gmail.com 
Subject: Re: HBase connection pool

Okay, looks like you're using an implicitly managed connection. It should be 
fine to share a single config instance across all threads. The advantage of 
HTablePool over this approach is that the number of HTables would be managed 
independently from the number of threads. This may or may not be a concern for you, 
based on your memory requirements, etc. In your case, you're not specifying an 
ExecutorService per HTable, so the HTable instances will be relatively 
lightweight. Each table will manage its own write buffer, which can be shared by 
multiple threads when autoFlush is disabled and HTablePool is used. This may or 
may not be desirable, depending on your use-case.

For what it's worth, HTablePool is marked deprecated in 1.0 and will likely be 
removed in 2.0. To future-proof this code, I would move to a single shared 
HConnection for the whole application, and a thread-local HTable created 
from/with that connection.
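
Something like the following would fit that shape (a rough sketch against the
0.96-era client API; the class and table names are made up): one HConnection
for the whole application, a per-thread HTableInterface created from it, and an
explicit shutdown that closes the shared connection.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class SharedConnectionTables {
    private final Configuration config = HBaseConfiguration.create();
    private final HConnection connection;
    private final ThreadLocal<HTableInterface> tlTable;

    public SharedConnectionTables(final String tableName) throws IOException {
        this.connection = HConnectionManager.createConnection(config);
        this.tlTable = new ThreadLocal<HTableInterface>() {
            @Override
            protected HTableInterface initialValue() {
                try {
                    // Tables created from the shared connection are lightweight.
                    return connection.getTable(tableName);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        };
    }

    public HTableInterface table() {
        return tlTable.get();
    }

    public void shutdown() throws IOException {
        // Each thread should close its own table first; closing the shared
        // connection last releases the underlying ZooKeeper session.
        connection.close();
    }
}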

-n

On Wed, Feb 25, 2015 at 10:53 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

Hi Nick, 

I am using HBase version 0.96; I sent the link for version 0.94 because I 
haven't found the Java API docs for 0.96, sorry about that.
I have created the HTable directly from the config object, as follows:


this.tlConfig = new ThreadLocal<Configuration>() {
    @Override
    protected Configuration initialValue() {
        return HBaseConfiguration.create();
    }
};
this.tlTable = new ThreadLocal<HTable>() {
    @Override
    protected HTable initialValue() {
        try {
            return new HTable(tlConfig.get(), HBaseSerialWritesPOC);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
};

I am not sure if the Configuration object should be one per thread as well; maybe 
I could share this one? 

So, just to clarify, would I get any advantage from using an HTablePool object 
instead of the ThreadLocal<HTable> approach I used?

-Marcelo

From: ndimi...@gmail.com 
Subject: Re: HBase connection pool

Hi Marcelo,

First thing, to be clear, you're working with a 0.94 release? The reason I ask 
is we've been doing some work in this area to improve things, so semantics may 
be slightly different between 0.94, 0.98, and 1.0.

How are you managing the HConnection object (or are you)? How are you creating 
your HTable instances? These will determine how the connection is obtained and 
used in relation to HTables.

In general, multiple HTable instances connected to tables in the same cluster 
should be sharing the same HConnection instance. This is handled explicitly 
when you manage your own HConnection and HTables (i.e., HConnection conn = ... 
; HTable t = new HTable(TABLE_NAME, conn);). It's handled implicitly when you 
construct via Configuration objects (HTable t = new HTable(conf, TABLE_NAME);). 
This implicit option is going away in future versions.
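
To make the contrast concrete, a brief sketch of the two construction styles
(0.96-era API; the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;

public class ConnectionStyles {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Explicit: you own the connection; every table obtained from it
        // shares that one connection, and you close both yourself.
        HConnection conn = HConnectionManager.createConnection(conf);
        HTableInterface explicitTable = conn.getTable("my_table");
        explicitTable.close();
        conn.close();

        // Implicit: the connection is looked up and managed behind the
        // scenes from the Configuration. This is the style being phased out.
        HTable implicitTable = new HTable(conf, "my_table");
        implicitTable.close();
    }
}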

HTable is not safe for concurrent access because of how the write path is 
implemented (at least; there may be other portions that I'm not as familiar 
with). You should be perfectly fine to have an HTable per thread in a 
ThreadLocal.

-n

On Wed, Feb 25, 2015 at 9:41 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

In HBase API, does 1 

Re: HBase scan time range, inconsistency

2015-02-26 Thread Michael Segel
Ok… 

Silly question time… so just humor me for a second.

1) What do you mean by saying you have a partitioned HBase table?  (Regions 
and partitions are not the same) 

2) There’s a question of the isolation level during the scan. What happens when 
there is a compaction running or there’s RLL taking place? 

Does your scan get locked/blocked? Does it skip the row? 
(This should be documented.) 
Do you count the number of rows scanned when building the list of rows that 
need to be processed further? 





 On Feb 25, 2015, at 4:46 PM, Stephen Durfey sjdur...@gmail.com wrote:

 
 
 Are you writing any Deletes? Are you writing any duplicates?
 
 
 No physical deletes are occurring in my data, and there is a very real
 possibility of duplicates.
 
 How is the partitioning done?
 
 
 The key structure would be /partition_id/person_id  I'm dealing with
 clinical data, with a data source identified by the partition, and the
 person data is associated with that particular partition at load time.
 
 Are you doing the column filtering with a custom filter or one of the
 prepackaged ones?
 
 
 They appear to be all prepackaged filters:  FamilyFilter, KeyOnlyFilter,
 QualifierFilter, and ColumnPrefixFilter are used under various conditions,
 depending upon what is requested on the Scan object.
 
 
 On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey bus...@cloudera.com wrote:
 
 Are you writing any Deletes? Are you writing any duplicates?
 
 How is the partitioning done?
 
 What does the entire key structure look like?
 
 Are you doing the column filtering with a custom filter or one of the
 prepackaged ones?
 
 On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey sjdur...@gmail.com
 wrote:
 
 
 What's the TTL setting for your table ?
 
 Which hbase release are you using ?
 
 Was there compaction in between the scans ?
 
 Thanks
 
 
 The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
 want to say compactions aren’t a factor, but the jobs are short lived
 (4-5
 minutes), and I have run them frequently over the last couple of days
 trying to gather stats around what was being extracted, and trying to
 find
 the difference and intersection in row keys before job runs.
 
 These numbers have varied wildly, from being off by 2-3 between
 
 subsequent scans to 40 row increases, followed by a drop of 70 rows.
 When you say there is a variation in the number of rows retrieved - the
 40
 rows that got increased - are those rows in the expected time range? Or
 is
 the system retrieving some rows which are not in the specified time
 range?
 
 And when the rows drop by 70, are you saying any rows which needed to be
 retrieved got missed out?
 
 
 The best I can tell, if there is an increase in counts, those rows are
 not
 coming from outside of the time range. In the job, I am maintaining a
 list
 of rows that have a timestamp outside of my provided time range, and then
 writing those out to hdfs at the end of the map task. So far, nothing has
 been written out.
 
 Any filters in your scan?
 
 
 
 Regards
 Ram
 
 
 There are some column filters. There is an API abstraction on top of
 hbase
 that I am using to allow users to easily extract data from columns that
 start with a provided column prefix. So, the column filters are in place
 to
 ensure I am only getting back data from columns that start with the
 provided prefix.
 
 To add a little more detail, my row keys are separated out by partition.
 At
 periodic times (through oozie), data is loaded from a source into the
 appropriate partition. I ran some scans against a partition that hadn't
 been updated in almost a year (with a scan range around the times of the
 2nd to last load into the table), and the row key counts were consistent
 across multiple scans. I chose another partition that is actively being
 updated once a day. I chose a scan time around the 4th most recent load,
 and the results were inconsistent from scan to scan (fluctuating up and
 down). Setting the begin time to 4 days in the past and the end time on the scan
 range to 'right now', using System.currentTimeMillis() (with the time
 being
 after the daily load), the results also fluctuated up and down. So, it
 kind
 of seems like there is some sort of temporal recency that is causing the
 counts to fluctuate.
 
 
 
 On Feb 24, 2015, at 10:20 PM, ramkrishna vasudevan 
 ramkrishna.s.vasude...@gmail.com wrote:
 
 These numbers have varied wildly, from being off by 2-3 between
 
 subsequent scans to 40 row increases, followed by a drop of 70 rows.
 When you say there is a variation in the number of rows retrieved - the
 40
 rows that got increased - are those rows in the expected time range? Or
 is
 the system retrieving some rows which are not in the specified time
 range?
 
 And when the rows drop by 70, are you saying any rows which needed to be
 retrieved got missed out?
 
 Any filters in your scan?
 
 Regards
 Ram
 
 On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu yuzhih...@gmail.com wrote:
 
 

HBase and YARN/MR

2015-02-26 Thread Ian Friedman
Hi all,

We're currently moving to Hadoop 2 (years behind, I know) and debating how to 
handle job resource management using YARN where nearly 100% of our jobs are 
maps over HBase Tables and a large portion also Reduce to HBase. While YARN 
adequately handles the resources of the machine its tasks are running on, for 
the purposes of TableMapper jobs, the resources consumed are actually on the 
remote regionserver, which YARN doesn't seem to be able to recognize. We've 
implemented things such as per-job concurrent task limits to help deal with 
this on Hadoop 1, but that seems hard to do in Hadoop 2. I'm wondering if 
anyone has best practices or any ideas on how to deal with an all-HBase, 
heavily I/O- and RegionServer memory/RPC-bound workload? Thanks in advance!

--Ian

Re: HBase scan time range, inconsistency

2015-02-26 Thread Stephen Durfey

 1) What do you mean by saying you have a partitioned HBase table?
 (Regions and partitions are not the same)


By partitions, I just mean logical partitions, using the row key to keep
data from separate data sources apart from each other.

I think the issue may be resolved now, but it isn't obvious to me why the
change works. The table is set to save the max number of versions, but
the number of versions is not specified in the Scan object. Once I changed
the Scan to request the max number of versions, the counts remained the same
across all subsequent job runs. Can anyone provide some insight as to why
this is the case?
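
For reference, roughly what the fixed scan looks like (a sketch against the
0.94-era client API; the table name and column prefix below are made up, and
the prefix filter just mirrors the kind of column filtering described earlier
in the thread):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScanExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "clinical_data"); // table name made up
        try {
            long end = System.currentTimeMillis();
            long begin = end - 4L * 24 * 60 * 60 * 1000; // "4 days in the past"
            Scan scan = new Scan();
            scan.setTimeRange(begin, end);
            scan.setMaxVersions(); // the fix: ask for all versions, not just the latest
            scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("obs_"))); // prefix made up
            ResultScanner scanner = table.getScanner(scan);
            try {
                long rows = 0;
                for (Result r : scanner) {
                    rows++;
                }
                System.out.println("rows in range: " + rows);
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}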

On Thu, Feb 26, 2015 at 8:35 AM, Michael Segel michael_se...@hotmail.com
wrote:

 Ok…

 Silly question time… so just humor me for a second.

 1) What do you mean by saying you have a partitioned HBase table?
 (Regions and partitions are not the same)

 2) There’s a question of the isolation level during the scan. What happens
 when there is a compaction running or there’s RLL taking place?

 Does your scan get locked/blocked? Does it skip the row?
 (This should be documented.)
 Do you count the number of rows scanned when building the list of rows
 that need to be processed further?





  On Feb 25, 2015, at 4:46 PM, Stephen Durfey sjdur...@gmail.com wrote:

 
 
  Are you writing any Deletes? Are you writing any duplicates?
 
 
  No physical deletes are occurring in my data, and there is a very real
  possibility of duplicates.
 
  How is the partitioning done?
 
 
  The key structure would be /partition_id/person_id  I'm dealing with
  clinical data, with a data source identified by the partition, and the
  person data is associated with that particular partition at load time.
 
  Are you doing the column filtering with a custom filter or one of the
  prepackaged ones?
 
 
  They appear to be all prepackaged filters:  FamilyFilter, KeyOnlyFilter,
  QualifierFilter, and ColumnPrefixFilter are used under various
 conditions,
  depending upon what is requested on the Scan object.
 
 
  On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey bus...@cloudera.com
 wrote:
 
  Are you writing any Deletes? Are you writing any duplicates?
 
  How is the partitioning done?
 
  What does the entire key structure look like?
 
  Are you doing the column filtering with a custom filter or one of the
  prepackaged ones?
 
  On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey sjdur...@gmail.com
  wrote:
 
 
  What's the TTL setting for your table ?
 
  Which hbase release are you using ?
 
  Was there compaction in between the scans ?
 
  Thanks
 
 
  The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I
 don’t
  want to say compactions aren’t a factor, but the jobs are short lived
  (4-5
  minutes), and I have run them frequently over the last couple of days
  trying to gather stats around what was being extracted, and trying to
  find
  the difference and intersection in row keys before job runs.
 
  These numbers have varied wildly, from being off by 2-3 between
 
  subsequent scans to 40 row increases, followed by a drop of 70 rows.
  When you say there is a variation in the number of rows retrieved -
 the
  40
  rows that got increased - are those rows in the expected time range?
 Or
  is
  the system retrieving some rows which are not in the specified time
  range?
 
  And when the rows drop by 70, are you saying any rows which needed to be
  retrieved got missed out?
 
 
  The best I can tell, if there is an increase in counts, those rows are
  not
  coming from outside of the time range. In the job, I am maintaining a
  list
  of rows that have a timestamp outside of my provided time range, and
 then
  writing those out to hdfs at the end of the map task. So far, nothing
 has
  been written out.
 
  Any filters in your scan?
 
 
 
  Regards
  Ram
 
 
  There are some column filters. There is an API abstraction on top of
  hbase
  that I am using to allow users to easily extract data from columns that
  start with a provided column prefix. So, the column filters are in
 place
  to
  ensure I am only getting back data from columns that start with the
  provided prefix.
 
  To add a little more detail, my row keys are separated out by
 partition.
  At
  periodic times (through oozie), data is loaded from a source into the
  appropriate partition. I ran some scans against a partition that hadn't
  been updated in almost a year (with a scan range around the times of
 the
  2nd to last load into the table), and the row key counts were
 consistent
  across multiple scans. I chose another partition that is actively being
  updated once a day. I chose a scan time around the 4th most recent
 load,
  and the results were inconsistent from scan to scan (fluctuating up and
  down). Setting the begin time to 4 days in the past and the end time on the
 scan
  range to 'right now', using System.currentTimeMillis() (with the time
  being
  after the daily load), the results also fluctuated up and down. 

Re: HBase connection pool

2015-02-26 Thread Nick Dimiduk
Can you tell when these WARN messages are produced? Is it related to the
creation of the connection object or one of the HTable instances?

On Thu, Feb 26, 2015 at 7:27 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

 Nick,

 I tried what you suggested, 1 HConnection and 1 Configuration for the
 entire app:

 this.config = HBaseConfiguration.create();
 this.connection = HConnectionManager.createConnection(config);

 And Threaded pooled HTableInterfaces:

 final HConnection lconnection = this.connection;
 this.tlTable = new ThreadLocal<HTableInterface>() {
     @Override
     protected HTableInterface initialValue() {
         try {
             return lconnection.getTable(HBaseSerialWritesPOC);
             // return new HTable(tlConfig.get(),
             //         HBaseSerialWritesPOC);
         } catch (IOException e) {
             throw new RuntimeException(e);
         }
     }
 };

 I started getting this error in my application:

 2015-02-26 10:23:17,833 INFO [main-SendThread(xxx)] zookeeper.ClientCnxn
 (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to
 server xxx. Will not attempt to authenticate using SASL (unknown error)
 2015-02-26 10:23:17,834 INFO [main-SendThread(xxx)] zookeeper.ClientCnxn
 (ClientCnxn.java:primeConnection(849)) - Socket connection established to
 xxx, initiating session
 2015-02-26 10:23:17,836 WARN [main-SendThread(xxx)] zookeeper.ClientCnxn
 (ClientCnxn.java:run(1089)) - Session 0x0 for server xxx, unexpected error,
 closing socket connection and attempting reconnect
 java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
 at
 org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
 at
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)


 -Marcelo

 From: ndimi...@gmail.com
 Subject: Re: HBase connection pool

 Okay, looks like you're using an implicitly managed connection. It should
 be fine to share a single config instance across all threads. The advantage
 of HTablePool over this approach is that the number of HTables would be
 managed independently from the number of threads. This may or may not be a
 concern for you, based on your memory requirements, etc. In your case,
 you're not specifying an ExecutorService per HTable, so the HTable
 instances will be relatively lightweight. Each table will manage its own
 write buffer, which can be shared by multiple threads when autoFlush is
 disabled and HTablePool is used. This may or may not be desirable,
 depending on your use-case.

 For what it's worth, HTablePool is marked deprecated in 1.0 and will likely
 be removed in 2.0. To future-proof this code, I would move to a single
 shared HConnection for the whole application, and a thread-local HTable
 created from/with that connection.

 -n

 On Wed, Feb 25, 2015 at 10:53 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net wrote:

 Hi Nick,

 I am using HBase version 0.96; I sent the link for version 0.94 because
 I haven't found the Java API docs for 0.96, sorry about that.
 I have created the HTable directly from the config object, as follows:

 this.tlConfig = new ThreadLocal<Configuration>() {
     @Override
     protected Configuration initialValue() {
         return HBaseConfiguration.create();
     }
 };
 this.tlTable = new ThreadLocal<HTable>() {
     @Override
     protected HTable initialValue() {
         try {
             return new HTable(tlConfig.get(), HBaseSerialWritesPOC);
         } catch (IOException e) {
             throw new RuntimeException(e);
         }
     }
 };

 I am not sure if the Configuration object should be one per thread as well;
 maybe I could share this one?

 So, just to clarify, would I get any advantage from using an HTablePool object
 instead of the ThreadLocal<HTable> approach I used?

 -Marcelo

 From: ndimi...@gmail.com
 Subject: Re: HBase connection pool

 Hi Marcelo,

 First thing, to be clear, you're working with a 0.94 release? The reason
 I ask is we've been doing some work in this area to improve things, so
 semantics may be slightly different between 0.94, 0.98, and 1.0.

 How are you managing the HConnection object (or are you)? How are you
 creating your HTable instances? These will determine how the connection is
 obtained and used in relation to HTables.

 In general, multiple HTable instances connected to tables in the same
 cluster should be sharing the same HConnection instance. This is handled
 explicitly when you manage your own HConnection and HTables (i.e.,
 HConnection conn = ... ; HTable t = new HTable(TABLE_NAME, conn); ) It's
 handled implicitly when you construct via Configuration objects (HTable t =
 new HTable(conf, TABLE_NAME); ) This implicit option is going away in
 future versions.

 HTable is not safe for concurrent access because of how the 

Re: oldWALs: what it is and how can I clean it?

2015-02-26 Thread Madeleine Piffaretti
Hi,

Replication is not turned on in HBase...
Should this folder be cleaned regularly? Because I have data from
December 2014...


2015-02-26 1:40 GMT+01:00 Liam Slusser lslus...@gmail.com:

 I'm having this same problem.  I had replication enabled but it has since
 been disabled.  However, oldWALs still grows.  There are so many files in
 there that running hadoop fs -ls /hbase/oldWALs runs out of memory.

 On Wed, Feb 25, 2015 at 9:27 AM, Nishanth S nishanth.2...@gmail.com
 wrote:

  Do you have replication turned on in hbase and  if so is your slave
   consuming the replicated data?.
 
  -Nishanth
 
  On Wed, Feb 25, 2015 at 10:19 AM, Madeleine Piffaretti 
  mpiffare...@powerspace.com wrote:
 
   Hi all,
  
    We are running out of space in our small hadoop cluster so I was checking
    disk usage on HDFS and I saw that most of the space was occupied by the
    /hbase/oldWALs folder.

    I have checked in the HBase Definitive Book, other books, and web-sites,
    and I have also searched for my issue on Google, but I didn't find a proper
    response...

    So I would like to know what this folder is, what it is used for, and also
    how I can free space from it without breaking everything...
  
  
   If it's related to a specific version... our cluster is under
   5.3.0-1.cdh5.3.0.p0.30 from cloudera (hbase 0.98.6).
  
   Thx for your help!
  
 



Re: oldWALs: what it is and how can I clean it?

2015-02-26 Thread Enis Söztutar
@Madeleine,

The folder gets cleaned regularly by a chore in the master. When a WAL file is
not needed any more for recovery purposes (when HBase can guarantee it has
flushed all the data in the WAL file), it is moved to the oldWALs folder for
archival. The log stays there until all other references to the WAL file are
finished. There are currently two services which may keep the files in the
archive dir. The first is a TTL process, which ensures that the WAL files are
kept for at least 10 min. This is mainly for debugging. You can reduce this
time by setting the hbase.master.logcleaner.ttl configuration property on the
master. It is 600000 (10 minutes) by default. The other one is replication.
If you have a replication setup, the replication processes will hang on to
the WAL files until they are replicated. Even if you disabled the
replication, the files are still referenced.

You can look at the logs from the master for the classes (LogCleaner,
TimeToLiveLogCleaner, ReplicationLogCleaner) to see whether the master is
actually running this chore and whether it is getting any exceptions.

@Liam,
Disabled replication will still hold on to the WAL files because
it has a guarantee to not lose data between disable and enable. You can
remove_peer, which frees up the WAL files to be eligible for deletion. When
you re-add the replication peer again, replication will start from the
current state, whereas if you re-enable a peer, it will continue from where
it left off.
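
For reference, the remove_peer step expressed in code (a rough sketch against
the 0.98-era client API; the peer id "1" below is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;

public class DropStalePeer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ReplicationAdmin replicationAdmin = new ReplicationAdmin(conf);
        try {
            // Dropping the peer means its queued WALs are no longer referenced,
            // so the master's log cleaner can delete them from /hbase/oldWALs.
            replicationAdmin.removePeer("1");
        } finally {
            replicationAdmin.close();
        }
    }
}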



On Thu, Feb 26, 2015 at 12:56 AM, Madeleine Piffaretti 
mpiffare...@powerspace.com wrote:

 Hi,

 Replication is not turned on in HBase...
 Should this folder be cleaned regularly? Because I have data from
 December 2014...


 2015-02-26 1:40 GMT+01:00 Liam Slusser lslus...@gmail.com:

  I'm having this same problem.  I had replication enabled but have since
  been disabled.  However oldWALs still grows.  There are so many files in
  there that running hadoop fs -ls /hbase/oldWALs runs out of memory.
 
  On Wed, Feb 25, 2015 at 9:27 AM, Nishanth S nishanth.2...@gmail.com
  wrote:
 
   Do you have replication turned on in hbase and  if so is your slave
consuming the replicated data?.
  
   -Nishanth
  
   On Wed, Feb 25, 2015 at 10:19 AM, Madeleine Piffaretti 
   mpiffare...@powerspace.com wrote:
  
Hi all,
   
 We are running out of space in our small hadoop cluster so I was checking
 disk usage on HDFS and I saw that most of the space was occupied by the
 /hbase/oldWALs folder.

 I have checked in the HBase Definitive Book, other books, and web-sites,
 and I have also searched for my issue on Google, but I didn't find a proper
 response...

 So I would like to know what this folder is, what it is used for, and also
 how I can free space from it without breaking everything...
   
   
If it's related to a specific version... our cluster is under
5.3.0-1.cdh5.3.0.p0.30 from cloudera (hbase 0.98.6).
   
Thx for your help!
   
  
 



Re: oldWALs: what it is and how can I clean it?

2015-02-26 Thread Liam Slusser
I'm not able to actually look inside the folder as java runs out of memory
trying to do a directory listing...I haven't had more time to look into the
problem.

On Thu, Feb 26, 2015 at 12:56 AM, Madeleine Piffaretti 
mpiffare...@powerspace.com wrote:

 Hi,

  Replication is not turned on in HBase...
  Should this folder be cleaned regularly? Because I have data from
  December 2014...


 2015-02-26 1:40 GMT+01:00 Liam Slusser lslus...@gmail.com:

  I'm having this same problem.  I had replication enabled but have since
  been disabled.  However oldWALs still grows.  There are so many files in
  there that running hadoop fs -ls /hbase/oldWALs runs out of memory.
 
  On Wed, Feb 25, 2015 at 9:27 AM, Nishanth S nishanth.2...@gmail.com
  wrote:
 
   Do you have replication turned on in hbase and  if so is your slave
consuming the replicated data?.
  
   -Nishanth
  
   On Wed, Feb 25, 2015 at 10:19 AM, Madeleine Piffaretti 
   mpiffare...@powerspace.com wrote:
  
Hi all,
   
 We are running out of space in our small hadoop cluster so I was checking
 disk usage on HDFS and I saw that most of the space was occupied by the
 /hbase/oldWALs folder.

 I have checked in the HBase Definitive Book, other books, and web-sites,
 and I have also searched for my issue on Google, but I didn't find a proper
 response...

 So I would like to know what this folder is, what it is used for, and also
 how I can free space from it without breaking everything...
   
   
If it's related to a specific version... our cluster is under
5.3.0-1.cdh5.3.0.p0.30 from cloudera (hbase 0.98.6).
   
Thx for your help!
   
  
 



Re: oldWALs: what it is and how can I clean it?

2015-02-26 Thread Liam Slusser
Huge thanks, Enis, that was the information I was looking for.

Cheers!
liam


On Thu, Feb 26, 2015 at 3:48 PM, Enis Söztutar enis@gmail.com wrote:

 @Madeleine,

  The folder gets cleaned regularly by a chore in the master. When a WAL file
  is not needed any more for recovery purposes (when HBase can guarantee it
  has flushed all the data in the WAL file), it is moved to the oldWALs folder
  for archival. The log stays there until all other references to the WAL file
  are finished. There are currently two services which may keep the files in
  the archive dir. The first is a TTL process, which ensures that the WAL
  files are kept for at least 10 min. This is mainly for debugging. You can
  reduce this time by setting the hbase.master.logcleaner.ttl configuration
  property on the master. It is 600000 (10 minutes) by default. The other one
  is replication. If you have a replication setup, the replication processes
  will hang on to the WAL files until they are replicated. Even if you
  disabled the replication, the files are still referenced.

  You can look at the logs from the master for the classes (LogCleaner,
 TimeToLiveLogCleaner, ReplicationLogCleaner) to see whether the master is
 actually running this chore and whether it is getting any exceptions.

 @Liam,
  Disabled replication will still hold on to the WAL files because it has a
  guarantee to not lose data between disable and enable. You can remove_peer,
  which frees up the WAL files to be eligible for deletion. When you re-add
  the replication peer again, replication will start from the current state,
  whereas if you re-enable a peer, it will continue from where it left off.



 On Thu, Feb 26, 2015 at 12:56 AM, Madeleine Piffaretti 
 mpiffare...@powerspace.com wrote:

  Hi,
 
   Replication is not turned on in HBase...
   Should this folder be cleaned regularly? Because I have data from
   December 2014...
 
 
  2015-02-26 1:40 GMT+01:00 Liam Slusser lslus...@gmail.com:
 
   I'm having this same problem.  I had replication enabled but have since
   been disabled.  However oldWALs still grows.  There are so many files
 in
   there that running hadoop fs -ls /hbase/oldWALs runs out of memory.
  
   On Wed, Feb 25, 2015 at 9:27 AM, Nishanth S nishanth.2...@gmail.com
   wrote:
  
Do you have replication turned on in hbase and  if so is your slave
 consuming the replicated data?.
   
-Nishanth
   
On Wed, Feb 25, 2015 at 10:19 AM, Madeleine Piffaretti 
mpiffare...@powerspace.com wrote:
   
 Hi all,

  We are running out of space in our small hadoop cluster so I was checking
  disk usage on HDFS and I saw that most of the space was occupied by the
  /hbase/oldWALs folder.

  I have checked in the HBase Definitive Book, other books, and web-sites,
  and I have also searched for my issue on Google, but I didn't find a proper
  response...

  So I would like to know what this folder is, what it is used for, and also
  how I can free space from it without breaking everything...


 If it's related to a specific version... our cluster is under
 5.3.0-1.cdh5.3.0.p0.30 from cloudera (hbase 0.98.6).

 Thx for your help!