Good idea, let me try it.
J-D
On Wed, Jan 13, 2010 at 11:01 AM, Joydeep Sarma wrote:
i posted on the jira as well - but we should be able to simulate the
effect of the patch.
if the sync was simulated merely a sleep (for 2-3ms - whatever is the
average RTT for dfs write pipeline) instead of an actual call into dfs
client - it should simulate the effect of the patch. (the appends ...)
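A rough sketch of that simulation, with a made-up method name and the 3ms
standing in for the measured pipeline RTT:

// Illustrative stand-in for the real sync: sleep for the average
// dfs write-pipeline round trip instead of calling into the dfs client.
private void simulatedSync() {
  try {
    Thread.sleep(3); // ~2-3ms average pipeline RTT, per the discussion
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // preserve the interrupt flag
  }
}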
Awesome, I will try to post a patch soon and will let you know as soon as I
have the first version ready.
thanks,
dhruba
On Wed, Jan 13, 2010 at 10:40 AM, Jean-Daniel Cryans wrote:
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case.
J-D
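A minimal sketch of such a benchmark - table name, column family, and
client/row counts are all made up, and the 0.20-era client API is assumed:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiClientBench {
  public static void main(String[] args) throws Exception {
    final int clients = 10;
    final int rowsPerClient = 10000;
    Thread[] threads = new Thread[clients];
    long start = System.currentTimeMillis();
    for (int i = 0; i < clients; i++) {
      final int id = i;
      threads[i] = new Thread(new Runnable() {
        public void run() {
          try {
            // one HTable per client thread; all rows land on one region server
            HTable table = new HTable(new HBaseConfiguration(), "bench");
            for (int r = 0; r < rowsPerClient; r++) {
              Put p = new Put(Bytes.toBytes("client" + id + "-row" + r));
              p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
              table.put(p); // each put is a separate commit (and sync)
            }
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) t.join();
    long ms = System.currentTimeMillis() - start;
    System.out.println((clients * rowsPerClient * 1000L) / ms + " rows/sec");
  }
}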
On Wed, Jan 13, 2010 at 10:38 AM, Dhruba Borthakur wrote:
I will try to make a patch for it first. depending on the complexity of the
patch code, we can decide which release it can go in.
thanks,
dhruba
On Wed, Jan 13, 2010 at 9:56 AM, Jean-Daniel Cryans wrote:
That's great dhruba, I guess the soonest it could go in is 0.21.1?
J-D
On Wed, Jan 13, 2010 at 8:51 AM, Dhruba Borthakur wrote:
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
On Tue, Jan 12, 2010 at 9:41 PM, Joydeep Sarma wrote:
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
On Tue, Jan 12, 2010 at 9:16 PM, stack wrote:
On Tue, Jan 12, 2010 at 9:12 PM, stack wrote:
> > any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
> > pending sync. "sync" in HDFS is a pretty heavyweight operation as it
> > stands.
>
> i think this is likely to explain limited throughput with the default
> write queue threshold ...
To: ..., kan...@facebook.com, Dhruba Borthakur <dhr...@facebook.com>
Date: Tue, 12 Jan 2010 15:39:05 -0800
Subject: Re: commit semantics

btw - i followed up with Dhruba afterwards on this comment:
> any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
> pending sync. "sync" ...
On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan wrote:
>
> Seems like we all generally agree that a large number of regions per region
> server may not be the way to go.
>
What Andrew says. You could make regions bigger so more data per
regionserver but the same rough (small) number to redeploy ...
-Original Message-
From: Andrew Purtell [mailto:apurt...@apache.org]
Sent: Tuesday, January 12, 2010 12:50 PM
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

> But would you say having a
> smaller number of regions per region server (say ~50) would be really bad?

Not at all.

There are ...
> -Original Message-
> From: Kannan Muthukkaruppan
> To: "hbase-dev@hadoop.apache.org"
> Sent: Tue, January 12, 2010 11:40:00 AM
> Subject: RE: commit semantics
>
> Btw, is there much gain in having a large number of regions -- i.e. to the
> tune of 500 -- per region server?
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan wrote:
>
> For data integrity, going with group commits (batch commits) seems like a
> good option. My understanding of group commits as implemented in 0.21 is as
> follows: ...
Btw, is there much gain in having a large number of regions -- i.e. to the
tune of 500 -- per region server?

regards,
Kannan

-Original Message-
From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: commit semantics

wrt 1 HLog per region server, this is from the Bigtable paper ...
-Original Message-
From: ... On Behalf Of stack
Sent: Tuesday, January 12, 2010 10:52 AM
To: hbase-dev@hadoop.apache.org
Cc: Kannan Muthukkaruppan; Dhruba Borthakur
Subject: Re: commit semantics
On Tue, Jan 12, 2010 at 10:14 AM, Dhruba Borthakur wrote:
> Hi stack,
>
> I was meaning "what if the application inserted the same record into two
> Hbase instances"? Of course, now the onus is on the appl to keep both of
> them in sync and recover from any inconsistencies between them.
>
>
Ok.
Hi stack,
I was meaning "what if the application inserted the same record into two
Hbase instances"? Of course, now the onus is on the appl to keep both of
them in sync and recover from any inconsistencies between them.
thanks,
dhruba
On Tue, Jan 12, 2010 at 9:58 AM, stack wrote:
On Mon, Jan 11, 2010 at 10:25 PM, Dhruba Borthakur wrote:
> if we want the best of both worlds.. latency as well as data integrity, how
> about inserting the same record into two completely separate HBase tables
> in parallel... the operation can complete as soon as the record is
> inserted into ...
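A sketch of that dual-insert idea - class, method, and instance names below
are invented for illustration; the client unblocks on the first
acknowledged put:

import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DualWrite {
  // Insert the same record into two separate HBase instances in parallel;
  // return once the first insert has been acknowledged.
  static void dualPut(final HTable instanceA, final HTable instanceB,
                      final Put put) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    CompletionService<Void> acks = new ExecutorCompletionService<Void>(pool);
    for (final HTable t : new HTable[] { instanceA, instanceB }) {
      acks.submit(new Callable<Void>() {
        public Void call() throws Exception {
          t.put(put); // same record, independent cluster
          return null;
        }
      });
    }
    acks.take().get(); // unblock on the first completed write
    pool.shutdown();   // the slower write finishes in the background
  }
}

As the thread notes, the onus then falls on the application to detect and
repair any divergence between the two instances.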
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files: with one log per region
instead, 1000 region servers * 500 regions would mean 500 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files ...
On Tue, Jan 12, 2010 at 12:24 AM, Dhruba Borthakur wrote:
Hi Ryan,
thanks for ur response.
> Right now each regionserver has 1 log, so if 2 puts on different
> tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same
record into two different tables on two different Hbase instances on two
different ...
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
There are 2 performance enhancing things in trunk:
- bulk commit - we only call sync() once per RPC, no matter how many
rows are involved. If you use the batch put API you can get real ...
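For illustration, the batch form might look like this - table and family
names are made up, assuming the trunk API where HTable.put takes a List of
Puts:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put p = new Put(Bytes.toBytes("row" + i));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
      batch.add(p);
    }
    table.put(batch); // one RPC, one HLog sync() for all 1000 rows
  }
}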
any IO to a HDFS-file (appends, writes, etc) are actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.
if we want the best of both worlds.. latency as well as data integrity, how
about inserting the same record into two completely separate HBase tables in ...
Inline.
J-D
On Mon, Jan 11, 2010 at 8:12 PM, Joydeep Sarma wrote:
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct.
under high load - some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs (in which case the batching wouldn't quite happen) ...
Hey Joydeep,
This is actually intended this way but the name of the variable is
misleading. The sync is done only if forceSync or we have enough
entries to sync (default is 1). If someone wants to sync only 100
entries for example, they would play with that configuration.
Hope that helps,
J-D
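In code, that check amounts to something like the following - field and
method names are invented for the sketch, not the actual HLog source:

// Group commit: sync the log only when forced or when enough edits have
// accumulated. With the threshold at its default of 1, every commit
// syncs; set it to 100 to sync only once per 100 entries.
private void maybeSync(boolean forceSync) throws IOException {
  if (forceSync || this.unflushedEntries >= this.flushLogEntries) {
    this.writer.sync();
    this.unflushedEntries = 0;
  }
}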
Performance? It's all about performance.
In my own tests, calling sync() in HDFS-0.21 on every single commit
can limit the number of small rows you do to about a max of 1200 a
second. One way to speed things up is to sync less often. Another
way is to sync on a timer instead. Both of these are ...
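A sketch of the timer variant - the writer handle and the 10ms period are
made up; commits stop waiting on sync and a background tick group-syncs
everything appended since the last one:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.fs.FSDataOutputStream;

// Made-up wiring: commits append to the log and return immediately; this
// background tick syncs the write-ahead log every 10ms, so the worst case
// is losing the last ~10ms of acknowledged-but-unsynced edits.
static void startTimerSync(final FSDataOutputStream logWriter) {
  ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();
  syncer.scheduleAtFixedRate(new Runnable() {
    public void run() {
      try {
        logWriter.sync(); // group-commits everything appended since last tick
      } catch (IOException e) {
        // a real implementation would surface this to the region server
      }
    }
  }, 0, 10, TimeUnit.MILLISECONDS);
}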