Ryan previously said that there must be something interesting going on
in the region server logs, and by looking at your code I'm convinced
that you will indeed find an answer to your slowness. Do look at them!
He also talked about the single RPC connection per client JVM, which is
why multiple JVMs should be faster. And seeing only one ZK connection is
fine, since there's almost no traffic going over it.
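
Just to make that concrete, here is a minimal per-thread sketch (the
table name is a placeholder); every HTable created this way still rides
on the JVM's single ZooKeeper session and RPC connection underneath:

    // one HTable per thread (HTable isn't thread-safe), but all instances
    // in the JVM share the same ZooKeeper session and RPC connection
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "testtable");  // "testtable" is a placeholder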

So each value that you insert is exactly 3.4k, which is much higher
than what the default configurations are tuned for. I bet you will see
a lot of log rolling happening, flushes, compactions, splits, etc.
Those all take time, and when you hit one of the blocking thresholds
that are in place, the clients are stopped from inserting until the
condition clears.

See in these slides the settings we used for our initial import at
StumbleUpon: http://people.apache.org/~jdcryans/HUG8/HUG8-rawson.pdf.
The blocking store files and memstore block multiplier are important
if you have machines that can support it (also don't forget to give
more heap, you won't achieve anything with 1GB). And with such big
values, setting hbase.regionserver.hlog.blocksize to something higher
than 64MB probably makes a lot of sense. And maybe set your
MAX_FILESIZE on your table to something bigger than 256MB.
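
To be concrete, the knobs I mean are region-server-side settings, so
they go in hbase-site.xml on the region servers (restart needed); the
values below are only examples, size them to what your machines can take:

    hbase.hstore.blockingStoreFiles          15         (default 7)
    hbase.hregion.memstore.block.multiplier   8         (default 2)
    hbase.regionserver.hlog.blocksize         134217728 (128MB instead of 64MB)

plus HBASE_HEAPSIZE in hbase-env.sh well above the 1GB default. For the
table itself, something like this (hypothetical table name) bumps the
split size; apply it when creating the table or with an alter:

    HTableDescriptor desc = new HTableDescriptor("testtable");
    desc.setMaxFileSize(1024L * 1024 * 1024);  // 1GB regions instead of the 256MB default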

Also, in your code I don't see you calling htable.flushCommits(), so you
are probably losing the edits still sitting in the client-side write
buffer when the import finishes.
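
The end of the import should look something like this sketch (the table
name, family, qualifier and row keys are placeholders; the important
line is the flushCommits() after the loop):

    byte[] value = new byte[3480];                   // stand-in for your ~3.4k values
    HTable table = new HTable(conf, "testtable");    // placeholder table name
    table.setAutoFlush(false);                       // what you already do
    table.setWriteBufferSize(12 * 1024 * 1024);      // optional: bigger client-side buffer
    for (int i = 0; i < 1000000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), value);
      table.put(put);                                // only fills the buffer, nothing is sent yet
    }
    table.flushCommits();                            // sends whatever is still buffered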

Finally, what are you trying to do? Your initial data import? If so,
there are better solutions like HFileOutputFormat (sketched below). Or
are you testing the max import speed you can get? Then you should
probably let your table "warm up" by running your script once first to
create more than 100 regions, so you get better load distribution;
otherwise the first phase of the upload hits too few regions.
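
For the bulk-load route, the job wiring is roughly the skeleton below;
MyImportMapper and the output path are made up, the mapper has to emit
(ImmutableBytesWritable, KeyValue) pairs that reach the output in sorted
row order (single reducer or a total-order partitioner), and if I
remember right 0.20 ships bin/loadtable.rb to move the generated files
into the table afterwards:

    Job job = new Job(conf, "bulk import");
    job.setMapperClass(MyImportMapper.class);              // hypothetical mapper
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setOutputFormatClass(HFileOutputFormat.class);     // writes HFiles instead of doing puts
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));  // placeholder path
    job.waitForCompletion(true);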

J-D

On Thu, Jun 17, 2010 at 4:51 PM, Sujee Maniyam <[email protected]> wrote:
> Following up on this:
>
> Here is my sample code to reproduce the issue:
> http://pastebin.com/vTX8Pu7c
>
> I am importing data from a single JVM, using multiple threads (10).
>
> Each thread creates its own instance of HTable.  But I see only one 'ZooKeeper
> connection' in the output.  Is that right?
>
> For the same import code, my throughput is cut by a factor of 4, going from
> 0.20.3 to 0.20.4.  The current write speed is a bit slow for our needs.
>
> 1) Are there any parameters I can tweak?  I have already disabled 'auto
> flush'.
> 2) If multi-threaded writes aren't going to be effective, should I consider
> running multiple JVM processes?
>
> thanks
> Sujee
>
> http://sujee.net
>
>
> On Thu, Jun 10, 2010 at 3:41 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Also 0.20.4 has the ExplicitColumnTracker that spins in an infinite
>> loop in some situations.
>>
>> J-D
>>
>> On Thu, Jun 10, 2010 at 3:38 PM, Ryan Rawson <[email protected]> wrote:
>> > hey,
>> >
>> > so you have discovered a particular 'trick' about how the HBase RPC
>> > works... at the lowest level there is only 1 socket for every thread
>> > to talk to all regionservers.  Thus if you are sending a large amount
>> > of data to HBase you can see this bottlenecking.
>> >
>> > It is highly likely there might be something interesting in the
>> > HRegionServer logs, perhaps the regionserver is blocking because it's
>> > trying to keep from being overrun (we ship with very conservative
>> > defaults).  There was a recent thread about this too... the thread was
>> > titled "ideas to improve throughput of the base writting".
>> >
>> > -ryan
>> >
>> >
>> > On Thu, Jun 10, 2010 at 3:17 PM, Sujee Maniyam <[email protected]> wrote:
>> >> forgot to mention that I am using hbase 0.20.4
>> >>
>> >
>>
>
