Hi Friso,

I think I've identified the issue. As you suspected, the LZO code was
unnecessarily allocating a lot of native byte buffers where it previously
didn't.

I just pushed a fix to my LZO repository and bumped the version number to
0.4.7.
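
Roughly, the shape of the fix is buffer reuse: only reallocate the native
scratch buffer when it's missing or too small, instead of grabbing a fresh
one on every call. A minimal sketch of that pattern, with hypothetical names
(this is an illustration, not the actual patch):

  import java.nio.ByteBuffer;

  class ReusableBufferHolder {
    private ByteBuffer workBuf;  // hypothetical per-codec scratch buffer

    ByteBuffer getBuffer(int capacity) {
      // (re)allocate only when the current buffer is absent or too small;
      // otherwise reuse it, so native memory stays bounded per codec
      if (workBuf == null || workBuf.capacity() < capacity) {
        workBuf = ByteBuffer.allocateDirect(capacity);
      }
      workBuf.clear();
      return workBuf;
    }
  }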

If you have a chance to test this in a dev environment, that would be great.
I will try to test it myself this week (unfortunately, I haven't been able
to reproduce the issue yet).

Thanks
-Todd

On Fri, Nov 12, 2010 at 4:09 PM, Todd Lipcon <[email protected]> wrote:

> Hey Friso,
>
> Thanks so much for the details. I am starting to suspect it could indeed be
> a codec leak - especially since you have some cells in the MB range; maybe
> it's expanding some buffers to 64MB.
>
> Let me try to do some tests to reproduce it here in the next week or so.
>
> Has anyone else seen this issue?
>
> Thanks
> -Todd
>
> On Fri, Nov 12, 2010 at 1:19 AM, Friso van Vollenhoven <
> [email protected]> wrote:
>
>> Hi Todd,
>>
>> I am afraid I no longer have the broken setup around, because we really
>> need a working one right now. We need to demo at a conference next week,
>> and until after that all changes are frozen on both dev and prod (so we
>> can use dev as a fallback). Later on, I could maybe try some more things
>> on our dev boxes.
>>
>> If you are doing a repro, here's the stuff you'd probably want to know:
>> The workload is write-only. No reads happening at the same time. No other
>> active clients. It is an initial import of data. We do insertions in a MR
>> job from the reducers. The total volume is about 11 billion puts across
>> roughly 450K rows per table (we have a many-columns-per-row data model)
>> across 15 tables, all using LZO. Qualifiers are some 50 bytes. Values
>> generally range from a few KB, up to MBs in rare cases. The row keys have
>> a time-related part at the start, so I know the keyspace in advance and
>> can create the empty tables with pre-created regions (40 regions) across
>> the keyspace to get decent distribution from the start of the job (see
>> the sketch below). In order not to overload HBase, I run the job with
>> only 15 reducers, so at most 15 concurrent clients are active. Other
>> settings: max file size is 1GB, HFile block size is the default 64K,
>> client-side buffer is 16M, memstore flush size is 128M, compaction
>> threshold is 5, blocking store files is 9, memstore upper limit is 20%,
>> lower limit 15%, block cache 40%. During the run, the RSes never report
>> more than 5GB of heap usage in the UI, which makes sense, because the
>> block cache is not touched. On a healthy run with somewhat conservative
>> settings right now, HBase reports on average about 380K requests per
>> second in the master UI.
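>>
>> For illustration, the pre-split table creation looks roughly like this
>> (hypothetical table/family names and key format; the createTable overload
>> taking split keys is the 0.90-era API and may differ per version):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>   import org.apache.hadoop.hbase.HColumnDescriptor;
>>   import org.apache.hadoop.hbase.HTableDescriptor;
>>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>>   import org.apache.hadoop.hbase.io.hfile.Compression;
>>   import org.apache.hadoop.hbase.util.Bytes;
>>
>>   public class PreSplitTable {
>>     public static void main(String[] args) throws Exception {
>>       // create an empty LZO table pre-split into 40 regions over a
>>       // known, time-prefixed keyspace (key format is illustrative only)
>>       Configuration conf = HBaseConfiguration.create();
>>       HBaseAdmin admin = new HBaseAdmin(conf);
>>       HTableDescriptor desc = new HTableDescriptor("mytable");
>>       HColumnDescriptor fam = new HColumnDescriptor("d");
>>       fam.setCompressionType(Compression.Algorithm.LZO);
>>       desc.addFamily(fam);
>>       int numRegions = 40;
>>       byte[][] splits = new byte[numRegions - 1][];  // 39 split points
>>       for (int i = 1; i < numRegions; i++) {
>>         long point = i * (0xFFFFFFFFL / numRegions);  // even spread
>>         splits[i - 1] = Bytes.toBytes(String.format("%08x", point));
>>       }
>>       admin.createTable(desc, splits);
>>     }
>>   }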
>>
>> The cluster has 8 workers, each running a TT, DN, RS and another JVM
>> process for our own software that sits in front of HBase. Workers are
>> dual quad cores with 64GB RAM and 10x 600GB disks (we decided to scale
>> the number of seeks we can do concurrently). Disks are quite fast: 10K
>> RPM. MR task JVMs get 1GB of heap each, as do the TT and DN. The RS gets
>> 16GB of heap, and so does our own software. We run 8 mappers and 4
>> reducers per node. So at the absolute max, we should have 46GB of
>> allocated heap. That leaves 18GB for JVM overhead, native allocations and
>> the OS. We run Linux 2.6.18-194.11.4.el5. I think it is CentOS, but I
>> didn't do the installs myself.
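>>
>> To spell out that heap arithmetic per worker node:
>>
>>   8 mappers  x 1GB  =  8GB
>>   4 reducers x 1GB  =  4GB
>>   TT + DN    x 1GB  =  2GB
>>   RS                = 16GB
>>   our own software  = 16GB
>>   ------------------------
>>   total             = 46GB  (of 64GB, leaving 18GB)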
>>
>> I tried numerous different settings, both more extreme and more
>> conservative, to get the thing working, but in the end it always ended up
>> swapping. I should have tried a run without LZO, of course, but I was out
>> of time by then.
>>
>>
>>
>> Cheers,
>> Friso
>>
>>
>>
>> On 12 Nov 2010, at 07:06, Todd Lipcon wrote:
>>
>> > Hrm, any chance you can run with a smaller heap and get a jmap dump? The
>> > Eclipse MAT tool is also super nice for looking at this stuff, if indeed
>> > they are Java objects.
>> >
>> > What kind of workload are you using? Read mostly? Write mostly? Mixed? I
>> > will try to repro.
>> >
>> > -Todd
>> >
>> > On Thu, Nov 11, 2010 at 8:41 PM, Friso van Vollenhoven <
>> > [email protected]> wrote:
>> >
>> >> I figured the same. I also did a run with CMS instead of G1. Same
>> >> results.
>> >>
>> >> I also did a run with the RS heap tuned down to 12GB and 8GB, but given
>> >> enough time the process still grew to over 40GB in size.
>> >>
>> >>
>> >> Friso
>> >>
>> >>
>> >>
>> >> On 12 Nov 2010, at 01:55, Todd Lipcon wrote:
>> >>
>> >>> Can you try running this with CMS GC instead of G1GC? G1 still has
>> >>> some bugs... 64M sounds like it might be G1 "regions"?
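>> >>>
>> >>> (One way to check the region-size theory: pin the region size with a
>> >>> flag and see whether the 64M blocks track it. In later JDKs the flag
>> >>> is -XX:G1HeapRegionSize; I am not certain it exists in 6u21:
>> >>>
>> >>>   java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
>> >>>        -XX:G1HeapRegionSize=32m ...
>> >>> )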
>> >>>
>> >>> -Todd
>> >>>
>> >>> On Thu, Nov 11, 2010 at 2:07 AM, Friso van Vollenhoven <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Hi All,
>> >>>>
>> >>>> (This is all about CDH3, so I am not sure whether it should go on
>> >>>> this list, but I figure it is at least interesting for people trying
>> >>>> the same thing.)
>> >>>>
>> >>>> I've recently tried CDH3 on a new cluster from RPMs, with the
>> >>>> hadoop-lzo fork from https://github.com/toddlipcon/hadoop-lzo.
>> >>>> Everything works like a charm initially, but after some time (minutes
>> >>>> to at most an hour), the RS JVM process memory grows to more than
>> >>>> twice the configured heap size. I have seen a RS with a 16GB heap
>> >>>> grow to 55GB of virtual size. At some point everything starts
>> >>>> swapping, GC times go into the minutes, and everything dies or is
>> >>>> considered dead by the master.
>> >>>>
>> >>>> I did a pmap -x on the RS process, and that shows a lot of allocated
>> >>>> blocks of about 64M each. There are about 500 of these, which is 32GB
>> >>>> in total. See: http://pastebin.com/8pgzPf7b (bottom of the file; the
>> >>>> blocks of about 1M at the top are probably thread stacks).
>> >>>> Unfortunately, Linux shows the native heap as anon blocks, so I
>> >>>> cannot link it to a specific lib or anything.
>> >>>>
>> >>>> I am running the latest CDH3 and hadoop-lzo 0.4.6 (from said URL,
>> >>>> the one which has the reinit() support). I run Java 6u21 with the G1
>> >>>> garbage collector, which has been running fine for some weeks now.
>> >>>> Full command line is:
>> >>>> java -Xmx16000m -XX:+HeapDumpOnOutOfMemoryError
>> >>>> -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+UseCompressedOops
>> >>>> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>> >>>> -Xloggc:/export/logs/hbase/gc-hbase.log
>> >>>> -Djava.library.path=/home/inr/java-lib/hbase/native/Linux-amd64-64
>> >>>> -Djava.net.preferIPv4Stack=true -Dhbase.log.dir=/export/logs/hbase
>> >>>> -Dhbase.log.file=hbase-hbase-regionserver-w3r1.inrdb.ripe.net.log
>> >>>> -Dhbase.home.dir=/usr/lib/hbase/bin/.. -Dhbase.id.str=hbase -Dhbase.r
>> >>>>
>> >>>> I searched the HBase source for something that could point to native
>> >>>> heap usage (like ByteBuffer#allocateDirect(...)), but I could not
>> >>>> find anything. Thread count is about 185 (I have 100 handlers), so
>> >>>> nothing strange there either.
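>> >>>>
>> >>>> (For what it's worth, even if it were allocateDirect, direct buffers
>> >>>> are capped by the JVM's -XX:MaxDirectMemorySize, whereas native
>> >>>> malloc done inside a JNI library, like an LZO codec, is invisible to
>> >>>> the JVM and unbounded. A direct allocation would look like this
>> >>>> (illustrative class, not from HBase):
>> >>>>
>> >>>>   import java.nio.ByteBuffer;
>> >>>>   public class DirectAlloc {
>> >>>>     public static void main(String[] args) {
>> >>>>       // allocated outside the Java heap (-Xmx), but accounted for
>> >>>>       // against -XX:MaxDirectMemorySize by the JVM
>> >>>>       ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024);
>> >>>>       System.out.println(buf.isDirect() + " " + buf.capacity());
>> >>>>     }
>> >>>>   }
>> >>>>
>> >>>> so unbounded growth like this points more at JNI-side allocation.)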
>> >>>>
>> >>>> The question is: could this be HBase, or is this a problem with
>> >>>> hadoop-lzo?
>> >>>>
>> >>>> I have currently downgraded to a version known to work, because we
>> >>>> have a demo coming up, but I am still interested in the answer.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>> Friso
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Todd Lipcon
>> >>> Software Engineer, Cloudera
>> >>
>> >>
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
