Can you enable verboseGC and look at the tenuring distribution and times for GC?
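[Editor's note: the verbose GC output and tenuring distribution Ted asks for can be enabled with HotSpot flags along these lines. This is a hedged sketch for the Java 6/7/8-era collectors named later in the thread (PS MarkSweep / PS Scavenge); the log path and jar name are placeholders, not from the thread.]

```
java -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCTimeStamps \
     -XX:+PrintTenuringDistribution \
     -Xloggc:/var/log/zookeeper/gc.log \
     -jar zookeeper-server.jar
```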
On Tue, Sep 1, 2009 at 5:54 PM, Satish Bhatti <cthd2...@gmail.com> wrote:

Parallel/Serial.

inf...@domu-12-31-39-06-3d-d1:/opt/ir/agent/infact-installs/aaa/infact$ iostat
Linux 2.6.18-xenU-ec2-v1.0 (domU-12-31-39-06-3D-D1)   09/01/2009   _x86_64_

avg-cpu:  %user   %nice  %system  %iowait  %steal   %idle
          66.11    0.00     1.54     2.96   20.30    9.08

Device:     tps   Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda2     460.83       410.02     12458.18   40499322  1230554928
sdc        0.00         0.00         0.00         96           0
sda1       0.53         5.01         4.89     495338      482592

On Tue, Sep 1, 2009 at 5:46 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote:

Hi Satish,

What GC are you using? Is it ConcurrentMarkSweep or Parallel/Serial?

Also, how is your disk usage on this machine? Can you check your iostat numbers?

Thanks
mahadev

On 9/1/09 5:15 PM, "Satish Bhatti" <cthd2...@gmail.com> wrote:

GC time: 11.628 seconds on PS MarkSweep (389 collections), 5 minutes on PS Scavenge (7,636 collections).

It's been running for about 48 hours.

On Tue, Sep 1, 2009 at 5:12 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Do you have long GC delays?

On Tue, Sep 1, 2009 at 4:51 PM, Satish Bhatti <cthd2...@gmail.com> wrote:

Session timeout is 30 seconds.

On Tue, Sep 1, 2009 at 4:26 PM, Patrick Hunt <ph...@apache.org> wrote:

What is your client timeout? It may be too low.

Also see this section on handling recoverable errors:
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

Connection loss in particular needs special care since: "When a ZooKeeper client loses a connection to the ZooKeeper server there may be some requests in flight; we don't know where they were in their flight at the time of the connection loss."

Patrick

Satish Bhatti wrote:

I have recently started running on EC2 and am seeing quite a few ConnectionLoss exceptions. Should I just catch these and retry? Since I assume that eventually, if the shit truly hits the fan, I will get a SessionExpired?

Satish

On Mon, Jul 6, 2009 at 11:35 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

We have used EC2 quite a bit for ZK.

The basic lessons that I have learned include:

a) EC2's biggest advantage after scaling and elasticity was conformity of configuration. Since you are bringing machines up and down all the time, they begin to act more like programs, and you wind up with boot scripts that give you a very predictable environment. Nice.

b) EC2 interconnect has a lot more going on than in a dedicated VLAN. That can make the ZK servers appear a bit less connected. You have to plan for ConnectionLoss events.

c) For highest reliability, I switched to large instances. On reflection, I think that was helpful, but less important than I thought at the time.

d) Increasing and decreasing cluster size is nearly painless and is easily scriptable. To decrease, do a rolling update on the survivors to update their configuration, then take down the instance you want to lose. To increase, do a rolling update starting with the new instances to update the configuration to include all of the machines. The rolling update should bounce each ZK with several seconds between each bounce. Rescaling the cluster takes less than a minute, which makes it comparable to EC2 instance boot time (about 30 seconds for the Alestic Ubuntu instance that we used, plus about 20 seconds for additional configuration).

On Mon, Jul 6, 2009 at 4:45 AM, David Graf <david.g...@28msec.com> wrote:

Hello

I want to set up a ZooKeeper ensemble on Amazon's EC2 service. In my system, ZooKeeper is used to run a locking service and to generate unique IDs. Currently, for testing purposes, I am only running one instance. Now I need to set up an ensemble to protect my system against crashes.

The EC2 service has some differences from a normal server farm, e.g. the data saved on the file system of an EC2 instance is lost if the instance crashes. In the ZooKeeper documentation, I have read that ZooKeeper saves snapshots of the in-memory data to the file system. Is that needed for recovery? Logically, it would be much easier for me if this were not the case.

Additionally, EC2 brings the advantage that servers can be switched on and off dynamically depending on load, traffic, etc. Can this advantage be utilized for a ZooKeeper ensemble? Is it possible to add a ZooKeeper server to an ensemble dynamically, e.g. depending on the in-memory load?

David

--
Ted Dunning, CTO
DeepDyve
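[Editor's note: Patrick's distinction above — ConnectionLoss is recoverable and retryable, SessionExpired is not — can be sketched as a bounded retry loop. This is a minimal sketch, not the thread's actual code: `ConnectionLossException` here is a local stand-in for `org.apache.zookeeper.KeeperException.ConnectionLossException` so the example runs without a live ensemble, and the attempt/backoff numbers are illustrative. Per Patrick's "requests in flight" caveat, a retried write may have already been applied server-side, so real callers must make operations idempotent or re-check state after reconnecting.]

```java
import java.util.concurrent.Callable;

public class RetryExample {
    // Stand-in for org.apache.zookeeper.KeeperException.ConnectionLossException,
    // used so this sketch runs without a live ZooKeeper server.
    static class ConnectionLossException extends Exception {}

    // Retry an operation on connection loss, with a bounded number of attempts
    // and a fixed backoff. Any other exception (e.g. a SessionExpired analogue)
    // is NOT retried: an expired session's ephemeral nodes and watches are gone,
    // so the client must rebuild its state rather than blindly retry.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                if (attempt >= maxAttempts) throw e;  // give up, surface the loss
                Thread.sleep(backoffMs);              // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated operation: fails twice with connection loss, then succeeds.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new ConnectionLossException();
            return "created /locks/lock-0001";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note the asymmetry this encodes: ConnectionLoss means "the session may still be alive on another server", so retrying (with the idempotency caveat) is reasonable; SessionExpired means the ensemble has already discarded the session, which is why it falls through uncaught here.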