Re: Trying to find the problem with a broken pipe

Anthony Ikeda Mon, 08 Aug 2011 18:12:29 -0700

Tim do you know if this is the actual reason that is causing the broken
pipe? I'm having a hard time convincing my team that modifying this value
will fix the issue.


Jonathan, do you know if there is a valid explanation on why Tim no longer
has the problem based on this change?

Anthony Ikeda



On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder <t...@proinnovations.com> wrote:

> I no longer get the error on the loader program. The steps I took to fix
> it are increasing the thrift_max_message_length_in_mb msg length,
> stopping cassandra, blowing away the prior data store, and then
> restarting cassandra.
>
> Tim
>
>
> -------- Original Message --------
> Subject: Re: Trying to find the problem with a broken pipe
> From: aaron morton <aa...@thelastpickle.com>
> Date: Fri, August 05, 2011 12:58 am
> To: user@cassandra.apache.org
>
> It's probably a network thing.
>
> The only thing I can think of in cassandra is
> thrift_max_message_length_in_mb in the config. That config setting will
> result in a TException thrown on the server side (i think), not sure if
> that makes the server kill the socket. I would hope the error returns to
> the client.
>
> Perhaps check the server log.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 4 Aug 2011, at 23:05, Tim Snyder wrote:
>
> > I am getting the same problem (Broken Pipe) on a loader program, after
> > about 8 million read, write pairs. I am pushing serialized objects into
> > a column with the program, the object it seems to be doing it on is much
> > larger than the prior objects, so I am wondering if it is possibly a
> > column size streaming issue through the thrift api? I am using Cassandra
> > 0.8.0 and Hector 0.8.0-1
> >
> > Tim
> >
> > -------- Original Message --------
> > Subject: Re: Trying to find the problem with a broken pipe
> > From: Anthony Ikeda <anthony.ikeda....@gmail.com>
> > Date: Tue, August 02, 2011 10:43 pm
> > To: user@cassandra.apache.org
> >
> >> Very interesting. After the second host goes down do you see
> >> "me.prettyprint.hector.api.exceptions.HectorException: All host pools
> >> marked down. Retry burden pushed out to client"?
> >
> > No, the last message is:
> > 2011-08-02 08:43:06,561 INFO
> > [me.prettyprint.cassandra.connection.HConnectionManager] - Client
> > CassandraClient<cassandradevrk2:9393-49> released to inactive or dead
> > pool. Closing.> Does your client recover after a period of time?
> >
> >
> >
> > The application seems to be fine for now but my concern is the
> > connection pooling as well - I mean do we have one pool or multiple?
> > I'll post to the Hector user group about the pooling because the
> > incident seems so isolated. We also have our infrastructure team looking
> > into the communication between the application server and the cassandra
> > nodes.
> >
> >
> > So far it's still a mystery.
> >
> >
> >
> >
> >
> > On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com> wrote:
> > On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda
> > <anthony.ikeda....@gmail.com> wrote:
> >
> >> The link (which I may be misreading)
> >> is
> >
> http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e
> >
> >
> > I hadn't found that one, but I doubt that our issue is related to that.
> >
> >
> >> It's only started happening today and happened on 2 occassions (8:43
> > and
> >> 10:21) performing the same function (querying a column family).
> >> It seems to be trying to access a connection on one of the servers
> >> The client accesses the first node:
> >>
> >> 2011-08-02 08:43:06,541 ERROR
> >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
> >> transport (to be expected if the pool is shutting down) in close for
> > client:
> >> CassandraClient<cassandradevrk1:9393-33>
> >> org.apache.thrift.transport.TTransportException:
> > java.net.SocketException:
> >> Broken pipe
> >>
> >> ...
> >> 2011-08-02 08:43:06,544 WARN
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
> >> fullfill request on this host
> > CassandraClient<cassandradevrk1:9393-33>
> >>
> >> ...
> >>
> >> 2011-08-02 08:43:06,543 ERROR
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
> > AS DOWN
> >> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
> >> 2011-08-02 08:43:06,543 ERROR
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
> > on
> >> shutdown:
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
> >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
> > 49
> >> 2011-08-02 08:43:06,543 ERROR
> >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >> triggered on
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
> >> 2011-08-02 08:43:06,544 ERROR
> >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >> complete on
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
> >> 2011-08-02 08:43:06,544 INFO
> >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> > Host
> >> detected as down was added to retry queue:
> >> cassandradevrk1(10.130.202.34):9393
> >> 2011-08-02 08:43:06,544 WARN
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
> >> fullfill request on this host CassandraClient<
> >> cassandradevrk1:9393-33>
> >> 2011-08-02 08:43:06,544 WARN
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
> >> me.prettyprint.hector.api.exceptions.HectorTransportException:
> >> org.apache.thrift.transport.TTransportException:
> > java.net.SocketException:
> >> Connection reset
> >>
> >>
> >> Then it appears to try the second node and fails:
> >>
> >> 2011-08-02 08:43:06,556 INFO
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Client
> >> CassandraClient<cassandradevrk1:9393-33> released to inactive or dead
> > pool.
> >> Closing.
> >> 2011-08-02 08:43:06,557 ERROR
> >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush
> >> transport (to be expected if the pool is shutting down) in close for
> > client:
> >> CassandraClient<cassandradevrk2:9393-49>
> >>
> >> org.apache.thrift.transport.TTransportException:
> > java.net.SocketException:
> >> Broken pipe
> >>
> >> 2011-08-02 08:43:06,558 ERROR
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST
> > AS DOWN
> >> TRIGGERED for host cassandradevrk2(10.130.202.35):9393
> >> 2011-08-02 08:43:06,559 ERROR
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state
> > on
> >> shutdown:
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393};
> >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted:
> > 49
> >> 2011-08-02 08:43:06,559 ERROR
> >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >> triggered on
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
> >> 2011-08-02 08:43:06,559 ERROR
> >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >> complete on
> >>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}
> >> 2011-08-02 08:43:06,559 INFO
> >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> > Host
> >> detected as down was added to retry queue:
> >> cassandradevrk2(10.130.202.35):9393
> >> 2011-08-02 08:43:06,560 WARN
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not
> >> fullfill request on this host
> > CassandraClient<cassandradevrk2:9393-49>
> >> 2011-08-02 08:43:06,560 WARN
> >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception:
> >> me.prettyprint.hector.api.exceptions.HectorTransportException:
> >> org.apache.thrift.transport.TTransportException:
> > java.net.SocketException:
> >> Connection reset
> >
> >
> >
> > Very interesting. After the second host goes down do you see
> > "me.prettyprint.hector.api.exceptions.HectorException: All host pools
> > marked down. Retry burden pushed out to client"?
> >
> > Does your client recover after a period of time?
> >
> >>
> >> The process is the same at 10:21.
> >> Are the exceptions related to any external events (e.g. node
> >> restarts, network issues...)?
> >> Not that I'm aware, unless there are firewall timeouts between the
> >> application and the node servers. Let me find out. The cassandra log
> > files
> >> have no errors reported.
> >> What versions of Hector and Cassandra are you running?
> >> Cassandra 0.8.1, Hector 0.8.0-1
> >
> >
> > Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan
> > to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive.
> > Are you using that? We're also using tcpdump to capture packets when
> > failures occur to see if there are anomalies in the network traffic.
> >
> > Jim
> >
> >
> >
> >>
> >>
> >>
> >> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com>
> > wrote:
> >>>
> >>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda
> >>> <anthony.ikeda....@gmail.com> wrote:
> >>>> I'm not sure if this is a problem with Hector or with Cassandra.
> >>>> We seem to be seeing broken pipe issues with our connections on
> > the
> >>>> client
> >>>> side (Exception below). A bit of googling finds possibly a problem
> > with
> >>>> the
> >>>> amount of data we are trying to store, although I'm certain our
> > datasets
> >>>> are
> >>>> not all that large.
> >>>
> >>> I'm not sure what you're referring to here. Large requests could
> > lead
> >>> to timeouts, but that's not what you're seeing here. Could you link
> > to
> >>> the page you're referencing?
> >>>
> >>>> A nodetool ring command doesn't seem to present any downed nodes:
> >>>> Address DC Rack Status State Load
> >>>> Owns
> >>>> Token
> >>>>
> >>>> 153951716904446304929228999025275230571
> >>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB
> >>>> 79.19% 118538200848404459763384037192174096102
> >>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB
> >>>> 20.81% 153951716904446304929228999025275230571
> >>>>
> >>>> There are no errors in the cassandra server logs.
> >>>>
> >>>> Are there any particular timeouts on connections that we need to
> > be
> >>>> aware
> >>>> of? Or perhaps configure on the Cassandra nodes? Is this purely
> > and
> >>>> issue
> >>>> with the Hector API configuration?
> >>>
> >>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml)
> >>> and a Hector client-side timeout
> >>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again,
> >>> the "Broken pipe" error is not a timeout, it indicates that
> > something
> >>> happened to the underlying network socket. For example you will see
> >>> those when a server node is restarted.
> >>>
> >>> Some questions that might help troubleshoot this:
> >>> How often are these occurring?
> >>> Does this affect both nodes in the cluster or just one?
> >>> Are the exceptions related to any external events (e.g. node
> > restarts,
> >>> network issues...)?
> >>> What versions of Hector and Cassandra are you running?
> >>>
> >>> Keep in mind that failures like this will normally be retried by
> >>> Hector, resulting in no loss of data. For that reason, I think that
> >>> exception is logged as a warning in the newest Hector versions.
> >>>
> >>> We've seen something similar, but more catastrophic because it
> > affects
> >>> connectivity to the entire cluster, not just a single node. See this
> >>> post for more details: http://goo.gl/hrgkw So far we haven't
> >>> identified the cause.
> >>>
> >>> Jim
> >>>
> >>>> Anthony
> >>>>
> >>>> 2011-08-02 08:43:06,541 ERROR
> >>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not
> > flush
> >>>> transport (to be expected if the pool is shutting down) in close
> > for
> >>>> client:
> >>>> CassandraClient<cassandradevrk1:9393-33>
> >>>> org.apache.thrift.transport.TTransportException:
> >>>> java.net.SocketException:
> >>>> Broken pipe
> >>>> at
> >>>>
> >>>>
> >
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
> >>>> at
> >>>>
> >>>>
> >
> me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48)
> >>>> at
> >>>>
> >>>>
> >
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237)
> >>>> at
> >>>>
> >>>>
> >
> com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277)
> >>>> at
> >>>>
> >>>>
> >
> com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327)
> >>>> at
> >>>>
> >>>>
> >
> com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > Method)
> >>>> at
> >>>>
> >>>>
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>> at
> >>>>
> >>>>
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549)
> >>>> at
> > javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
> >>>> at
> > javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77)
> >>>> at
> >>>>
> >>>>
> >
> org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> >>>> at
> >>>>
> > org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
> >>>> at
> >>>>
> > org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
> >>>> at
> >>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
> >>>> at java.lang.Thread.run(Thread.java:662)
> >>>> Caused by: java.net.SocketException: Broken pipe
> >>>> at java.net.SocketOutputStream.socketWrite0(Native Method)
> >>>> at
> >>>>
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> >>>> at
> > java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >>>> at
> >>>>
> >>>>
> >
> org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
> >>>> ... 47 more
> >>>> 2011-08-02 08:43:06,543 ERROR
> >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK
> > HOST AS
> >>>> DOWN
> >>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393
> >>>> 2011-08-02 08:43:06,543 ERROR
> >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool
> > state on
> >>>> shutdown:
> >>>>
> >>>>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393};
> >>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15;
> > NumBeforeExhausted: 49
> >>>> 2011-08-02 08:43:06,543 ERROR
> >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >>>> triggered on
> >>>>
> >>>>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
> >>>> 2011-08-02 08:43:06,544 ERROR
> >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] -
> > Shutdown
> >>>> complete on
> >>>>
> >>>>
> >
> <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}
> >>>> 2011-08-02 08:43:06,544 INFO
> >>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> > Host
> >>>> detected as down was added to retry queue:
> >>>> cassandradevrk1(10.130.202.34):9393
> >>>> 2011-08-02 08:43:06,544 WARN
> >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could
> > not
> >>>> fullfill request on this host
> > CassandraClient<cassandradevrk1:9393-33>
> >>>>
> >>
> >>
> >
>
>

Re: Trying to find the problem with a broken pipe

Reply via email to