Tim do you know if this is the actual reason that is causing the broken pipe? I'm having a hard time convincing my team that modifying this value will fix the issue.
Jonathan, do you know if there is a valid explanation on why Tim no longer has the problem based on this change? Anthony Ikeda On Thu, Aug 4, 2011 at 11:13 PM, Tim Snyder <t...@proinnovations.com> wrote: > I no longer get the error on the loader program. The steps I took to fix > it are increasing the thrift_max_message_length_in_mb msg length, > stopping cassandra, blowing away the prior data store, and then > restarting cassandra. > > Tim > > > -------- Original Message -------- > Subject: Re: Trying to find the problem with a broken pipe > From: aaron morton <aa...@thelastpickle.com> > Date: Fri, August 05, 2011 12:58 am > To: user@cassandra.apache.org > > It's probably a network thing. > > The only thing I can think of in cassandra is > thrift_max_message_length_in_mb in the config. That config setting will > result in a TException thrown on the server side (i think), not sure if > that makes the server kill the socket. I would hope the error returns to > the client. > > Perhaps check the server log. > > Cheers > > ----------------- > Aaron Morton > Freelance Cassandra Developer > @aaronmorton > http://www.thelastpickle.com > > On 4 Aug 2011, at 23:05, Tim Snyder wrote: > > > I am getting the same problem (Broken Pipe) on a loader program, after > > about 8 million read, write pairs. I am pushing serialized objects into > > a column with the program, the object it seems to be doing it on is much > > larger than the prior objects, so I am wondering if it is possibly a > > column size streaming issue through the thrift api? I am using Cassandra > > 0.8.0 and Hector 0.8.0-1 > > > > Tim > > > > -------- Original Message -------- > > Subject: Re: Trying to find the problem with a broken pipe > > From: Anthony Ikeda <anthony.ikeda....@gmail.com> > > Date: Tue, August 02, 2011 10:43 pm > > To: user@cassandra.apache.org > > > >> Very interesting. After the second host goes down do you see > >> "me.prettyprint.hector.api.exceptions.HectorException: All host pools > >> marked down. Retry burden pushed out to client"? > > > > No, the last message is: > > 2011-08-02 08:43:06,561 INFO > > [me.prettyprint.cassandra.connection.HConnectionManager] - Client > > CassandraClient<cassandradevrk2:9393-49> released to inactive or dead > > pool. Closing.> Does your client recover after a period of time? > > > > > > > > The application seems to be fine for now but my concern is the > > connection pooling as well - I mean do we have one pool or multiple? > > I'll post to the Hector user group about the pooling because the > > incident seems so isolated. We also have our infrastructure team looking > > into the communication between the application server and the cassandra > > nodes. > > > > > > So far it's still a mystery. > > > > > > > > > > > > On Tue, Aug 2, 2011 at 1:25 PM, Jim Ancona <j...@anconafamily.com> wrote: > > On Tue, Aug 2, 2011 at 6:13 PM, Anthony Ikeda > > <anthony.ikeda....@gmail.com> wrote: > > > >> The link (which I may be misreading) > >> is > > > http://groups.google.com/group/hector-users/browse_thread/thread/8d7004b6f85a0f2e > > > > > > I hadn't found that one, but I doubt that our issue is related to that. > > > > > >> It's only started happening today and happened on 2 occassions (8:43 > > and > >> 10:21) performing the same function (querying a column family). > >> It seems to be trying to access a connection on one of the servers > >> The client accesses the first node: > >> > >> 2011-08-02 08:43:06,541 ERROR > >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush > >> transport (to be expected if the pool is shutting down) in close for > > client: > >> CassandraClient<cassandradevrk1:9393-33> > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Broken pipe > >> > >> ... > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host > > CassandraClient<cassandradevrk1:9393-33> > >> > >> ... > >> > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > > AS DOWN > >> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > > on > >> shutdown: > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}; > >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > > 49 > >> 2011-08-02 08:43:06,543 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> triggered on > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} > >> 2011-08-02 08:43:06,544 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> complete on > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} > >> 2011-08-02 08:43:06,544 INFO > >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >> detected as down was added to retry queue: > >> cassandradevrk1(10.130.202.34):9393 > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host CassandraClient< > >> cassandradevrk1:9393-33> > >> 2011-08-02 08:43:06,544 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: > >> me.prettyprint.hector.api.exceptions.HectorTransportException: > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Connection reset > >> > >> > >> Then it appears to try the second node and fails: > >> > >> 2011-08-02 08:43:06,556 INFO > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Client > >> CassandraClient<cassandradevrk1:9393-33> released to inactive or dead > > pool. > >> Closing. > >> 2011-08-02 08:43:06,557 ERROR > >> [me.prettyprint.cassandra.connection.HThriftClient] - Could not flush > >> transport (to be expected if the pool is shutting down) in close for > > client: > >> CassandraClient<cassandradevrk2:9393-49> > >> > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Broken pipe > >> > >> 2011-08-02 08:43:06,558 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK HOST > > AS DOWN > >> TRIGGERED for host cassandradevrk2(10.130.202.35):9393 > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool state > > on > >> shutdown: > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393}; > >> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; NumBeforeExhausted: > > 49 > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> triggered on > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393} > >> 2011-08-02 08:43:06,559 ERROR > >> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >> complete on > >> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk2(10.130.202.35):9393} > >> 2011-08-02 08:43:06,559 INFO > >> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >> detected as down was added to retry queue: > >> cassandradevrk2(10.130.202.35):9393 > >> 2011-08-02 08:43:06,560 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Could not > >> fullfill request on this host > > CassandraClient<cassandradevrk2:9393-49> > >> 2011-08-02 08:43:06,560 WARN > >> [me.prettyprint.cassandra.connection.HConnectionManager] - Exception: > >> me.prettyprint.hector.api.exceptions.HectorTransportException: > >> org.apache.thrift.transport.TTransportException: > > java.net.SocketException: > >> Connection reset > > > > > > > > Very interesting. After the second host goes down do you see > > "me.prettyprint.hector.api.exceptions.HectorException: All host pools > > marked down. Retry burden pushed out to client"? > > > > Does your client recover after a period of time? > > > >> > >> The process is the same at 10:21. > >> Are the exceptions related to any external events (e.g. node > >> restarts, network issues...)? > >> Not that I'm aware, unless there are firewall timeouts between the > >> application and the node servers. Let me find out. The cassandra log > > files > >> have no errors reported. > >> What versions of Hector and Cassandra are you running? > >> Cassandra 0.8.1, Hector 0.8.0-1 > > > > > > Our issue is occurring with Cassandra 0.7.8 and Hector 0.7-30. We plan > > to deploy Hector 0.7-31 this week and to turn on useSocketKeepalive. > > Are you using that? We're also using tcpdump to capture packets when > > failures occur to see if there are anomalies in the network traffic. > > > > Jim > > > > > > > >> > >> > >> > >> On Tue, Aug 2, 2011 at 10:37 AM, Jim Ancona <j...@anconafamily.com> > > wrote: > >>> > >>> On Tue, Aug 2, 2011 at 4:36 PM, Anthony Ikeda > >>> <anthony.ikeda....@gmail.com> wrote: > >>>> I'm not sure if this is a problem with Hector or with Cassandra. > >>>> We seem to be seeing broken pipe issues with our connections on > > the > >>>> client > >>>> side (Exception below). A bit of googling finds possibly a problem > > with > >>>> the > >>>> amount of data we are trying to store, although I'm certain our > > datasets > >>>> are > >>>> not all that large. > >>> > >>> I'm not sure what you're referring to here. Large requests could > > lead > >>> to timeouts, but that's not what you're seeing here. Could you link > > to > >>> the page you're referencing? > >>> > >>>> A nodetool ring command doesn't seem to present any downed nodes: > >>>> Address DC Rack Status State Load > >>>> Owns > >>>> Token > >>>> > >>>> 153951716904446304929228999025275230571 > >>>> 10.130.202.34 datacenter1 rack1 Up Normal 470.74 KB > >>>> 79.19% 118538200848404459763384037192174096102 > >>>> 10.130.202.35 datacenter1 rack1 Up Normal 483.63 KB > >>>> 20.81% 153951716904446304929228999025275230571 > >>>> > >>>> There are no errors in the cassandra server logs. > >>>> > >>>> Are there any particular timeouts on connections that we need to > > be > >>>> aware > >>>> of? Or perhaps configure on the Cassandra nodes? Is this purely > > and > >>>> issue > >>>> with the Hector API configuration? > >>> > >>> There is a server side timeout (rpc_timeout_in_ms in cassandra.yaml) > >>> and a Hector client-side timeout > >>> (CassandraHostConfigurator.cassandraThriftSocketTimeout). But again, > >>> the "Broken pipe" error is not a timeout, it indicates that > > something > >>> happened to the underlying network socket. For example you will see > >>> those when a server node is restarted. > >>> > >>> Some questions that might help troubleshoot this: > >>> How often are these occurring? > >>> Does this affect both nodes in the cluster or just one? > >>> Are the exceptions related to any external events (e.g. node > > restarts, > >>> network issues...)? > >>> What versions of Hector and Cassandra are you running? > >>> > >>> Keep in mind that failures like this will normally be retried by > >>> Hector, resulting in no loss of data. For that reason, I think that > >>> exception is logged as a warning in the newest Hector versions. > >>> > >>> We've seen something similar, but more catastrophic because it > > affects > >>> connectivity to the entire cluster, not just a single node. See this > >>> post for more details: http://goo.gl/hrgkw So far we haven't > >>> identified the cause. > >>> > >>> Jim > >>> > >>>> Anthony > >>>> > >>>> 2011-08-02 08:43:06,541 ERROR > >>>> [me.prettyprint.cassandra.connection.HThriftClient] - Could not > > flush > >>>> transport (to be expected if the pool is shutting down) in close > > for > >>>> client: > >>>> CassandraClient<cassandradevrk1:9393-33> > >>>> org.apache.thrift.transport.TTransportException: > >>>> java.net.SocketException: > >>>> Broken pipe > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.connection.HThriftClient.close(HThriftClient.java:85) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:232) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:289) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85) > >>>> at > >>>> > >>>> > > > me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.findRegistryByProfileId(StandardFindRegistryPersistenceStrategy.java:237) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.persistenceservice.impl.cassandra.strategy.read.StandardFindRegistryPersistenceStrategy.execute(StandardFindRegistryPersistenceStrategy.java:277) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.registryservice.impl.service.StandardRegistryService.getRegistriesByProfileId(StandardRegistryService.java:327) > >>>> at > >>>> > >>>> > > > com.wsgc.services.registry.webapp.impl.RegistryServicesController.getRegistriesByProfileId(RegistryServicesController.java:247) > >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method) > >>>> at > >>>> > >>>> > > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > >>>> at > >>>> > >>>> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >>>> at java.lang.reflect.Method.invoke(Method.java:597) > >>>> at > >>>> > >>>> > > > org.springframework.web.bind.annotation.support.HandlerMethodInvoker.invokeHandlerMethod(HandlerMethodInvoker.java:175) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.invokeHandlerMethod(AnnotationMethodHandlerAdapter.java:421) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter.handle(AnnotationMethodHandlerAdapter.java:409) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:774) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:719) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:644) > >>>> at > >>>> > >>>> > > > org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:549) > >>>> at > > javax.servlet.http.HttpServlet.service(HttpServlet.java:617) > >>>> at > > javax.servlet.http.HttpServlet.service(HttpServlet.java:717) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > >>>> at > >>>> > >>>> > > > org.springframework.web.filter.HiddenHttpMethodFilter.doFilterInternal(HiddenHttpMethodFilter.java:77) > >>>> at > >>>> > >>>> > > > org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) > >>>> at > >>>> > >>>> > > > org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:563) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > >>>> at > >>>> > >>>> > > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) > >>>> at > >>>> > >>>> > > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > >>>> at > >>>> > >>>> > > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) > >>>> at > >>>> > > org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190) > >>>> at > >>>> > > org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291) > >>>> at > >>>> org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:774) > >>>> at > >>>> > >>>> > > > org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703) > >>>> at > >>>> > >>>> > > > org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:896) > >>>> at > >>>> > >>>> > > > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690) > >>>> at java.lang.Thread.run(Thread.java:662) > >>>> Caused by: java.net.SocketException: Broken pipe > >>>> at java.net.SocketOutputStream.socketWrite0(Native Method) > >>>> at > >>>> > > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > >>>> at > > java.net.SocketOutputStream.write(SocketOutputStream.java:136) > >>>> at > >>>> > >>>> > > > org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145) > >>>> ... 47 more > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - MARK > > HOST AS > >>>> DOWN > >>>> TRIGGERED for host cassandradevrk1(10.130.202.34):9393 > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Pool > > state on > >>>> shutdown: > >>>> > >>>> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393}; > >>>> IsActive?: true; Active: 1; Blocked: 0; Idle: 15; > > NumBeforeExhausted: 49 > >>>> 2011-08-02 08:43:06,543 ERROR > >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >>>> triggered on > >>>> > >>>> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} > >>>> 2011-08-02 08:43:06,544 ERROR > >>>> [me.prettyprint.cassandra.connection.ConcurrentHClientPool] - > > Shutdown > >>>> complete on > >>>> > >>>> > > > <ConcurrentCassandraClientPoolByHost>:{cassandradevrk1(10.130.202.34):9393} > >>>> 2011-08-02 08:43:06,544 INFO > >>>> [me.prettyprint.cassandra.connection.CassandraHostRetryService] - > > Host > >>>> detected as down was added to retry queue: > >>>> cassandradevrk1(10.130.202.34):9393 > >>>> 2011-08-02 08:43:06,544 WARN > >>>> [me.prettyprint.cassandra.connection.HConnectionManager] - Could > > not > >>>> fullfill request on this host > > CassandraClient<cassandradevrk1:9393-33> > >>>> > >> > >> > > > >