Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-09-03 Thread agateaaa
Uploaded a patch for HIVE-5172. Can someone please review it?

Do I have to be a contributor before I submit a patch?

Did run a test with the patch in our test environment (about 1000 pig jobs
reading from and inserting into a hive table via hcatalog, along with an
equal number of alter table statements) over the past 4 days, and haven't
seen any errors on the client or the server.



On Thu, Aug 29, 2013 at 2:39 PM, agateaaa agate...@gmail.com wrote:

 Thanks Ashutosh.

 Filed https://issues.apache.org/jira/browse/HIVE-5172



 On Thu, Aug 29, 2013 at 11:53 AM, Ashutosh Chauhan 
 hashut...@apache.orgwrote:

 Thanks Agatea for digging in. Seems like you have hit a bug. Would you
 mind opening a jira and adding your findings to it?

 Thanks,
 Ashutosh


 On Thu, Aug 29, 2013 at 11:22 AM, agateaaa agate...@gmail.com wrote:

 Sorry hit send too soon ...

 Hi All:

 Put some debugging code in TUGIContainingTransport.getTransport() and I
 tracked it down to

 @Override
 public TUGIContainingTransport getTransport(TTransport trans) {

   // UGI information is not available at connection setup time; it will be
   // set later via the set_ugi() rpc.
   transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

   // return transMap.get(trans); // <- changed for debugging
   TUGIContainingTransport retTrans = transMap.get(trans);

   if (retTrans == null) {
     LOGGER.error("cannot find transport that was in map !!");
     return null;
   } else {
     LOGGER.debug("found transport in map");
     return retTrans;
   }
 }

 When we run this in our test environment, we see that we run into the
 problem just after a GC runs, and the "cannot find transport that was in
 map !!" message gets logged.

 Could the GC be collecting entries from transMap just before we get them?

 Tried a minor change which seems to work

 public TUGIContainingTransport getTransport(TTransport trans) {

   TUGIContainingTransport retTrans = transMap.get(trans);

   if (retTrans == null) {
     // UGI information is not available at connection setup time; it will be
     // set later via the set_ugi() rpc.
     retTrans = new TUGIContainingTransport(trans);
     transMap.putIfAbsent(trans, retTrans);
   }
   return retTrans;
 }


 My questions for the hive and thrift experts:

 1.) Do we need to use a ConcurrentMap here?
 ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
     new MapMaker().weakKeys().weakValues().makeMap();
 It uses == to compare keys (which might be the problem). Also, in this case
 we can't rely on trans always being present in transMap, even right after a
 put, so the change above probably makes sense (a small sketch of this race
 is included below).


 2.) Is it a better idea to use a WeakHashMap with WeakReference values instead?
 (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the
 change made by THRIFT-1468.)

 e.g.
 private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
     Collections.synchronizedMap(
         new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

 getTransport() would be something like

 public TUGIContainingTransport getTransport(TTransport trans) {
   WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
   if (ret == null || ret.get() == null) {
     ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
     transMap3.put(trans, ret); // No need for putIfAbsent();
     // concurrent calls to getTransport() will pass in different TTransports.
   }
   return ret.get();
 }


 I did try 1.) above in our test environment and it does seem to resolve the
 problem, though I am not sure if I am introducing any other problems.


 Can someone help?
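
 To make the suspicion in 1.) concrete, here is a minimal, self-contained
 sketch (it only assumes Guava's MapMaker, which this code already uses; the
 class and variable names are made up) showing that a weak-keyed/weak-valued
 map can return null for an entry that was just put, once the only strong
 reference to the value is gone and a GC runs:

 import java.util.concurrent.ConcurrentMap;
 import com.google.common.collect.MapMaker;

 public class WeakValueDemo {
   public static void main(String[] args) throws Exception {
     ConcurrentMap<Object, Object> map =
         new MapMaker().weakKeys().weakValues().makeMap();

     Object key = new Object();
     map.putIfAbsent(key, new Object()); // no strong reference kept to the value

     System.gc();                        // only a hint; the collector may or may not run
     Thread.sleep(100);

     // With weak values the entry can be gone even though the key is still
     // strongly reachable -- exactly the "cannot find transport that was in
     // map !!" situation described above.
     System.out.println("value after GC: " + map.get(key));
   }
 }

 If the collector does reclaim the value, this prints "value after GC: null",
 which matches the failures showing up right after GC.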


 Thanks
 Agatea













 On Thu, Aug 29, 2013 at 10:57 AM, agateaaa agate...@gmail.com wrote:

  Hi All:
 
  Put some debugging code in TUGIContainingTransport.getTransport() and I
  tracked it down to
 
  @Override
  public TUGIContainingTransport getTransport(TTransport trans) {
 
  // UGI information is not available at connection setup time, it will
 be
  set later
  // via set_ugi() rpc.
  transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
 
  //return transMap.get(trans); -change
TUGIContainingTransport retTrans = transMap.get(trans);
 
if ( retTrans == null ) {
 
 
 
  }
 
 
 
 
 
  On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:
 
  Thanks Nitin
 
  There arent too many connections in close_wait state only 1 or two
 when
  we run into this. Most likely its because of dropped connection.
 
  I could not find any read or write timeouts we can set for the thrift
  server which will tell thrift to hold on to the client connection.
   See this https://issues.apache.org/jira/browse/HIVE-2006 but doesnt
  seem to have been implemented yet. We do have set a client connection
  timeout but cannot find
  an equivalent setting for the server.
 
  We have  a suspicion that this happens when we run two client
 processes
  which modify two distinct partitions of the same hive table. We put
 in a
  workaround so that the two 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-08-29 Thread agateaaa
Hi All:

Put some debugging code in TUGIContainingTransport.getTransport() and I
tracked it down to

@Override
public TUGIContainingTransport getTransport(TTransport trans) {

// UGI information is not available at connection setup time, it will be
set later
// via set_ugi() rpc.
transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

//return transMap.get(trans); -change
  TUGIContainingTransport retTrans = transMap.get(trans);

  if ( retTrans == null ) {



}





On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:

 Thanks Nitin

 There arent too many connections in close_wait state only 1 or two when we
 run into this. Most likely its because of dropped connection.

 I could not find any read or write timeouts we can set for the thrift
 server which will tell thrift to hold on to the client connection.
  See this https://issues.apache.org/jira/browse/HIVE-2006 but doesnt seem
 to have been implemented yet. We do have set a client connection timeout
 but cannot find
 an equivalent setting for the server.

 We have  a suspicion that this happens when we run two client processes
 which modify two distinct partitions of the same hive table. We put in a
 workaround so that the two hive client processes never run together and so
 far things look ok but we will keep monitoring.

 Could it be because hive metastore server is not thread safe, would
 running two alter table statements on two distinct partitions of the same
 table using two client connections cause problems like these, where hive
 metastore server closes or drops a wrong client connection and leaves the
 other hanging?

 Agateaaa




 On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 The mentioned flow is called when you have unsecure mode of thrift
 metastore client-server connection. So one way to avoid this is have a
 secure way.

 code
 public boolean process(final TProtocol in, final TProtocol out)
 throwsTException {
 setIpAddress(in);
 ...
 ...
 ...
 @Override
  protected void setIpAddress(final TProtocol in) {
 TUGIContainingTransport ugiTrans =
 (TUGIContainingTransport)in.getTransport();
 Socket socket = ugiTrans.getSocket();
 if (socket != null) {
   setIpAddress(socket);

 /code


 From the above code snippet, it looks like the null pointer exception is
 not handled if the getSocket returns null.

 can you check whats the ulimit setting on the server? If its set to
 default
 can you set it to unlimited and restart hcat server. (This is just a wild
 guess).

 also the getSocket method suggests If the underlying TTransport is an
 instance of TSocket, it returns the Socket object which it contains.
 Otherwise it returns null.

 so someone from thirft gurus need to tell us whats happening. I have no
 knowledge of this depth

 may be Ashutosh or Thejas will be able to help on this.




 From the netstat close_wait, it looks like the hive metastore server has
 not closed the connection (do not know why yet), may be the hive dev guys
 can help.Are there too many connections in close_wait state?



 On Tue, Jul 30, 2013 at 5:52 AM, agateaaa agate...@gmail.com wrote:

  Looking at the hive metastore server logs see errors like these:
 
  2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
  (TThreadPoolServer.java:run(182)) - Error occurred during processing of
  message.
  java.lang.NullPointerException
  at
 
 
 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
  at
 
 
 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
  at
 
 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 
  approx same time as we see timeout or connection reset errors.
 
  Dont know if this is the cause or the side affect of he connection
  timeout/connection reset errors. Does anybody have any pointers or
  suggestions ?
 
  Thanks
 
 
  On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:
 
   Thanks Nitin!
  
   We have simiar setup (identical hcatalog and hive server versions) on
 a
   another production environment and dont see any errors (its been
 running
  ok
   for a few months)
  
   Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or
 hive
   0.10 soon.
  
   I did see that the last time we ran into this problem doing a
 netstat-ntp
   | grep :1 see that server was holding on to one socket
 connection
  in
   CLOSE_WAIT state for a long time
(hive metastore server is running on port 1). Dont know if thats
   relevant here or not
  
   Can you suggest any hive configuration settings we can tweak or
  networking
   tools/tips, we can use to 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-08-29 Thread agateaaa
Thanks Ashutosh.

Filed https://issues.apache.org/jira/browse/HIVE-5172


On Thu, Aug 29, 2013 at 11:53 AM, Ashutosh Chauhan hashut...@apache.orgwrote:

 Thanks Agatea for digging in. Seems like you have hit a bug. Would you
 mind opening a jira and adding your findings to it?

 Thanks,
 Ashutosh


 On Thu, Aug 29, 2013 at 11:22 AM, agateaaa agate...@gmail.com wrote:

 Sorry hit send too soon ...

 Hi All:

 Put some debugging code in TUGIContainingTransport.getTransport() and I
 tracked it down to

 @Override
 public TUGIContainingTransport getTransport(TTransport trans) {

   // UGI information is not available at connection setup time; it will be
   // set later via the set_ugi() rpc.
   transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

   // return transMap.get(trans); // <- changed for debugging
   TUGIContainingTransport retTrans = transMap.get(trans);

   if (retTrans == null) {
     LOGGER.error("cannot find transport that was in map !!");
     return null;
   } else {
     LOGGER.debug("found transport in map");
     return retTrans;
   }
 }

 When we run this in our test environment, we see that we run into the
 problem just after a GC runs, and the "cannot find transport that was in
 map !!" message gets logged.

 Could the GC be collecting entries from transMap just before we get them?

 Tried a minor change which seems to work

 public TUGIContainingTransport getTransport(TTransport trans) {

   TUGIContainingTransport retTrans = transMap.get(trans);

   if (retTrans == null) {
     // UGI information is not available at connection setup time; it will be
     // set later via the set_ugi() rpc.
     retTrans = new TUGIContainingTransport(trans);
     transMap.putIfAbsent(trans, retTrans);
   }
   return retTrans;
 }


 My questions for the hive and thrift experts:

 1.) Do we need to use a ConcurrentMap here?
 ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
     new MapMaker().weakKeys().weakValues().makeMap();
 It uses == to compare keys (which might be the problem). Also, in this case
 we can't rely on trans always being present in transMap, even right after a
 put, so the change above probably makes sense.


 2.) Is it a better idea to use a WeakHashMap with WeakReference values instead?
 (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the
 change made by THRIFT-1468.)

 e.g.
 private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
     Collections.synchronizedMap(
         new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

 getTransport() would be something like

 public TUGIContainingTransport getTransport(TTransport trans) {
   WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
   if (ret == null || ret.get() == null) {
     ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
     transMap3.put(trans, ret); // No need for putIfAbsent();
     // concurrent calls to getTransport() will pass in different TTransports.
   }
   return ret.get();
 }


 I did try 1.) above in our test environment and it does seem to resolve the
 problem, though I am not sure if I am introducing any other problems.


 Can someone help?


 Thanks
 Agatea













 On Thu, Aug 29, 2013 at 10:57 AM, agateaaa agate...@gmail.com wrote:

  Hi All:
 
  Put some debugging code in TUGIContainingTransport.getTransport() and I
  tracked it down to
 
  @Override
  public TUGIContainingTransport getTransport(TTransport trans) {
 
  // UGI information is not available at connection setup time, it will be
  set later
  // via set_ugi() rpc.
  transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
 
  //return transMap.get(trans); -change
TUGIContainingTransport retTrans = transMap.get(trans);
 
if ( retTrans == null ) {
 
 
 
  }
 
 
 
 
 
  On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:
 
  Thanks Nitin
 
  There arent too many connections in close_wait state only 1 or two when
  we run into this. Most likely its because of dropped connection.
 
  I could not find any read or write timeouts we can set for the thrift
  server which will tell thrift to hold on to the client connection.
   See this https://issues.apache.org/jira/browse/HIVE-2006 but doesnt
  seem to have been implemented yet. We do have set a client connection
  timeout but cannot find
  an equivalent setting for the server.
 
  We have  a suspicion that this happens when we run two client processes
  which modify two distinct partitions of the same hive table. We put in
 a
  workaround so that the two hive client processes never run together
 and so
  far things look ok but we will keep monitoring.
 
  Could it be because hive metastore server is not thread safe, would
  running two alter table statements on two distinct partitions of the
 same
  table using two client connections cause problems like these, where
 hive
  metastore server closes or drops a wrong client connection and leaves
 the
  other hanging?
 
  Agateaaa
 
 
 
 
  On Tue, Jul 30, 2013 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-31 Thread agateaaa
Thanks Nitin

There aren't too many connections in CLOSE_WAIT state, only 1 or 2 when we
run into this. Most likely it's because of a dropped connection.

I could not find any read or write timeouts we can set for the thrift
server which would tell thrift to hold on to the client connection.
See https://issues.apache.org/jira/browse/HIVE-2006, but that doesn't seem
to have been implemented yet. We do have a client connection timeout set,
but cannot find an equivalent setting for the server.

We have a suspicion that this happens when we run two client processes
which modify two distinct partitions of the same hive table. We put in a
workaround so that the two hive client processes never run together, and so
far things look ok, but we will keep monitoring.

Could it be that the hive metastore server is not thread safe? Would running
two alter table statements on two distinct partitions of the same table,
using two client connections, cause problems like these, where the hive
metastore server closes or drops the wrong client connection and leaves the
other hanging?
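
In case it helps anyone reproduce this, here is a rough, hypothetical sketch of
what our two client processes are effectively doing (the database, table,
partition values and metastore URI below are placeholders, not our real names;
appendPartition() is just one way to add a partition through the metastore API):

import java.util.Arrays;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class ConcurrentAddPartitions {
  public static void main(String[] args) throws Exception {
    // Two connections altering distinct partitions of the same table.
    Thread t1 = makeClientThread("2013-07-30");
    Thread t2 = makeClientThread("2013-07-31");
    t1.start();
    t2.start();
    t1.join();
    t2.join();
  }

  private static Thread makeClientThread(final String dt) {
    return new Thread(new Runnable() {
      public void run() {
        try {
          HiveConf conf = new HiveConf();
          conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
          HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
          try {
            // Roughly equivalent to: ALTER TABLE mydb.mytable ADD PARTITION (dt='...')
            client.appendPartition("mydb", "mytable", Arrays.asList(dt));
          } finally {
            client.close();
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    });
  }
}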

Agateaaa




On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 The mentioned flow is called when you have unsecure mode of thrift
 metastore client-server connection. So one way to avoid this is have a
 secure way.

 code
 public boolean process(final TProtocol in, final TProtocol out)
 throwsTException {
 setIpAddress(in);
 ...
 ...
 ...
 @Override
  protected void setIpAddress(final TProtocol in) {
 TUGIContainingTransport ugiTrans =
 (TUGIContainingTransport)in.getTransport();
 Socket socket = ugiTrans.getSocket();
 if (socket != null) {
   setIpAddress(socket);

 /code


 From the above code snippet, it looks like the null pointer exception is
 not handled if the getSocket returns null.

 can you check whats the ulimit setting on the server? If its set to default
 can you set it to unlimited and restart hcat server. (This is just a wild
 guess).

 also the getSocket method suggests If the underlying TTransport is an
 instance of TSocket, it returns the Socket object which it contains.
 Otherwise it returns null.

 so someone from thirft gurus need to tell us whats happening. I have no
 knowledge of this depth

 may be Ashutosh or Thejas will be able to help on this.




 From the netstat close_wait, it looks like the hive metastore server has
 not closed the connection (do not know why yet), may be the hive dev guys
 can help.Are there too many connections in close_wait state?



 On Tue, Jul 30, 2013 at 5:52 AM, agateaaa agate...@gmail.com wrote:

  Looking at the hive metastore server logs see errors like these:
 
  2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
  (TThreadPoolServer.java:run(182)) - Error occurred during processing of
  message.
  java.lang.NullPointerException
  at
 
 
 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
  at
 
 
 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
  at
 
 
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 
  approx same time as we see timeout or connection reset errors.
 
  Dont know if this is the cause or the side affect of he connection
  timeout/connection reset errors. Does anybody have any pointers or
  suggestions ?
 
  Thanks
 
 
  On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:
 
   Thanks Nitin!
  
   We have simiar setup (identical hcatalog and hive server versions) on a
   another production environment and dont see any errors (its been
 running
  ok
   for a few months)
  
   Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or hive
   0.10 soon.
  
   I did see that the last time we ran into this problem doing a
 netstat-ntp
   | grep :1 see that server was holding on to one socket connection
  in
   CLOSE_WAIT state for a long time
(hive metastore server is running on port 1). Dont know if thats
   relevant here or not
  
   Can you suggest any hive configuration settings we can tweak or
  networking
   tools/tips, we can use to narrow this down ?
  
   Thanks
   Agateaaa
  
  
  
  
   On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar nitinpawar...@gmail.com
  wrote:
  
   Is there any chance you can do a update on test environment with
  hcat-0.5
   and hive-0(11 or 10) and see if you can reproduce the issue?
  
   We used to see this error when there was load on hcat server or some
   network issue connecting to the server(second one was rare occurrence)
  
  
   On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com
 wrote:
  
   Hi All:
  
   We are running into frequent problem using HCatalog 0.4.1 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-30 Thread Nitin Pawar
The mentioned flow is called when you have the unsecured mode of thrift
metastore client-server connection. So one way to avoid this is to use the
secure mode.

code
public boolean process(final TProtocol in, final TProtocol out) throws TException {
  setIpAddress(in);
  ...
  ...
  ...

@Override
protected void setIpAddress(final TProtocol in) {
  TUGIContainingTransport ugiTrans = (TUGIContainingTransport) in.getTransport();
  Socket socket = ugiTrans.getSocket();
  if (socket != null) {
    setIpAddress(socket);

/code


From the above code snippet, it looks like the null pointer exception is
not handled if the getSocket returns null.
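
A sketch of the kind of guard that could avoid that (just an illustration of
the idea, not the actual TUGIBasedProcessor code):

@Override
protected void setIpAddress(final TProtocol in) {
  TTransport trans = in.getTransport();
  // If the transport is missing or is not the expected wrapper (or is not
  // socket-backed), skip setting the IP address instead of letting a
  // NullPointerException escape into TThreadPoolServer's worker thread.
  if (!(trans instanceof TUGIContainingTransport)) {
    return;
  }
  Socket socket = ((TUGIContainingTransport) trans).getSocket();
  if (socket != null) {
    setIpAddress(socket);
  }
}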

Can you check what the ulimit setting on the server is? If it's set to the
default, can you set it to unlimited and restart the hcat server? (This is
just a wild guess.)

Also, the getSocket method suggests: "If the underlying TTransport is an
instance of TSocket, it returns the Socket object which it contains.
Otherwise it returns null."

So one of the thrift gurus needs to tell us what's happening; I have no
knowledge at this depth.

Maybe Ashutosh or Thejas will be able to help on this.




From the netstat CLOSE_WAIT output, it looks like the hive metastore server
has not closed the connection (don't know why yet); maybe the hive dev guys
can help. Are there too many connections in CLOSE_WAIT state?



On Tue, Jul 30, 2013 at 5:52 AM, agateaaa agate...@gmail.com wrote:

 Looking at the hive metastore server logs see errors like these:

 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
 (TThreadPoolServer.java:run(182)) - Error occurred during processing of
 message.
 java.lang.NullPointerException
 at

 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
 at

 org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
 at

 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 approx same time as we see timeout or connection reset errors.

 Dont know if this is the cause or the side affect of he connection
 timeout/connection reset errors. Does anybody have any pointers or
 suggestions ?

 Thanks


 On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:

  Thanks Nitin!
 
  We have simiar setup (identical hcatalog and hive server versions) on a
  another production environment and dont see any errors (its been running
 ok
  for a few months)
 
  Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or hive
  0.10 soon.
 
  I did see that the last time we ran into this problem doing a netstat-ntp
  | grep :1 see that server was holding on to one socket connection
 in
  CLOSE_WAIT state for a long time
   (hive metastore server is running on port 1). Dont know if thats
  relevant here or not
 
  Can you suggest any hive configuration settings we can tweak or
 networking
  tools/tips, we can use to narrow this down ?
 
  Thanks
  Agateaaa
 
 
 
 
  On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:
 
  Is there any chance you can do a update on test environment with
 hcat-0.5
  and hive-0(11 or 10) and see if you can reproduce the issue?
 
  We used to see this error when there was load on hcat server or some
  network issue connecting to the server(second one was rare occurrence)
 
 
  On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote:
 
  Hi All:
 
  We are running into frequent problem using HCatalog 0.4.1 (HIve
 Metastore
  Server 0.9) where we get connection reset or connection timeout errors.
 
  The hive metastore server has been allocated enough (12G) memory.
 
  This is a critical problem for us and would appreciate if anyone has
 any
  pointers.
 
  We did add a retry logic in our client, which seems to help, but I am
  just
  wondering how can we narrow down to the root cause
  of this problem. Could this be a hiccup in networking which causes the
  hive
  server to get into a unresponsive state  ?
 
  Thanks
 
  Agateaaa
 
 
  Example Connection reset error:
  ===
 
  org.apache.thrift.transport.TTransportException:
  java.net.SocketException:
  Connection reset
  at
 
 
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
  at
 
 
 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
   at
 
 
 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
  at
 
 
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
   at
 org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
  at
 
 
 

Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-29 Thread agateaaa
Hi All:

We are running into a frequent problem using HCatalog 0.4.1 (Hive Metastore
Server 0.9) where we get connection reset or connection timeout errors.

The hive metastore server has been allocated enough (12G) memory.

This is a critical problem for us, and we would appreciate it if anyone has
any pointers.

We did add retry logic in our client, which seems to help, but I am just
wondering how we can narrow down the root cause of this problem. Could this
be a hiccup in networking which causes the hive server to get into an
unresponsive state?
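
For what it's worth, the retry is conceptually just the following (a simplified
sketch with made-up names and limits, not our actual client code):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.thrift.TException;

public class RetryingMetastoreCall {
  // Fetch a table, retrying a few times on thrift-level failures such as
  // "Connection reset" or "Read timed out" (both surface as TException subtypes).
  public static Table getTableWithRetry(HiveConf conf, String db, String tbl)
      throws Exception {
    TException last = null;
    for (int attempt = 1; attempt <= 3; attempt++) {
      HiveMetaStoreClient client = null;
      try {
        client = new HiveMetaStoreClient(conf); // reconnect on every attempt
        return client.getTable(db, tbl);
      } catch (TException e) {
        last = e;
        Thread.sleep(1000L * attempt);          // crude backoff between attempts
      } finally {
        if (client != null) {
          client.close();
        }
      }
    }
    throw last;
  }
}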

Thanks

Agateaaa


Example Connection reset error:
===

org.apache.thrift.transport.TTransportException: java.net.SocketException:
Connection reset
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
 at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
 at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
 at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
 at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
at
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
 at
org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
 at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
 at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
 ... 30 more




Example Connection timeout error:
==

org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
 at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
 at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
 at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
 at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
 at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
 at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
 at

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-29 Thread Nitin Pawar
Is there any chance you can do an update on a test environment with hcat-0.5
and hive-0.11 (or 0.10) and see if you can reproduce the issue?

We used to see this error when there was load on the hcat server or some
network issue connecting to the server (the second one was a rare occurrence).


On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote:

 Hi All:

 We are running into frequent problem using HCatalog 0.4.1 (HIve Metastore
 Server 0.9) where we get connection reset or connection timeout errors.

 The hive metastore server has been allocated enough (12G) memory.

 This is a critical problem for us and would appreciate if anyone has any
 pointers.

 We did add a retry logic in our client, which seems to help, but I am just
 wondering how can we narrow down to the root cause
 of this problem. Could this be a hiccup in networking which causes the hive
 server to get into a unresponsive state  ?

 Thanks

 Agateaaa


 Example Connection reset error:
 ===

 org.apache.thrift.transport.TTransportException: java.net.SocketException:
 Connection reset
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
  at

 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
  at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
  at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
  at

 org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
  at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
 at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
  at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
 at

 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
 at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:168)
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
  ... 30 more




 Example Connection timeout error:
 ==

 org.apache.thrift.transport.TTransportException:
 java.net.SocketTimeoutException: Read timed out
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
  at

 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
  at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
  at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
  at

 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-29 Thread agateaaa
Thanks Nitin!

We have a similar setup (identical hcatalog and hive server versions) on
another production environment and don't see any errors (it's been running ok
for a few months).

Unfortunately we won't be able to move to hcat 0.5 and hive 0.11 or hive
0.10 soon.

I did see that the last time we ran into this problem, doing a netstat -ntp |
grep :1 showed that the server was holding on to one socket connection in
CLOSE_WAIT state for a long time
 (hive metastore server is running on port 1). Don't know if that's
relevant here or not.

Can you suggest any hive configuration settings we can tweak, or networking
tools/tips we can use, to narrow this down?

Thanks
Agateaaa




On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 Is there any chance you can do a update on test environment with hcat-0.5
 and hive-0(11 or 10) and see if you can reproduce the issue?

 We used to see this error when there was load on hcat server or some
 network issue connecting to the server(second one was rare occurrence)


 On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote:

 Hi All:

 We are running into frequent problem using HCatalog 0.4.1 (HIve Metastore
 Server 0.9) where we get connection reset or connection timeout errors.

 The hive metastore server has been allocated enough (12G) memory.

 This is a critical problem for us and would appreciate if anyone has any
 pointers.

 We did add a retry logic in our client, which seems to help, but I am just
 wondering how can we narrow down to the root cause
 of this problem. Could this be a hiccup in networking which causes the
 hive
 server to get into a unresponsive state  ?

 Thanks

 Agateaaa


 Example Connection reset error:
 ===

 org.apache.thrift.transport.TTransportException: java.net.SocketException:
 Connection reset
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
  at

 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
  at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
  at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
  at

 org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
  at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
 at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
  at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
 at

 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:168)
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
  ... 30 more




 Example Connection timeout error:
 ==

 org.apache.thrift.transport.TTransportException:
 java.net.SocketTimeoutException: Read timed out
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
  at

 

Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors

2013-07-29 Thread agateaaa
Looking at the hive metastore server logs, we see errors like these:

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
(TThreadPoolServer.java:run(182)) - Error occurred during processing of
message.
java.lang.NullPointerException
at
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
at
org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

These show up at approximately the same time as we see the timeout or
connection reset errors.

Don't know if this is the cause or a side effect of the connection
timeout/connection reset errors. Does anybody have any pointers or
suggestions?

Thanks


On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:

 Thanks Nitin!

 We have simiar setup (identical hcatalog and hive server versions) on a
 another production environment and dont see any errors (its been running ok
 for a few months)

 Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or hive
 0.10 soon.

 I did see that the last time we ran into this problem doing a netstat-ntp
 | grep :1 see that server was holding on to one socket connection in
 CLOSE_WAIT state for a long time
  (hive metastore server is running on port 1). Dont know if thats
 relevant here or not

 Can you suggest any hive configuration settings we can tweak or networking
 tools/tips, we can use to narrow this down ?

 Thanks
 Agateaaa




 On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 Is there any chance you can do a update on test environment with hcat-0.5
 and hive-0(11 or 10) and see if you can reproduce the issue?

 We used to see this error when there was load on hcat server or some
 network issue connecting to the server(second one was rare occurrence)


 On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote:

 Hi All:

 We are running into frequent problem using HCatalog 0.4.1 (HIve Metastore
 Server 0.9) where we get connection reset or connection timeout errors.

 The hive metastore server has been allocated enough (12G) memory.

 This is a critical problem for us and would appreciate if anyone has any
 pointers.

 We did add a retry logic in our client, which seems to help, but I am
 just
 wondering how can we narrow down to the root cause
 of this problem. Could this be a hiccup in networking which causes the
 hive
 server to get into a unresponsive state  ?

 Thanks

 Agateaaa


 Example Connection reset error:
 ===

 org.apache.thrift.transport.TTransportException:
 java.net.SocketException:
 Connection reset
 at

 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
  at

 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at

 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
 at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
  at

 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
  at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
 at

 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
  at

 org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
 at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
  at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
 at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
  at

 org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
 at

 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
 at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
 at
 org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
 at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)