In Pig I am running a query like this:

```
sdp1 = load 'db1.table1' using org.apache.hcatalog.pig.HCatLoader;
sdp = FILTER sdp1 BY key1 == 'value1' AND key2 == 'value2';
ll = LIMIT sdp 100;
dump ll;
```
HCatalog then starts talking to MySQL for a few minutes asking for metadata; meanwhile, after just a few seconds, Pig fails with:

```
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
```

Number of partitions I have:

```
hive -e 'use db1; show partitions table1' | wc -l
Time taken: 1.467 seconds
37748
```

When I run the same query in a different environment where I have only ~1000 partitions, everything works fine. The problem also does not exist on CDH3 with hcatalog-0.4.0.

In HCatalog's logs I can see the following (note the timestamps; I ran the query at 17:10:45,216):

```
2013-08-27 17:10:46,275 INFO  DataNucleus.MetaData (Log4JLogger.java:info(77)) - Listener found initialisation for persistable class org.apache.hadoop.hive.metastore.model.MPartition
2013-08-27 17:14:23,661 DEBUG metastore.ObjectStore (ObjectStore.java:listMPartitionsByFilter(1832)) - Done retrieving all objects for listMPartitionsByFilter
2013-08-27 17:22:32,410 INFO  metastore.ObjectStore (ObjectStore.java:getPartitionsByFilter(1699)) - # parts after pruning = 37748
```

After that, HCatalog continues to:

```
2013-08-27 17:30:14,631 DEBUG DataNucleus.Transaction (Log4JLogger.java:debug(58)) - Transaction committed in 462221 ms
```

Please note that I have DataNucleus logging set to DEBUG, which slows things down significantly; without it, it still takes around 7 minutes for HCatalog to settle.
Here are the DataNucleus settings from HCatalog's logs:

```
datanucleus.autoStartMechanismMode = checked
javax.jdo.option.Multithreaded = true
datanucleus.identifierFactory = datanucleus
datanucleus.transactionIsolation = read
datanucleus.validateTables = false
javax.jdo.option.ConnectionURL = jdbc:mysql://XXX
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NonTransactionalRead = true
datanucleus.validateConstraints = false
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName = hive
datanucleus.validateColumns = false
datanucleus.cache.level2 = false
datanucleus.plugin.pluginRegistryBundleCheck = LOG
datanucleus.cache.level2.type = none
javax.jdo.PersistenceManagerFactoryClass = org.datanucleus.jdo.JDOPersistenceManagerFactory
datanucleus.autoCreateSchema = true
datanucleus.storeManagerType = rdbms
datanucleus.connectionPoolingType = DBCP
```

This runs on CDH4 4.3.0; HCatalog version: 0.5.0+9-1.cdh4.3.0.p0.12~precise-cdh4.3.0.

Does anyone know if it is possible to increase Pig's timeout? I already have hive.metastore.client.socket.timeout set to 3600, yet Pig times out in about 5-8 seconds.
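For reference, this is where I have been trying to set the timeout. The client-side setting has to be in the hive-site.xml that Pig/HCatLoader actually reads, so I made sure that file is on Pig's classpath; I am not certain the `-D` variant reaches the metastore client in this version, so treat both as things to try rather than a confirmed fix (the paths below are examples from my environment):

```shell
# hive-site.xml on the Pig client (value is in seconds in this Hive version):
#   <property>
#     <name>hive.metastore.client.socket.timeout</name>
#     <value>3600</value>
#   </property>

# Make sure Pig picks up that hive-site.xml (example paths):
export HIVE_CONF_DIR=/etc/hive/conf
export PIG_CLASSPATH=$HIVE_CONF_DIR:$PIG_CLASSPATH

# Alternatively, try passing the property to Pig's JVM directly:
pig -Dhive.metastore.client.socket.timeout=3600 myscript.pig
```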
