@doug Regarding monotonically increasing keys, I took care of that by randomizing the data order. Regarding pre-created regions, I did not know I could do that. Thanks. But when I looked into the case studies, I found the section "HBase Region With Non-Local Data". Will this be a problem when I pre-create the regions?
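For reference, pre-creating regions comes down to supplying a sorted set of split keys when the table is created. Below is a rough sketch of one way to pick those keys from a sample of row keys. The class and method names (`SplitKeySketch`, `splitKeys`) are made up for illustration, and the sketch assumes the row keys are plain ASCII strings, so that Java string order matches HBase's byte order, and that the sample is representative of the full data set.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: derives region split keys
// from a sample of row keys.
public class SplitKeySketch {

    // Returns numRegions - 1 boundaries taken at evenly spaced positions
    // of the sorted, de-duplicated sample.
    static byte[][] splitKeys(List<String> sampleKeys, int numRegions) {
        // TreeSet sorts the keys and removes duplicates, so no two split keys are equal.
        List<String> sorted = new ArrayList<>(new TreeSet<>(sampleKeys));
        if (numRegions < 2 || sorted.size() < numRegions) {
            throw new IllegalArgumentException(
                "need numRegions >= 2 and at least numRegions distinct sample keys");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Position of the i-th boundary within the sorted sample.
            int idx = (int) ((long) i * sorted.size() / numRegions);
            splits[i - 1] = sorted.get(idx).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("h", "a", "c", "b", "e", "d", "g", "f");
        // Prints c, e and g, one per line: three boundaries for four regions.
        for (byte[] split : splitKeys(sample, 4)) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
    }
}
```

In the 0.9x Java client, an array like this would typically be passed as the second argument to HBaseAdmin.createTable(HTableDescriptor, byte[][]). How many regions to create per region server is a separate tuning question that this sketch does not address.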
@michel The schema is simple: one column family, into which we'll insert a max of 10 columns. 4 columns are compulsory and the other 6 cols are sparsely filled.

KEY: a string of 50 characters
Col1: int
Col2: string of 20 characters
Col3: string of 20 characters
Col4: int
Col5: int [sparse]
Col6: float [sparse]
Col7: string of 3 characters [sparse]
Col8: string of 3 characters [sparse]
Col9: string of 3 characters [sparse]

I've kept max.reduce.tasks = 16. I haven't set up MSLAB. What values do you recommend for my cluster?

> "10k rows in a batch put() not really a good idea."
Hmm, should it be less or more?

> "What's your region size?"
I did not set hbase.hregion.max.filesize manually; please recommend a value. Neither did I pre-create regions.

I'm not saying Pig will be the bottleneck. The output format, the HBase configuration, or the hardware could be. I need suggestions on those.

Can I use HFileOutputFormat in this case? Can I get some example snippets?

Thanks
Raj

On Thu, Apr 26, 2012 at 7:11 PM, Michel Segel <[email protected]> wrote:

> Ok...
> 5 machines...
> Total cluster? Is that 5 DN?
> Each machine 1quad core, 32gb ram, 7 x600GB not sure what types of drives.
>
>
> so let's assume 1control node running NN, JT, HM, ZK
> And 4 DN running DN,TT,RS.
>
> We don't know your Schema, row size, or network. ( 10GBe, 1GBe, 100MBe?)
>
> We also don't know if you've tuned GC implemented MSLABS ... Etc.
>
> So 4 hours for 175Million rows? Could be ok.
> Write your insert using a java M/R and see how long it takes.
>
> Nor do we know how many. Slots you have on each box.
> 10k rows in a batch put() not really a good idea.
> What's your region size?
>
>
> Lots to think about before you can ask if you are doing the right thing,
> or if PIG is the bottleneck.
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Apr 26, 2012, at 7:09 AM, Rajgopal Vaithiyanathan <[email protected]> wrote:
>
> > My bad.
> >
> > I had used cat /proc/cpuinfo | grep "processor" | wc -l
> > cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l => 4
> >
> > so its 4 physical cores then!
> >
> > and free -m gives me this.
> >                     total       used       free     shared    buffers     cached
> > Mem:                32174      31382        792          0        123      27339
> > -/+ buffers/cache:              3918      28256
> > Swap:               24575          0      24575
> >
> > On Thu, Apr 26, 2012 at 5:18 PM, Michel Segel <[email protected]> wrote:
> >
> >> 32 cores w 32GB of Ram?
> >>
> >> Pig isn't fast, but I have to question what you are using for hardware.
> >> Who makes a 32 core box?
> >> Assuming you mean 16 physical cores.
> >>
> >> 7 drives? Not enough spindles for the number of cores.
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On Apr 26, 2012, at 6:38 AM, Rajgopal Vaithiyanathan <[email protected]> wrote:
> >>
> >>> Hey all,
> >>>
> >>> The default - HBaseStorage() takes hell lot of time for puts.
> >>>
> >>> In a cluster of 5 machines, insertion of 175 Million records took 4Hours 45 minutes
> >>> Question - Is this good enough ?
> >>> each machine has 32 cores and 32GB ram with 7*600GB harddisks. HBASE's heap has been configured to 8GB.
> >>> If the put speed is low, how can i improve them..?
> >>>
> >>> I tried tweaking the TableOutputFormat by increasing the WriteBufferSize to 24MB, and adding the multi put feature (by adding 10,000 puts in ArrayList and putting it as a batch). After doing this, it started throwing
> >>>
> >>> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read.
> >>> ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41135 remote=slave1/172.21.208.176:60020]
> >>>
> >>> Which i assume is because, the clients took too long to put.
> >>>
> >>> The detailed log is as follows from one of the reduce job is as follows.
> >>>
> >>> I've 'censored' some of the details. which i assume is Okay.! :P
> >>> 2012-04-23 20:07:12,815 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> >>> 2012-04-23 20:07:13,097 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.2-1221870, built on 12/21/2011 20:46 GMT
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=*****.*****
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_22
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-6-openjdk/jre
> >>> 2012-04-23 20:07:13,787 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=****************************
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=**********************
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=***************************
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.38-8-server
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=raj
> >>>
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=*********
> >>> 2012-04-23 20:07:13,788 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=**********************:
> >>> 2012-04-23 20:07:13,790 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181 sessionTimeout=180000 watcher=hconnection
> >>> 2012-04-23 20:07:13,822 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /172.21.208.180:2181
> >>> 2012-04-23 20:07:13,823 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is [email protected]
> >>> 2012-04-23 20:07:13,825 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to master/172.21.208.180:2181, initiating session
> >>> 2012-04-23 20:07:13,840 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server master/172.21.208.180:2181, sessionid = 0x136dfa124e90015, negotiated timeout = 180000
> >>> 2012-04-23 20:07:14,129 INFO com.raj.OptimisedTableOutputFormat: Created table instance for index
> >>> 2012-04-23 20:07:14,184 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
> >>> 2012-04-23 20:07:14,205 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4513e9fd
> >>> 2012-04-23 20:08:49,852 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=index,,1335191775144.2e69ca9ad2a2d92699aa34b1dc37f1bb., hostname=slave1, port=60020
> >>> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41135 remote=slave1/172.21.208.176:60020]
> >>>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> >>>     at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1557)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
> >>>     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:900)
> >>>     at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:773)
> >>>     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:760)
> >>>     at com.raj.OptimisedTableOutputFormat$TableRecordWriter.write(OptimisedTableOutputFormat.java:142)
> >>>     at com.raj.OptimisedTableOutputFormat$TableRecordWriter.write(OptimisedTableOutputFormat.java:1)
> >>>     at com.raj.HBaseStorage.putNext(HBaseStorage.java:583)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> >>>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
> >>>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> >>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>>     at javax.security.auth.Subject.doAs(Subject.java:416)
> >>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
> >>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >>> Caused by: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41135 remote=slave1/172.21.208.176:60020]
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:930)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:903)
> >>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
> >>>     at $Proxy7.multi(Unknown Source)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1386)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1384)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithoutRetries(HConnectionManager.java:1365)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1383)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1381)
> >>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >>>     at java.lang.Thread.run(Thread.java:679)
> >>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41135 remote=slave1/172.21.208.176:60020]
> >>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> >>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:311)
> >>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:571)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:505)
> >>> 2012-04-23 20:09:51,018 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Failed all from region=index,,1335191775144.2e69ca9ad2a2d92699aa34b1dc37f1bb., hostname=slave1, port=60020
> >>> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41150 remote=slave1/172.21.208.176:60020]
> >>>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> >>>     at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1557)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
> >>>     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:900)
> >>>     at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:773)
> >>>     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:760)
> >>>     at com.raj.OptimisedTableOutputFormat$TableRecordWriter.write(OptimisedTableOutputFormat.java:142)
> >>>     at com.raj.OptimisedTableOutputFormat$TableRecordWriter.write(OptimisedTableOutputFormat.java:1)
> >>>     at com.raj.HBaseStorage.putNext(HBaseStorage.java:583)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> >>>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
> >>>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:269)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
> >>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> >>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >>>     at java.security.AccessController.doPrivileged(Native Method)
> >>>     at javax.security.auth.Subject.doAs(Subject.java:416)
> >>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
> >>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> >>> Caused by: java.net.SocketTimeoutException: Call to slave1/172.21.208.176:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41150 remote=slave1/172.21.208.176:60020]
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:930)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:903)
> >>>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
> >>>     at $Proxy7.multi(Unknown Source)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1386)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1384)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithoutRetries(HConnectionManager.java:1365)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1383)
> >>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1381)
> >>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >>>     at java.lang.Thread.run(Thread.java:679)
> >>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.208.176:41150 remote=slave1/172.21.208.176:60020]
> >>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> >>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>>     at java.io.FilterInputStream.read(FilterInputStream.java:133)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:311)
> >>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> >>>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:571)
> >>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:505)
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Raj
> >>
>
