Hi,

I am trying to bulk load into an HBase table with one column family, using a custom mapper to create the Puts according to my needs (machine setup at the end of the mail).
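For reference, the mapper boils down to something like this (a simplified sketch; the table layout, column names and the tab-separated input format are placeholders, not our real schema):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of the custom mapper (placeholder names, made-up
// key<TAB>value input format -- not our real schema). It emits one Put per
// record, keyed by the row key, as PutSortReducer / HFileOutputFormat expect.
public class RecordToPutMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("cf"); // the single column family

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t"); // assumed: key<TAB>value
    byte[] rowKey = Bytes.toBytes(fields[0]);
    Put put = new Put(rowKey);
    put.add(FAMILY, Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}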

Unfortunately, with our data it is a bit hard to presplit the tables, since the keys are not that predictable (we won't really do scans afterwards, so no problem from that side).

Anyway, I managed to do some presplitting (currently testing 9 to 127 regions) to get more than one reducer, and from what I can see the load is distributed quite well across them.
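The splitting itself is nothing fancy; it boils down to this (a sketch with example split points and placeholder names -- our real keys are less regular):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Rough sketch of the presplitting (example split points only). N split
// keys give N+1 regions, and the bulk load job later gets one reducer
// per region.
public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable"); // placeholder name
    desc.addFamily(new HColumnDescriptor("cf"));
    byte[][] splits = new byte[][] {
        Bytes.toBytes("1"), Bytes.toBytes("2"), Bytes.toBytes("3")
        // ... more split points for more regions
    };
    admin.createTable(desc, splits);
  }
}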

When I load a file of about 11 GB, containing 100 million small records, I get the following errors from the reducers very close to the end of the whole job (at around 99.x%). It always happens with the last two unfinished reducers, and it does not happen for smaller datasets. What does the "EEXIST: File exists" error mean here?


From stderr:
---------------------
Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
        at org.apache.hadoop.mapred.Child$3.run(Child.java:157)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.

From the task syslog:
------------------------------------
2011-12-22 17:57:30,581 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
        at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: EEXIST: File exists
        at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
        at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
        ... 7 more
2011-12-22 17:57:30,583 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task


Or another one, from another attempt after the first one failed:
2011-12-22 18:06:36,640 INFO org.apache.hadoop.mapred.Task: Communication exception: java.io.IOException: Call to /127.0.0.1:45198 failed on local exception: java.io.IOException: Couldn't set up IO streams
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
        at org.apache.hadoop.ipc.Client.call(Client.java:1110)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.statusUpdate(Unknown Source)
        at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:643)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Couldn't set up IO streams
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:591)
        at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:210)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
        at org.apache.hadoop.ipc.Client.call(Client.java:1078)
        ... 4 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
        at org.apache.hadoop.ipc.Client$Connection.writeRpcHeader(Client.java:646)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:535)
        ... 7 more

2011-12-22 18:06:36,642 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-12-22 18:06:36,643 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space



Or:
Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
        at org.apache.hadoop.mapred.Child$3.run(Child.java:157)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.



Or:
2011-12-22 18:09:03,417 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.hadoop.hbase.client.Put.readFields(Put.java:495)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
        at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:60)
        at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:40)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)




Unfortunately, our machines can only hold 8 GB of RAM in total. We have nine TaskTracker/RegionServer nodes and one server acting as JobTracker, HBase Master and ZooKeeper.
We use Cloudera CDH3u2 with HBase 0.90.4.

HBase heap: 4 GB
Java heap for the MapReduce child tasks: 1.5 GB

We run pretty much the standard configuration.
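The child heap is just the standard knob; at the job level it would be the equivalent of (a sketch -- in our case it is set cluster-wide in mapred-site.xml, not per job):

import org.apache.hadoop.conf.Configuration;

public class ChildHeapExample {
  // The 1.5 GB child heap is the standard mapred.child.java.opts setting
  // (set cluster-wide in our mapred-site.xml; shown here at the job level
  // for clarity).
  static Configuration withChildHeap() {
    Configuration conf = new Configuration();
    conf.set("mapred.child.java.opts", "-Xmx1536m");
    return conf;
  }
}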

It seems to run fine for smaller datasets (e.g. 25 M or 50 M records).

When I load the data using TableOutputFormat from a map-only job instead, it runs without errors, but slowly (3.5 hours, compared to a few minutes with the bulk loader). I would like to go a bit further in data size (400 M records in a 42 GB file, or 800 M records in an 82 GB file).
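For completeness, the bulk load job is wired up the usual way via HFileOutputFormat (a sketch; paths and names are placeholders); the slow variant simply swaps this for a map-only job writing through TableOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the bulk load job setup (placeholder paths/names).
// configureIncrementalLoad() installs TotalOrderPartitioner and
// PutSortReducer and sets the reducer count to the number of regions.
public class BulkLoadJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulkload");
    job.setJarByClass(RecordToPutMapper.class);
    job.setMapperClass(RecordToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/input/records"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);
    // afterwards the HFiles get moved into the table with:
    //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable
  }
}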

Can it really be that these errors come just from Java heap space and garbage collection? Is there something else I should consider?

Thank you,
Christopher






