Hi,
I am trying to do a bulkload into an HBase table with one column family, using a custom mapper to create the Puts according to my needs (machine setup at the end of the mail).
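For reference, this is roughly what the job setup and mapper look like. It is only a simplified sketch: the table name, the "cf"/"q" column names and the line parsing are placeholders, not our real schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // The mapper turns one input line into one Put keyed by the row key.
  // The tab split and the "cf"/"q" names stand in for our real parsing.
  static class PutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulkload");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(PutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Sets PutSortReducer, the TotalOrderPartitioner and one reducer per
    // region of the (presplit) target table.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}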
Unfortunately, with our data it is a bit hard to presplit the table, since the keys are not that predictable (we won't really do scans afterwards, so no problem from that side).
Anyway, I managed to do some presplitting (testing now with 9 to 127 regions) to get more than one reducer, and from my point of view the regions distribute the load quite well.
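For what it's worth, the presplit table is created roughly like this. The split points below are just evenly spaced one-byte prefixes as an example; they are not our real key distribution.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("mytable");   // placeholder name
    desc.addFamily(new HColumnDescriptor("cf"));               // the single column family
    // Example only: 15 split points on the first key byte -> 16 initial regions.
    byte[][] splits = new byte[15][];
    for (int i = 0; i < splits.length; i++) {
      splits[i] = new byte[] { (byte) ((i + 1) * 16) };
    }
    admin.createTable(desc, splits);
  }
}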
When I want to load a file of about 11 GB, containing 100 million small records, I get the following errors from the reducers very close to the end of the whole job (at around 99.x %). It always happens with the last two unfinished reducers.
And what does the "EEXIST: File exists" error mean here? It does not happen for smaller datasets.
From stderr:
---------------------
Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError:
Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
at org.apache.hadoop.mapred.Child$3.run(Child.java:157)
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
From the syslog:
------------------------------------
2011-12-22 17:57:30,581 WARN org.apache.hadoop.mapred.Child: Error
running child
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: EEXIST: File exists
at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
... 7 more
2011-12-22 17:57:30,583 INFO org.apache.hadoop.mapred.Task: Runnning
cleanup for the task
Or another one, from another attempt after the first one failed:
2011-12-22 18:06:36,640 INFO org.apache.hadoop.mapred.Task:
Communication exception: java.io.IOException: Call to /127.0.0.1:45198
failed on local exception: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
at org.apache.hadoop.ipc.Client.call(Client.java:1110)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:643)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:591)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:210)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
at org.apache.hadoop.ipc.Client.call(Client.java:1078)
... 4 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
at org.apache.hadoop.ipc.Client$Connection.writeRpcHeader(Client.java:646)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:535)
... 7 more
2011-12-22 18:06:36,642 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-12-22 18:06:36,643 FATAL org.apache.hadoop.mapred.Child: Error
running child : java.lang.OutOfMemoryError: Java heap space
or:
Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: GC
overhead limit exceeded
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:42)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
at org.apache.hadoop.mapred.Child$3.run(Child.java:157)
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.hdfs.DFSClient).
log4j:WARN Please initialize the log4j system properly.
or:
2011-12-22 18:09:03,417 FATAL org.apache.hadoop.mapred.Child: Error
running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hbase.client.Put.readFields(Put.java:495)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:60)
at org.apache.hadoop.hbase.mapreduce.PutSortReducer.reduce(PutSortReducer.java:40)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Unfortunately, our machines can only have 8 GB of RAM in total. We have 9 TaskTrackers / RegionServers and one server acting as JobTracker, HBase Master and ZooKeeper.
We use Cloudera CDH3u2, HBase 0.90.4.
HBase heap: 4 GB
Java heap space: 1.5 GB
We run pretty much the standard configuration.
It seems to run fine for smaller datasets (e.g. 25 M or 50 M records).
When I try to load the data using TableOutputFormat from a map-only job, it runs without errors, but slowly (3.5 hours, compared to a few minutes with the bulkloader).
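That slower variant is set up roughly like this (again only a sketch; it reuses the PutMapper from the bulkload sketch above, and the table name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DirectPutJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");   // placeholder table name
    Job job = new Job(conf, "direct-put");
    job.setJarByClass(DirectPutJob.class);
    job.setMapperClass(BulkLoadJob.PutMapper.class);        // same mapper as in the bulkload sketch
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);   // map-only: each Put goes straight to the regionservers
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}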
I would like to go a bit further on data size (400 M records in a 42 GB file, or 800 M records in an 82 GB file).
Can it really be that the error comes only from Java heap space and garbage collection? Is there something else to consider?
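If it really is just heap, would raising the per-task child heap be the right knob? Something like this in the job driver (the -Xmx value is only an example, to be weighed against the 4 GB HBase heap on each node):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ChildHeapExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Example value only: every concurrent task slot gets its own -Xmx.
    conf.set("mapred.child.java.opts", "-Xmx2048m");
    // This conf would then be passed to the bulkload Job as sketched above.
  }
}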
Thank you,
Christopher