Hi Noam,

It doesn't sound all that surprising that you're CPU bound on a batch import job like this if you consider everything that is going on within the mappers.
Let's say you're importing data for a table with 20 columns. For each line of input data, the following is then occurring within the mapper:

* each of the 20 columns is deserialized from CSV and then re-serialized according to the Phoenix encoding
* 21 KeyValues are then created from these 20 columns (one per column, plus the empty KeyValue that Phoenix adds to every row). Each of these KeyValues includes all of the values from the primary key of the table
* the 21 KeyValues are then added to a buffer along with the KeyValues from other rows, and then partitioned and sorted based on the byte array containing all of the primary key values

The output of this process is also (probably) being compressed before being written to local disk, so the in-memory structures that are being serialized and sorted are probably quite a bit bigger than the amount of physical IO.

In other words, a relatively small amount of input (i.e. one row) results in quite a few operations. The structure of your table (e.g. key size) will probably have an effect on how much CPU is used within the mappers, but I think that it's not that uncommon for the map phase to be CPU bound. (There are two rough sketches of the points above after the quoted message below.)

- Gabriel

On Wed, Jan 7, 2015 at 1:55 PM, Bulvik, Noam <[email protected]> wrote:
> Hi,
>
> We are tuning our system for bulk loading. We managed to load ~250M records
> per hour (~96G of raw input csv data) on a cluster with 8 nodes. We use the MR
> bulk loading tool with a pre-split table and salted key.
>
> What we currently see is that while the mappers are working we have 100% CPU
> usage across the cluster. It was our impression that the mapper would be I/O
> bound and not so much CPU intensive.
>
> Any idea what else we can tune/check?
>
> Regards,
>
> Noam
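To make the per-row cost a bit more concrete, here is a rough, simplified sketch (this is not Phoenix's actual code; the column family, the qualifier names, and the "_0" empty-column qualifier are only illustrative) of how one input line fans out into KeyValues that each carry the full row key:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerRowCostSketch {

        // Hypothetical example: one parsed CSV line becomes one KeyValue per
        // non-PK column, and every one of those KeyValues repeats the row key.
        static List<KeyValue> toKeyValues(byte[] rowKey, byte[] family,
                                          String[] columnNames,
                                          byte[][] columnValues) {
            List<KeyValue> kvs = new ArrayList<>();
            for (int i = 0; i < columnNames.length; i++) {
                // Each KeyValue carries the complete row key (all PK values),
                // so a wide key multiplies the bytes handled per input line.
                kvs.add(new KeyValue(rowKey, family,
                        Bytes.toBytes(columnNames[i]), columnValues[i]));
            }
            // Plus one extra "empty" KeyValue per row, which is why 20 columns
            // end up as 21 KeyValues in the description above.
            kvs.add(new KeyValue(rowKey, family, Bytes.toBytes("_0"),
                    new byte[0]));
            return kvs;
        }
    }

All of those row-key copies are what subsequently get buffered, partitioned and sorted, so a wide primary key multiplies the work the mapper does for every input line.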

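And since map output compression was mentioned above: if you want to check how much CPU that accounts for, the relevant MR2 settings look roughly like this (a sketch only; it assumes the Snappy codec is available on your cluster, and turning compression off for a test run would shift work from CPU to disk/network IO):

    import org.apache.hadoop.conf.Configuration;

    public class MapOutputCompressionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Compress intermediate map output before it is spilled and
            // shuffled. This trades mapper CPU for less physical IO, which
            // is consistent with seeing 100% CPU while the mappers run.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");

            // ... the Configuration would then be passed to the bulk load job.
        }
    }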