Hi Noam,

It doesn't sound all that surprising that you're CPU bound on a batch import job like this if you consider everything that is going on within the mappers.
Let's say you're importing data for a table with 20 columns. For each line of input data, the following is then occurring within the mapper:

* each of the 20 columns is deserialized from CSV and then re-serialized according to the Phoenix encoding
* 21 KeyValues are then created from these 20 columns (one per column, plus the empty KeyValue that Phoenix adds to every row). Each of these KeyValues includes all of the values from the primary key of the table
* the 21 KeyValues are then added to a buffer along with the KeyValues from other rows, and then partitioned and sorted based on the byte array containing all of the primary key values

The output of this process is also (probably) being compressed before being written to local disk, so the in-memory structures that are being serialized and sorted are probably quite a bit bigger than the amount of physical IO.

In other words, a relatively small amount of input (i.e. one row) results in quite a few operations. The structure of your table (e.g. key size) will probably have an effect on how much CPU is used within the mappers, but I think that it's not that uncommon for the map phase to be CPU bound. (There are two rough sketches of the points above after the quoted message below.)

- Gabriel

On Wed, Jan 7, 2015 at 1:55 PM, Bulvik, Noam <[email protected]> wrote:
> Hi,
>
> We are tuning our system for bulk loading. We managed to load ~250M records
> per hour (~96G of raw input csv data) on a cluster with 8 nodes. We use the MR
> bulk loading tool with a pre-split table and salted key.
>
> What we currently see is that while the mappers are working we have 100% CPU
> usage across the cluster. It was our impression that the mapper would be I/O
> bound and not so much CPU intensive.
>
> Any idea what else we can tune/check?
>
> Regards,
>
> Noam
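To make the per-row cost a bit more concrete, here is a rough, simplified sketch (this is not Phoenix's actual code; the column family, the qualifier names, and the "_0" empty-column qualifier are only illustrative) of how one input line fans out into KeyValues that each carry the full row key:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerRowCostSketch {

        // Hypothetical example: one parsed CSV line becomes one KeyValue per
        // non-PK column, and every one of those KeyValues repeats the row key.
        static List<KeyValue> toKeyValues(byte[] rowKey, byte[] family,
                                          String[] columnNames,
                                          byte[][] columnValues) {
            List<KeyValue> kvs = new ArrayList<>();
            for (int i = 0; i < columnNames.length; i++) {
                // Each KeyValue carries the complete row key (all PK values),
                // so a wide key multiplies the bytes handled per input line.
                kvs.add(new KeyValue(rowKey, family,
                        Bytes.toBytes(columnNames[i]), columnValues[i]));
            }
            // Plus one extra "empty" KeyValue per row, which is why 20 columns
            // end up as 21 KeyValues in the description above.
            kvs.add(new KeyValue(rowKey, family, Bytes.toBytes("_0"),
                    new byte[0]));
            return kvs;
        }
    }

All of those row-key copies are what subsequently get buffered, partitioned and sorted, so a wide primary key multiplies the work the mapper does for every input line.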

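And since map output compression was mentioned above: if you want to check how much CPU that accounts for, the relevant MR2 settings look roughly like this (a sketch only; it assumes the Snappy codec is available on your cluster, and turning compression off for a test run would shift work from CPU to disk/network IO):

    import org.apache.hadoop.conf.Configuration;

    public class MapOutputCompressionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Compress intermediate map output before it is spilled and
            // shuffled. This trades mapper CPU for less physical IO, which
            // is consistent with seeing 100% CPU while the mappers run.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");

            // ... the Configuration would then be passed to the bulk load job.
        }
    }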