Hi,

I am working on a pipeline that carries out a number of stages, the last of
which is to build some large JSON objects from information in the preceding
stages.  The JSON objects are then uploaded to Elasticsearch in bulk.

If I carry out a shuffle via a `repartition` call after the JSON documents
have been created, the upload to ES is fast.  But the shuffle itself takes
many tens of minutes and is IO-bound.

If I omit the repartition, the upload to ES takes a long time due to a
complete lack of parallelism.

Currently, the step that precedes the assembling of the JSON documents,
which goes into the final repartition call, is the querying of pairs of
object ids.  In a mapper the ids are resolved to documents by querying
HBase.  The initial pairs of ids are obtained via a query against the SQL
context, and the query result is repartitioned before going into the mapper
that resolves the ids into documents.

It's not clear to me why the final repartition preceding the upload to ES
is required.  I would like to omit it, since it is so expensive and
involves so much network IO, but have not found a way to do this yet.  If I
omit the repartition, the job takes much longer.

Does anyone know what might be going on here, and what I might be able to
do to get rid of the last `repartition` call before the upload to ES?

Eric

Reply via email to