Hi folks,
I feel a little daft asking this, and suspect I am missing the obvious...
Can someone please tell me how I can do a ParDo following a Partition?
In Spark I'd just repartition(...) and then map(), but I can't see in the
Beam API how to run a ParDo on each partition in parallel. Do I need to
multithread manually?
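
For concreteness, this is the Spark pattern I have in mind (untested
sketch; parseToSolrDoc is just a stand-in for the real conversion):

// Spark equivalent (sketch only; parseToSolrDoc is a placeholder)
JavaRDD<UntypedOccurrence> records = ...; // arrives as a single partition
JavaRDD<SolrInputDocument> docs = records
    .repartition(10)                  // spread the data over 10 partitions
    .map(rec -> parseToSolrDoc(rec)); // expensive op now runs in parallel
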
I tried this:
PCollectionList<UntypedOccurrence> partitioned =
    verbatimRecords.apply(Partition.of(10, new RecordPartitioner()));

// the branches below do not run in parallel on the Spark runner...
for (PCollection<UntypedOccurrence> untyped : partitioned.getAll()) {
  PCollection<SolrInputDocument> inputDocs = untyped.apply(
      "convert-to-solr-format", ParDo.of(new ParseToSOLRDoc()));
  inputDocs.apply(
      SolrIO.write().to("solr-load").withConnectionConfiguration(conn));
}
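
I assume that once the branches behave I could also merge them back
together before the write, rather than writing each branch separately --
something along these lines (untested sketch; java.util and Beam imports
elided as above, and the per-branch names made unique to keep the runner
happy):

// Untested sketch: convert each partition, then merge the branches with Flatten
List<PCollection<SolrInputDocument>> branches = new ArrayList<>();
for (int i = 0; i < partitioned.size(); i++) {
  branches.add(partitioned.get(i).apply(
      "convert-to-solr-format-" + i, ParDo.of(new ParseToSOLRDoc())));
}
PCollection<SolrInputDocument> merged =
    PCollectionList.of(branches).apply(Flatten.pCollections());
merged.apply(SolrIO.write().to("solr-load").withConnectionConfiguration(conn));
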
[For background: I'm using a non-splittable custom Hadoop InputFormat,
which means I end up with an RDD in a single partition and need to split
it to run an expensive op in parallel]
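
In case the setup matters, the read is wired up roughly like this (from
memory, so details may be off; MyInputFormat is a placeholder for the
custom format):

// Roughly how the pipeline reads via the custom InputFormat (sketch)
Configuration hadoopConf = new Configuration();
hadoopConf.setClass("mapreduce.job.inputformat.class",
    MyInputFormat.class, InputFormat.class);
hadoopConf.setClass("key.class", Text.class, Object.class);
hadoopConf.setClass("value.class", UntypedOccurrence.class, Object.class);
PCollection<UntypedOccurrence> verbatimRecords = pipeline
    .apply(HadoopInputFormatIO.<Text, UntypedOccurrence>read()
        .withConfiguration(hadoopConf))
    .apply(Values.create());
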
Thanks,
Tim