Hi folks,
I feel a little daft asking this, and suspect I am missing the obvious...
Can someone please tell me how I can do a ParDo following a Partition?
In Spark I'd just repartition(...) and then map(), but I can't see in the
Beam API how to run a ParDo on each partition in parallel. Do I need to
multithread manually?
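
For concreteness, this is the Spark pattern I have in mind (untested
sketch; parseToSolrDoc is just a stand-in for the real conversion):

// Spark equivalent (sketch only; parseToSolrDoc is a placeholder)
JavaRDD<UntypedOccurrence> records = ...; // arrives as a single partition
JavaRDD<SolrInputDocument> docs = records
    .repartition(10)                  // spread the data over 10 partitions
    .map(rec -> parseToSolrDoc(rec)); // expensive op now runs in parallel
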
I tried this:
PCollectionList<UntypedOccurrence> partitioned =
    verbatimRecords.apply(Partition.of(10, new RecordPartitioner()));

// the branches below do not run in parallel on the Spark runner...
for (PCollection<UntypedOccurrence> untyped : partitioned.getAll()) {
  PCollection<SolrInputDocument> inputDocs = untyped.apply(
      "convert-to-solr-format", ParDo.of(new ParseToSOLRDoc()));
  inputDocs.apply(
      SolrIO.write().to("solr-load").withConnectionConfiguration(conn));
}
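
I assume that once the branches behave I could also merge them back
together before the write, rather than writing each branch separately --
something along these lines (untested sketch; java.util and Beam imports
elided as above, and the per-branch names made unique to keep the runner
happy):

// Untested sketch: convert each partition, then merge the branches with Flatten
List<PCollection<SolrInputDocument>> branches = new ArrayList<>();
for (int i = 0; i < partitioned.size(); i++) {
  branches.add(partitioned.get(i).apply(
      "convert-to-solr-format-" + i, ParDo.of(new ParseToSOLRDoc())));
}
PCollection<SolrInputDocument> merged =
    PCollectionList.of(branches).apply(Flatten.pCollections());
merged.apply(SolrIO.write().to("solr-load").withConnectionConfiguration(conn));
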
[For background: I'm using a non-splittable custom Hadoop InputFormat,
which means I end up with an RDD in a single partition and need to split
it to run an expensive op in parallel]
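
In case the setup matters, the read is wired up roughly like this (from
memory, so details may be off; MyInputFormat is a placeholder for the
custom format):

// Roughly how the pipeline reads via the custom InputFormat (sketch)
Configuration hadoopConf = new Configuration();
hadoopConf.setClass("mapreduce.job.inputformat.class",
    MyInputFormat.class, InputFormat.class);
hadoopConf.setClass("key.class", Text.class, Object.class);
hadoopConf.setClass("value.class", UntypedOccurrence.class, Object.class);
PCollection<UntypedOccurrence> verbatimRecords = pipeline
    .apply(HadoopInputFormatIO.<Text, UntypedOccurrence>read()
        .withConfiguration(hadoopConf))
    .apply(Values.create());
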
Thanks,
Tim