Thanks for confirming that this is unexpected behaviour Tim; certainly the EsIO code looks to handle bundling. For the record, I've also confirmed via debugger that `flushBatch()` is not being triggered by large document size.
I'm sourcing records from Google's BigQuery. I have 2 queries which each create a PCollection. I use a JacksonFactory to convert BigQuery results to a valid JSON string (confirmed valid via debugger + linter). I have a few Transforms to group the records from the 2 queries together, and then convert again to JSON string via Jackson. I do know that the system creates valid requests to the bulk API, it's just that it's only 1 document per request. Thanks for starting the process with this. If there are other specific details that I can provide to be helpful, please let me know. Here are the versions of modules I'm using now: Beam SDK (beam-sdks-java-core): 2.8.0 EsIO (beam-sdks-java-io-elasticsearch): 2.8.0 BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0 DirectRunner (beam-runners-direct-java): 2.8.0 DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0 On 2018/12/07 06:36:58, Tim <[email protected]> wrote: > Hi Evan > > That is definitely not the expected behaviour and I believe is covered in > tests which use DirectRunner. Are you able to share your pipeline code, or > describe how you source your records please? It could be that something else > is causing EsIO to see bundles sized at only one record. > > I’ll verify ES IO behaviour when I get to a computer too. > > Tim (on phone) > > > On 6 Dec 2018, at 22:00, [email protected] <[email protected]> wrote: > > > > Hi all, > > > > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m able > > to successfully index documents into my elasticsearch cluster, but batching > > does not seem to work. There ends up being a 1:1 ratio between HTTP > > requests sent to `/my-index/_doc/_bulk` and the number of documents in my > > PCollection to which I apply the ElasticsearchIO PTransform. I’ve noticed > > this specifically under the DirectRunner by utilizing a debugger. > > > > Am I missing something? Is this possibly a difference between execution > > environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure my > > program is taking advantage of batching/bulk indexing? > > > > Thanks, > > Evan >
