I forgot to mention that the PCollection on which EsIO operates is of type String, where each element is a valid JSON document serialized _without_ pretty printing (i.e. without line breaks). If the PCollection should be of a different type, please let me know; from reading the EsIO source code, I believe a PCollection of String is correct.
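
For concreteness, here is a stripped-down sketch of what that write step looks like on my end. The cluster address, index/type names, batch-size values, and sample documents below are placeholders rather than my real settings:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class EsWriteSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-in for my real PCollection<String>: compact, single-line JSON documents.
    PCollection<String> jsonDocs =
        pipeline.apply(
            Create.of(
                "{\"id\":1,\"name\":\"alpha\"}",
                "{\"id\":2,\"name\":\"beta\"}"));

    jsonDocs.apply(
        "WriteToEs",
        ElasticsearchIO.write()
            .withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {"http://localhost:9200"}, "my-index", "_doc"))
            // Thresholds only; as I understand it, EsIO batches by default anyway.
            .withMaxBatchSize(1000L)
            .withMaxBatchSizeBytes(5L * 1024L * 1024L));

    pipeline.run().waitUntilFinish();
  }
}
```

If I understand the docs correctly, EsIO already batches by default and those two setters only adjust the thresholds, so I would expect bulk requests even without them.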
On 2018/12/07 14:33:17, Evan Galpin <[email protected]> wrote:
> Thanks for confirming that this is unexpected behaviour Tim; certainly the
> EsIO code looks to handle bundling. For the record, I've also confirmed via
> debugger that `flushBatch()` is not being triggered by large document size.
>
> I'm sourcing records from Google's BigQuery. I have 2 queries which each
> create a PCollection. I use a JacksonFactory to convert BigQuery results to a
> valid JSON string (confirmed valid via debugger + linter). I have a few
> Transforms to group the records from the 2 queries together, and then convert
> again to JSON string via Jackson. I do know that the system creates valid
> requests to the bulk API, it's just that it's only 1 document per request.
>
> Thanks for starting the process with this. If there are other specific
> details that I can provide to be helpful, please let me know. Here are the
> versions of modules I'm using now:
>
> Beam SDK (beam-sdks-java-core): 2.8.0
> EsIO (beam-sdks-java-io-elasticsearch): 2.8.0
> BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0
> DirectRunner (beam-runners-direct-java): 2.8.0
> DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0
>
>
> On 2018/12/07 06:36:58, Tim <[email protected]> wrote:
> > Hi Evan
> >
> > That is definitely not the expected behaviour and I believe is covered in
> > tests which use DirectRunner. Are you able to share your pipeline code, or
> > describe how you source your records please? It could be that something
> > else is causing EsIO to see bundles sized at only one record.
> >
> > I’ll verify ES IO behaviour when I get to a computer too.
> >
> > Tim (on phone)
> >
> > > On 6 Dec 2018, at 22:00, [email protected] <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m
> > > able to successfully index documents into my elasticsearch cluster, but
> > > batching does not seem to work. There ends up being a 1:1 ratio between
> > > HTTP requests sent to `/my-index/_doc/_bulk` and the number of documents
> > > in my PCollection to which I apply the ElasticsearchIO PTransform. I’ve
> > > noticed this specifically under the DirectRunner by utilizing a debugger.
> > >
> > > Am I missing something? Is this possibly a difference between execution
> > > environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure
> > > my program is taking advantage of batching/bulk indexing?
> > >
> > > Thanks,
> > > Evan
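
P.S. In case a concrete picture of the upstream side helps, below is a rough sketch of how the records reach EsIO. The project/dataset/table names and the query are placeholders, I've substituted a plain Jackson ObjectMapper where my real code uses Google's JacksonFactory, and my actual pipeline runs a second query and groups the two results before re-serializing:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class BigQueryToJsonSketch {

  /** Serializes one BigQuery row as a compact (non-pretty-printed) JSON string. */
  static class RowToJson extends SimpleFunction<TableRow, String> {
    @Override
    public String apply(TableRow row) {
      try {
        return new ObjectMapper().writeValueAsString(row);
      } catch (Exception e) {
        throw new RuntimeException("Failed to serialize row to JSON", e);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder query; the real pipeline has two of these whose results get grouped.
    PCollection<String> jsonDocs =
        pipeline
            .apply(
                BigQueryIO.readTableRows()
                    .fromQuery("SELECT id, name FROM `my-project.my_dataset.my_table`")
                    .usingStandardSql())
            .apply(MapElements.via(new RowToJson()));

    // jsonDocs is then fed to ElasticsearchIO.write() as shown earlier in the thread.
    pipeline.run().waitUntilFinish();
  }
}
```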
