Hey all,

A while ago, I patched ElasticsearchIO to support partial updates and deletes. However, I did not consider my patch pull-request-worthy, as the JSON parsing was done inefficiently (each document was parsed twice).

Partial updates have been supported since Beam 2.5.0, so the only thing I’m still missing is the ability to send bulk delete requests. We’re using entity updates for event sourcing in our data lake and need to persist deleted entities in Elasticsearch. We’ve been using my patch in production for the last year, but I would like to contribute so that the functionality we need gets into one of the next releases.

I’ve created a gist that works for me, but it is still inefficient (it parses twice: once to check the ‘_action’ field, once to get the metadata). Each document I want to delete needs an additional ‘_action’ field with the value ‘delete’. It doesn’t matter that the document still contains the redundant field, as the delete action only requires the metadata. I’ve added the method isDelete() and made some changes to the processElement() method.

https://gist.github.com/wscheep/26cca4bda0145ffd38faf7efaf2c21b9

I would like to make my solution more generic so that it fits into the current ElasticsearchIO, and then create a proper pull request. As this would be my first pull request for Beam, can anyone point me in the right direction before I spend too much time creating something that will be rejected?

Some questions at the top of my mind are:

* Is it a good idea to make the ‘action’ part for the bulk API generic?
* Should it be even more generic? (e.g. set an ‘ActionFn’ on the ElasticsearchIO)
* If I want to avoid parsing twice, the parsing should be done outside of the getDocumentMetaData() method. Would this be acceptable?
* Is it possible to avoid passing the action as a field in the document?
* Is there another or better way to get the delete functionality in general?

All feedback is more than welcome.

Cheers,
Wout
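
P.S. To make the ‘ActionFn’ question a bit more concrete, here is a rough sketch of the idea, not the code from the gist and not existing ElasticsearchIO API. The names BulkActionSketch, ACTION_FROM_FIELD and toBulkEntry are made up for illustration. It assumes the document has already been parsed once into a Map (so the action and the metadata can both be read from the same parse), and it only relies on the bulk API rule that a delete entry consists of the metadata line alone, while other actions are followed by a source line.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of ElasticsearchIO. */
final class BulkActionSketch {

  /**
   * Hypothetical stand-in for an 'ActionFn': decides the bulk action from a
   * document that was already parsed once. Here it looks at the '_action' field.
   */
  static final Function<Map<String, Object>, String> ACTION_FROM_FIELD =
      doc -> "delete".equals(doc.get("_action")) ? "delete" : "index";

  /**
   * Builds one entry of a bulk request body.
   * A delete entry is only the metadata line; any other action is followed by
   * the document source on the next line.
   * Assumption: index and id need no JSON escaping. Real code should build the
   * metadata with a JSON library, and older Elasticsearch versions may also
   * need a '_type' in the metadata.
   */
  static String toBulkEntry(String action, String index, String id, String sourceJson) {
    String metadata =
        "{ \"" + action + "\" : { \"_index\" : \"" + index + "\", \"_id\" : \"" + id + "\" } }\n";
    if ("delete".equals(action)) {
      return metadata; // no source line for deletes
    }
    return metadata + sourceJson + "\n";
  }

  private BulkActionSketch() {}
}
```

With something like this, the write step would call the function once per parsed document and pass the result to the entry builder. Whether the action should come from a document field or from a user-supplied function is exactly the open question above.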
