Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Jean-Baptiste Onofré Wed, 15 Nov 2017 01:08:44 -0800

I think it's also related to the discussion Romain raised on the dev mailinglist (gap between batch size, checkpointing & bundles).


Regards
JB


On 11/15/2017 09:53 AM, Etienne Chauchot wrote:

Hi Chet,
What you say is totally true, docs written using ElasticSearchIO will alwayshave an ES generated id. But it might change in the future, indeed it might be agood thing to allow the user to pass an id. Just in 5 seconds thinking, I see 3possible designs for that.
a.(simplest) use a json special field for the id, if it is provided by the userin the input json then it is used, auto-generated id otherwise.
b. (a bit less user friendly) PCollection<KV> with K as an id. But forces theuser to do a Pardo before writing to ES to output KV pairs of <id, json>
c. (a lot more complex) Allow the IO to serialize/deserialize java beans andhave an String id field. Matching java types to ES types is quite tricky, so,for now we just relied on the user to serialize his beans into json and let ESmatch the types automatically.
Related to the problems you raise bellow:
1. Well, the bundle is the commit entity of beam. Consider the case ofESIO.batchSize being < to bundle size. While processing records, when the numberof elements reaches batchSize, an ES bulk insert will be issued but nofinishBundle. If there is a problem later on in the bundle processing before thefinishBundle, the checkpoint will still be at the beginning of the bundle, soall the bundle will be retried leading to duplicate documents. Thanks forraising that! I'm CCing the dev list so that someone could correct me on thecheckpointing mecanism if I'm missing something. Besides I'm thinking aboutforcing the user to provide an id in all cases to workaround this issue.
2. Correct.

Best,
Etienne

Le 15/11/2017 à 02:16, Chet Aldrich a écrit :
Hello all!
So I’ve been using the ElasticSearchIO sink for a project (unfortunately it’sElasticsearch 5.x, and so I’ve been messing around with the latest RC) and I’mfinding that it doesn’t allow for changing the document ID, but only lets youpass in a record, which means that the document ID is auto-generated. See thisline for what specifically is happening:
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
Essentially the data part of the document is being placed but it doesn’t allowfor other properties, such as the document ID, to be set.
This leads to two problems:
1. Beam doesn’t necessarily guarantee exactly-once execution for a given itemin a PCollection, as I understand it. This means that you may get more thanone record in Elastic for a given item in a PCollection that you pass in.
2. You can’t do partial updates to an index. If you run a batch job once, andthen run the batch job again on the same index without clearing it, you justdouble everything in there.
Is there any good way around this?
I’d be happy to try writing up a PR for this in theory, but not sure how tobest approach it. Also would like to figure out a way to get around this inthe meantime, if anyone has any ideas.
Best,

Chet
P.S. CCed [email protected] <mailto:[email protected]> because it seemslike he’s been doing work related to the elastic sink.


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Reply via email to