Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Etienne Chauchot Thu, 16 Nov 2017 07:58:50 -0800

Chet,

FYI, here is the ticket and the design proposal:https://issues.apache.org/jira/browse/BEAM-3201. If you still want tocode that improvement, give me your jira id and I will assign the ticketto you. Otherwise I can code it as well.


Best

Etienne


Le 16/11/2017 à 09:19, Etienne Chauchot a écrit :

Hi,
Thanks for the offer, I'd be happy to review your PR. Just wait a bituntil I have opened a proper ticket for that. I still need to thinkmore about the design. Among other things, I have to check what ES devteam did for other big data ES IO (es_hadoop) on that particularpoint. Besides, I think we also need to deal with the id at read timenot only at write time. I'll give some details in the ticket.
Le 15/11/2017 à 20:08, Chet Aldrich a écrit :
Given that this seems like a change that should probably happen, andI’d like to help contribute if possible, a few questions and mycurrent opinion:
So I’m leaning towards approach B here, which is:
b. (a bit less user friendly) PCollection<KV> with K as an id. Butforces the user to do a Pardo before writing to ES to output KVpairs of <id, json>
I think that the reduction in user-friendliness may be outweighed bythe fact that this obviates some of the issues surrounding a failurewhen finishing a bundle. Additionally, this /forces/ the user toprovide a document id, which I think is probably better practice.
Yes as I wrote before, I think it is better to force the user toprovide an id (at least for index updates, exactly-one semantics is alarger beam subject than this IO scope). Regarding design, plan b isnot the better one IMHO because it changes the IO public API. I'm morein favor of plan a with the ability for the user to tell what field ishis doc id.
This will also probably lead to fewer frustrations around “magic”code that just pulls something in if it happens to be there, anddoesn’t if not. We’ll need to rely on the user catching thisfunctionality in the docs or the code itself to take advantage of it.
IMHO it’d be generally better to enforce this at compile time becauseit does have an effect on whether the pipeline produces duplicates onfailure. Additionally, we get the benefit of relatively intuitivebehavior where if the user passes in the same Key value, it’ll updatea record in ES, and if the key is different then it will create a newrecord.
Totally agree, id enforcement at compile time, no auto-generation
Curious to hear thoughts on this. If this seems reasonable I’ll goahead and create a JIRA for tracking and start working on a PR forthis. Also, if it’d be good to loop in the dev mailing list beforestarting let me know, I’m pretty new to this.
I'll create the ticket and we will loop on design in the comments.
Best
Etienne
Chet
On Nov 15, 2017, at 12:53 AM, Etienne Chauchot <echauc...@apache.org<mailto:echauc...@apache.org>> wrote:
Hi Chet,
What you say is totally true, docs written using ElasticSearchIOwill always have an ES generated id. But it might change in thefuture, indeed it might be a good thing to allow the user to pass anid. Just in 5 seconds thinking, I see 3 possible designs for that.
a.(simplest) use a json special field for the id, if it is providedby the user in the input json then it is used, auto-generated idotherwise.
b. (a bit less user friendly) PCollection<KV> with K as an id. Butforces the user to do a Pardo before writing to ES to output KVpairs of <id, json>
c. (a lot more complex) Allow the IO to serialize/deserialize javabeans and have an String id field. Matching java types to ES typesis quite tricky, so, for now we just relied on the user to serializehis beans into json and let ES match the types automatically.
Related to the problems you raise bellow:
1. Well, the bundle is the commit entity of beam. Consider the caseof ESIO.batchSize being < to bundle size. While processing records,when the number of elements reaches batchSize, an ES bulk insertwill be issued but no finishBundle. If there is a problem later onin the bundle processing before the finishBundle, the checkpointwill still be at the beginning of the bundle, so all the bundle willbe retried leading to duplicate documents. Thanks for raising that!I'm CCing the dev list so that someone could correct me on thecheckpointing mecanism if I'm missing something. Besides I'mthinking about forcing the user to provide an id in all cases toworkaround this issue.
2. Correct.

Best,
Etienne

Le 15/11/2017 à 02:16, Chet Aldrich a écrit :
Hello all!
So I’ve been using the ElasticSearchIO sink for a project(unfortunately it’s Elasticsearch 5.x, and so I’ve been messingaround with the latest RC) and I’m finding that it doesn’t allowfor changing the document ID, but only lets you pass in a record,which means that the document ID is auto-generated. See this linefor what specifically is happening:
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
Essentially the data part of the document is being placed but itdoesn’t allow for other properties, such as the document ID, to beset.
This leads to two problems:
1. Beam doesn’t necessarily guarantee exactly-once execution for agiven item in a PCollection, as I understand it. This means thatyou may get more than one record in Elastic for a given item in aPCollection that you pass in.
2. You can’t do partial updates to an index. If you run a batch jobonce, and then run the batch job again on the same index withoutclearing it, you just double everything in there.
Is there any good way around this?
I’d be happy to try writing up a PR for this in theory, but notsure how to best approach it. Also would like to figure out a wayto get around this in the meantime, if anyone has any ideas.
Best,

Chet
P.S. CCed echauc...@gmail.com <mailto:echauc...@gmail.com> becauseit seems like he’s been doing work related to the elastic sink.

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

Reply via email to