Re: @DoFn.Setup not called

2017-11-16 Thread Eugene Kirpichov
Could you give more details, e.g. a code snippet that reproduces the issue,
and describe how you determine that @Setup hasn't been called?

On Thu, Nov 16, 2017 at 6:58 PM Derek Hao Hu  wrote:

> ​I've been using DoFn.Setup method in Dataflow and it seems to be working
> fine.​
>
> On Thu, Nov 16, 2017 at 4:56 PM, Jacob Marble  wrote:
>
>> This one is weird.
>>
>> A DoFn I wrote:
>> - stateful
>> - used plenty in a streaming pipeline
>> - direct and dataflow runners
>> - works fine
>>
>> Now:
>> - new batch pipeline
>> - @DoFn.Setup method not called
>> - direct runner works properly (logs from setup method are output)
>> - dataflow runner simply doesn't call the setup method
>>
>> Is this possibly a Beam misuse? Javadoc for DoFn.Setup doesn't hint at
>> anything, so I'm suspecting Dataflow bug?
>>
>> Jacob
>>
>
>
>
> --
> Derek Hao Hu
>
> Software Engineer | Snapchat
> Snap Inc.
>


Re: @DoFn.Setup not called

2017-11-16 Thread Derek Hao Hu
​I've been using DoFn.Setup method in Dataflow and it seems to be working
fine.​

On Thu, Nov 16, 2017 at 4:56 PM, Jacob Marble  wrote:

> This one is weird.
>
> A DoFn I wrote:
> - stateful
> - used plenty in a streaming pipeline
> - direct and dataflow runners
> - works fine
>
> Now:
> - new batch pipeline
> - @DoFn.Setup method not called
> - direct runner works properly (logs from setup method are output)
> - dataflow runner simply doesn't call the setup method
>
> Is this possibly a Beam misuse? Javadoc for DoFn.Setup doesn't hint at
> anything, so I'm suspecting Dataflow bug?
>
> Jacob
>



-- 
Derek Hao Hu

Software Engineer | Snapchat
Snap Inc.


Re: Slack Channel

2017-11-16 Thread Lukasz Cwik
Invites have been sent to you Jacob and Chet, please check your inboxes.

On Thu, Nov 16, 2017 at 5:42 PM, Chet Aldrich 
wrote:

> If you wouldn’t mind I’d like an invite as well.
>
> Chet
>
> On Nov 16, 2017, at 4:58 PM, Jacob Marble  wrote:
>
> Me too, if you don't mind.
>
> Jacob
>
> On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik  wrote:
>
>> Invite sent, welcome.
>>
>> On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang  wrote:
>>
>>> Hi,
>>>
>>> Please add me to the slack channel.
>>>
>>> Thanks,
>>> Fred
>>>
>>> Ps. I think "BeamTV" would be a great YouTube channel ;)
>>>
>>
>>
>
>


Re: Slack Channel

2017-11-16 Thread Chet Aldrich
If you wouldn’t mind I’d like an invite as well. 

Chet

> On Nov 16, 2017, at 4:58 PM, Jacob Marble  wrote:
> 
> Me too, if you don't mind.
> 
> Jacob
> 
> On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik  > wrote:
> Invite sent, welcome.
> 
> On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang  > wrote:
> Hi, 
> 
> Please add me to the slack channel. 
> 
> Thanks,
> Fred
> 
> Ps. I think "BeamTV" would be a great YouTube channel ;)
> 
> 



Re: Slack Channel

2017-11-16 Thread Jacob Marble
Me too, if you don't mind.

Jacob

On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik  wrote:

> Invite sent, welcome.
>
> On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang  wrote:
>
>> Hi,
>>
>> Please add me to the slack channel.
>>
>> Thanks,
>> Fred
>>
>> Ps. I think "BeamTV" would be a great YouTube channel ;)
>>
>
>


Re: [VOTE] Choose the "new" Spark runner

2017-11-16 Thread Jacob Marble
[ ] Use Spark 1 & Spark 2 Support Branch
[X] Use Spark 2 Only Branch

Spark 2 has been out for a while, so probably not going to offend many
people.

Jacob

On Thu, Nov 16, 2017 at 5:45 AM, Neville Dipale 
wrote:

> [ ] Use Spark 1 & Spark 2 Support Branch
> [X] Use Spark 2 Only Branch
>
> On 16 November 2017 at 15:08, Jean-Baptiste Onofré 
> wrote:
>
>> Hi guys,
>>
>> To illustrate the current discussion about Spark versions support, you
>> can take a look on:
>>
>> --
>> Spark 1 & Spark 2 Support Branch
>>
>> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES
>>
>> This branch contains a Spark runner common module compatible with both
>> Spark 1.x and 2.x. For convenience, we introduced spark1 & spark2
>> modules/artifacts containing just a pom.xml to define the dependencies set.
>>
>> --
>> Spark 2 Only Branch
>>
>> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY
>>
>> This branch is an upgrade to Spark 2.x and "drop" support of Spark 1.x.
>>
>> As I'm ready to merge one of the other in the PR, I would like to
>> complete the vote/discussion pretty soon.
>>
>> Correct me if I'm wrong, but it seems that the preference is to drop
>> Spark 1.x to focus only on Spark 2.x (for the Spark 2 Only Branch).
>>
>> I would like to call a final vote to act the merge I will do:
>>
>> [ ] Use Spark 1 & Spark 2 Support Branch
>> [ ] Use Spark 2 Only Branch
>>
>> This informal vote is open for 48 hours.
>>
>> Please, let me know what your preference is.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
>>
>>> Hi Beamers,
>>>
>>> I'm forwarding this discussion & vote from the dev mailing list to the
>>> user mailing list.
>>> The goal is to have your feedback as user.
>>>
>>> Basically, we have two options:
>>> 1. Right now, in the PR, we support both Spark 1.x and 2.x using three
>>> artifacts (common, spark1, spark2). You, as users, pick up spark1 or spark2
>>> in your dependencies set depending the Spark target version you want.
>>> 2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0.
>>> If you still want to use Spark 1.x, then, you will be stuck up to Beam
>>> 2.2.0.
>>>
>>> Thoughts ?
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>>
>>>  Forwarded Message 
>>> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
>>> Date: Wed, 8 Nov 2017 08:27:58 +0100
>>> From: Jean-Baptiste Onofré 
>>> Reply-To: d...@beam.apache.org
>>> To: d...@beam.apache.org
>>>
>>> Hi all,
>>>
>>> as you might know, we are working on Spark 2.x support in the Spark
>>> runner.
>>>
>>> I'm working on a PR about that:
>>>
>>> https://github.com/apache/beam/pull/3808
>>>
>>> Today, we have something working with both Spark 1.x and 2.x from a code
>>> standpoint, but I have to deal with dependencies. It's the first step of
>>> the update as I'm still using RDD, the second step would be to support
>>> dataframe (but for that, I would need PCollection elements with schemas,
>>> that's another topic on which Eugene, Reuven and I are discussing).
>>>
>>> However, as all major distributions now ship Spark 2.x, I don't think
>>> it's required anymore to support Spark 1.x.
>>>
>>> If we agree, I will update and cleanup the PR to only support and focus
>>> on Spark 2.x.
>>>
>>> So, that's why I'm calling for a vote:
>>>
>>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>>[ ] 0 (I don't care ;))
>>>[ ] -1, I would like to still support Spark 1.x, and so having
>>> support of both Spark 1.x and 2.x (please provide specific comment)
>>>
>>> This vote is open for 48 hours (I have the commits ready, just waiting
>>> the end of the vote to push on the PR).
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


@DoFn.Setup not called

2017-11-16 Thread Jacob Marble
This one is weird.

A DoFn I wrote:
- stateful
- used plenty in a streaming pipeline
- direct and dataflow runners
- works fine

Now:
- new batch pipeline
- @DoFn.Setup method not called
- direct runner works properly (logs from setup method are output)
- dataflow runner simply doesn't call the setup method

Is this possibly a Beam misuse? Javadoc for DoFn.Setup doesn't hint at
anything, so I'm suspecting Dataflow bug?

Jacob


Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread Chet Aldrich
Sure, I’d be happy to take it on. My JIRA ID is chetaldrich. We can continue 
discussion on that ticket. 

Chet

> On Nov 16, 2017, at 7:57 AM, Etienne Chauchot  wrote:
> 
> Chet, 
> FYI, here is the ticket and the design proposal: 
> https://issues.apache.org/jira/browse/BEAM-3201 
> . If you still want to code 
> that improvement, give me your jira id and I will assign the ticket to you. 
> Otherwise I can code it as well.
> 
> Best
> 
> Etienne
> 
> Le 16/11/2017 à 09:19, Etienne Chauchot a écrit :
>> Hi, 
>> Thanks for the offer, I'd be happy to review your PR. Just wait a bit until 
>> I have opened a proper ticket for that. I still need to think more about the 
>> design. Among other things, I have to check what ES dev team did for other 
>> big data ES IO (es_hadoop) on that particular point. Besides, I think we 
>> also need to deal with the id at read time not only at write time. I'll give 
>> some details in the ticket.
>> 
>> Le 15/11/2017 à 20:08, Chet Aldrich a écrit :
>>> Given that this seems like a change that should probably happen, and I’d 
>>> like to help contribute if possible, a few questions and my current 
>>> opinion: 
>>> 
>>> So I’m leaning towards approach B here, which is:
 b. (a bit less user friendly) PCollection with K as an id. But forces 
 the user to do a Pardo before writing to ES to output KV pairs of 
 
>>> I think that the reduction in user-friendliness may be outweighed by the 
>>> fact that this obviates some of the issues surrounding a failure when 
>>> finishing a bundle. Additionally, this forces the user to provide a 
>>> document id, which I think is probably better practice.
>>> 
>> Yes as I wrote before, I think it is better to force the user to provide an 
>> id (at least for index updates, exactly-one semantics is a larger beam 
>> subject than this IO scope). Regarding design, plan b is not the better one 
>> IMHO because it changes the IO public API. I'm more in favor of plan a with 
>> the ability for the user to tell what field is his doc id.
>>> This will also probably lead to fewer frustrations around “magic” code that 
>>> just pulls something in if it happens to be there, and doesn’t if not. 
>>> We’ll need to rely on the user catching this functionality in the docs or 
>>> the code itself to take advantage of it. 
>>> 
>>> IMHO it’d be generally better to enforce this at compile time because it 
>>> does have an effect on whether the pipeline produces duplicates on failure. 
>>> Additionally, we get the benefit of relatively intuitive behavior where if 
>>> the user passes in the same Key value, it’ll update a record in ES, and if 
>>> the key is different then it will create a new record. 
>> Totally agree, id enforcement at compile time, no auto-generation
>>> 
>>> Curious to hear thoughts on this. If this seems reasonable I’ll go ahead 
>>> and create a JIRA for tracking and start working on a PR for this. Also, if 
>>> it’d be good to loop in the dev mailing list before starting let me know, 
>>> I’m pretty new to this. 
>> I'll create the ticket and we will loop on design in the comments.
>> Best
>> Etienne
>>> 
>>> Chet 
>>> 
 On Nov 15, 2017, at 12:53 AM, Etienne Chauchot > wrote:
 
 Hi Chet,
 
 What you say is totally true, docs written using ElasticSearchIO will 
 always have an ES generated id. But it might change in the future, indeed 
 it might be a good thing to allow the user to pass an id. Just in 5 
 seconds thinking, I see 3 possible designs for that. 
 a.(simplest) use a json special field for the id, if it is provided by the 
 user in the input json then it is used, auto-generated id otherwise.
 
 b. (a bit less user friendly) PCollection with K as an id. But forces 
 the user to do a Pardo before writing to ES to output KV pairs of 
 
 c. (a lot more complex) Allow the IO to serialize/deserialize java beans 
 and have an String id field. Matching java types to ES types is quite 
 tricky, so, for now we just relied on the user to serialize his beans into 
 json and let ES match the types automatically.
 
 Related to the problems you raise bellow:
 
 1. Well, the bundle is the commit entity of beam. Consider the case of 
 ESIO.batchSize being < to bundle size. While processing records, when the 
 number of elements reaches batchSize, an ES bulk insert will be issued but 
 no finishBundle. If there is a problem later on in the bundle processing 
 before the finishBundle, the checkpoint will still be at the beginning of 
 the bundle, so all the bundle will be retried leading to duplicate 
 documents. Thanks for raising that! I'm CCing the dev list so that someone 
 could correct me on the checkpointing mecanism if I'm missing something. 
 

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread Etienne Chauchot

Chet,

FYI, here is the ticket and the design proposal: 
https://issues.apache.org/jira/browse/BEAM-3201. If you still want to 
code that improvement, give me your jira id and I will assign the ticket 
to you. Otherwise I can code it as well.


Best

Etienne


Le 16/11/2017 à 09:19, Etienne Chauchot a écrit :


Hi,

Thanks for the offer, I'd be happy to review your PR. Just wait a bit 
until I have opened a proper ticket for that. I still need to think 
more about the design. Among other things, I have to check what ES dev 
team did for other big data ES IO (es_hadoop) on that particular 
point. Besides, I think we also need to deal with the id at read time 
not only at write time. I'll give some details in the ticket.



Le 15/11/2017 à 20:08, Chet Aldrich a écrit :
Given that this seems like a change that should probably happen, and 
I’d like to help contribute if possible, a few questions and my 
current opinion:


So I’m leaning towards approach B here, which is:


b. (a bit less user friendly) PCollection with K as an id. But 
forces the user to do a Pardo before writing to ES to output KV 
pairs of 


I think that the reduction in user-friendliness may be outweighed by 
the fact that this obviates some of the issues surrounding a failure 
when finishing a bundle. Additionally, this /forces/ the user to 
provide a document id, which I think is probably better practice.


Yes as I wrote before, I think it is better to force the user to 
provide an id (at least for index updates, exactly-one semantics is a 
larger beam subject than this IO scope). Regarding design, plan b is 
not the better one IMHO because it changes the IO public API. I'm more 
in favor of plan a with the ability for the user to tell what field is 
his doc id.


This will also probably lead to fewer frustrations around “magic” 
code that just pulls something in if it happens to be there, and 
doesn’t if not. We’ll need to rely on the user catching this 
functionality in the docs or the code itself to take advantage of it.


IMHO it’d be generally better to enforce this at compile time because 
it does have an effect on whether the pipeline produces duplicates on 
failure. Additionally, we get the benefit of relatively intuitive 
behavior where if the user passes in the same Key value, it’ll update 
a record in ES, and if the key is different then it will create a new 
record.

Totally agree, id enforcement at compile time, no auto-generation


Curious to hear thoughts on this. If this seems reasonable I’ll go 
ahead and create a JIRA for tracking and start working on a PR for 
this. Also, if it’d be good to loop in the dev mailing list before 
starting let me know, I’m pretty new to this.

I'll create the ticket and we will loop on design in the comments.
Best
Etienne


Chet

On Nov 15, 2017, at 12:53 AM, Etienne Chauchot > wrote:


Hi Chet,

What you say is totally true, docs written using ElasticSearchIO 
will always have an ES generated id. But it might change in the 
future, indeed it might be a good thing to allow the user to pass an 
id. Just in 5 seconds thinking, I see 3 possible designs for that.


a.(simplest) use a json special field for the id, if it is provided 
by the user in the input json then it is used, auto-generated id 
otherwise.


b. (a bit less user friendly) PCollection with K as an id. But 
forces the user to do a Pardo before writing to ES to output KV 
pairs of 


c. (a lot more complex) Allow the IO to serialize/deserialize java 
beans and have an String id field. Matching java types to ES types 
is quite tricky, so, for now we just relied on the user to serialize 
his beans into json and let ES match the types automatically.


Related to the problems you raise bellow:

1. Well, the bundle is the commit entity of beam. Consider the case 
of ESIO.batchSize being < to bundle size. While processing records, 
when the number of elements reaches batchSize, an ES bulk insert 
will be issued but no finishBundle. If there is a problem later on 
in the bundle processing before the finishBundle, the checkpoint 
will still be at the beginning of the bundle, so all the bundle will 
be retried leading to duplicate documents. Thanks for raising that! 
I'm CCing the dev list so that someone could correct me on the 
checkpointing mecanism if I'm missing something. Besides I'm 
thinking about forcing the user to provide an id in all cases to 
workaround this issue.


2. Correct.

Best,
Etienne

Le 15/11/2017 à 02:16, Chet Aldrich a écrit :

Hello all!

So I’ve been using the ElasticSearchIO sink for a project 
(unfortunately it’s Elasticsearch 5.x, and so I’ve been messing 
around with the latest RC) and I’m finding that it doesn’t allow 
for changing the document ID, but only lets you pass in a record, 
which means that the document ID is auto-generated. See this line 
for what specifically is happening:



Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread NerdyNick
I'd add to the idea here with the A solution. What about also supporting a
user function to provide the ID given the record. I say this because I'm
starting to also look into how to get the ESIO writer to support dynamic
index based on information contained within the event. For which just
looking at a field would be very pollute when it comes to having fields in
ES just to support the writer.

On Thu, Nov 16, 2017 at 1:33 AM, Jean-Baptiste Onofré 
wrote:

> I think it's the most elegant approach: the user should be able to decide
> the id field he wants to use.
>
> Regards
> JB
>
> On 11/16/2017 09:24 AM, Etienne Chauchot wrote:
>
>> +1, that is what I had in mind, if I recall correctly this is what
>> es_hadoop connector does.
>>
>>
>> Le 15/11/2017 à 20:22, Tim Robertson a écrit :
>>
>>> Hi Chet,
>>>
>>> I'll be a user of this, so thank you.
>>>
>>> It seems reasonable although - did you consider letting folk name the
>>> document ID field explicitly?  It would avoid an unnecessary transformation
>>> and might be simpler:
>>>// instruct the writer to use a provided document ID
>>>ElasticsearchIO.write().withConnectionConfiguration(conn).
>>> withMaxBatchSize(BATCH_SIZE).withDocumentIdField("myID");
>>>
>>> On Wed, Nov 15, 2017 at 8:08 PM, Chet Aldrich <
>>> chet.aldr...@postmates.com > wrote:
>>>
>>> Given that this seems like a change that should probably happen, and
>>> I’d
>>> like to help contribute if possible, a few questions and my current
>>> opinion:
>>>
>>> So I’m leaning towards approach B here, which is:
>>>

 b. (a bit less user friendly) PCollection with K as an id. But
 forces
 the user to do a Pardo before writing to ES to output KV pairs of
 

 I think that the reduction in user-friendliness may be outweighed
>>> by the
>>> fact that this obviates some of the issues surrounding a failure when
>>> finishing a bundle. Additionally, this /forces/ the user to provide a
>>> document id, which I think is probably better practice. This will
>>> also
>>> probably lead to fewer frustrations around “magic” code that just
>>> pulls
>>> something in if it happens to be there, and doesn’t if not. We’ll
>>> need to
>>> rely on the user catching this functionality in the docs or the code
>>> itself to take advantage of it.
>>>
>>> IMHO it’d be generally better to enforce this at compile time
>>> because it
>>> does have an effect on whether the pipeline produces duplicates on
>>> failure. Additionally, we get the benefit of relatively intuitive
>>> behavior
>>> where if the user passes in the same Key value, it’ll update a
>>> record in
>>> ES, and if the key is different then it will create a new record.
>>>
>>> Curious to hear thoughts on this. If this seems reasonable I’ll go
>>> ahead
>>> and create a JIRA for tracking and start working on a PR for this.
>>> Also,
>>> if it’d be good to loop in the dev mailing list before starting let
>>> me
>>> know, I’m pretty new to this.
>>>
>>> Chet
>>>
>>> On Nov 15, 2017, at 12:53 AM, Etienne Chauchot > wrote:

 Hi Chet,

 What you say is totally true, docs written using ElasticSearchIO
 will
 always have an ES generated id. But it might change in the future,
 indeed
 it might be a good thing to allow the user to pass an id. Just in 5
 seconds thinking, I see 3 possible designs for that.

 a.(simplest) use a json special field for the id, if it is provided
 by
 the user in the input json then it is used, auto-generated id
 otherwise.

 b. (a bit less user friendly) PCollection with K as an id. But
 forces
 the user to do a Pardo before writing to ES to output KV pairs of
 

 c. (a lot more complex) Allow the IO to serialize/deserialize java
 beans
 and have an String id field. Matching java types to ES types is
 quite
 tricky, so, for now we just relied on the user to serialize his
 beans
 into json and let ES match the types automatically.

 Related to the problems you raise bellow:

 1. Well, the bundle is the commit entity of beam. Consider the case
 of
 ESIO.batchSize being < to bundle size. While processing records,
 when the
 number of elements reaches batchSize, an ES bulk insert will be
 issued
 but no finishBundle. If there is a problem later on in the bundle
 processing before the finishBundle, the checkpoint will still be at
 the
 beginning of the bundle, so all the bundle will be retried leading
 to
 duplicate documents. Thanks for raising that! I'm CCing the dev
 list so
 that someone could correct me on the checkpointing 

Re: [VOTE] Choose the "new" Spark runner

2017-11-16 Thread Neville Dipale
[ ] Use Spark 1 & Spark 2 Support Branch
[X] Use Spark 2 Only Branch

On 16 November 2017 at 15:08, Jean-Baptiste Onofré  wrote:

> Hi guys,
>
> To illustrate the current discussion about Spark versions support, you can
> take a look on:
>
> --
> Spark 1 & Spark 2 Support Branch
>
> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES
>
> This branch contains a Spark runner common module compatible with both
> Spark 1.x and 2.x. For convenience, we introduced spark1 & spark2
> modules/artifacts containing just a pom.xml to define the dependencies set.
>
> --
> Spark 2 Only Branch
>
> https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY
>
> This branch is an upgrade to Spark 2.x and "drop" support of Spark 1.x.
>
> As I'm ready to merge one of the other in the PR, I would like to complete
> the vote/discussion pretty soon.
>
> Correct me if I'm wrong, but it seems that the preference is to drop Spark
> 1.x to focus only on Spark 2.x (for the Spark 2 Only Branch).
>
> I would like to call a final vote to act the merge I will do:
>
> [ ] Use Spark 1 & Spark 2 Support Branch
> [ ] Use Spark 2 Only Branch
>
> This informal vote is open for 48 hours.
>
> Please, let me know what your preference is.
>
> Thanks !
> Regards
> JB
>
> On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
>
>> Hi Beamers,
>>
>> I'm forwarding this discussion & vote from the dev mailing list to the
>> user mailing list.
>> The goal is to have your feedback as user.
>>
>> Basically, we have two options:
>> 1. Right now, in the PR, we support both Spark 1.x and 2.x using three
>> artifacts (common, spark1, spark2). You, as users, pick up spark1 or spark2
>> in your dependencies set depending the Spark target version you want.
>> 2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0.
>> If you still want to use Spark 1.x, then, you will be stuck up to Beam
>> 2.2.0.
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB
>>
>>
>>  Forwarded Message 
>> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
>> Date: Wed, 8 Nov 2017 08:27:58 +0100
>> From: Jean-Baptiste Onofré 
>> Reply-To: d...@beam.apache.org
>> To: d...@beam.apache.org
>>
>> Hi all,
>>
>> as you might know, we are working on Spark 2.x support in the Spark
>> runner.
>>
>> I'm working on a PR about that:
>>
>> https://github.com/apache/beam/pull/3808
>>
>> Today, we have something working with both Spark 1.x and 2.x from a code
>> standpoint, but I have to deal with dependencies. It's the first step of
>> the update as I'm still using RDD, the second step would be to support
>> dataframe (but for that, I would need PCollection elements with schemas,
>> that's another topic on which Eugene, Reuven and I are discussing).
>>
>> However, as all major distributions now ship Spark 2.x, I don't think
>> it's required anymore to support Spark 1.x.
>>
>> If we agree, I will update and cleanup the PR to only support and focus
>> on Spark 2.x.
>>
>> So, that's why I'm calling for a vote:
>>
>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>[ ] 0 (I don't care ;))
>>[ ] -1, I would like to still support Spark 1.x, and so having support
>> of both Spark 1.x and 2.x (please provide specific comment)
>>
>> This vote is open for 48 hours (I have the commits ready, just waiting
>> the end of the vote to push on the PR).
>>
>> Thanks !
>> Regards
>> JB
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


[VOTE] Choose the "new" Spark runner

2017-11-16 Thread Jean-Baptiste Onofré

Hi guys,

To illustrate the current discussion about Spark versions support, you can take 
a look on:


--
Spark 1 & Spark 2 Support Branch

https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES

This branch contains a Spark runner common module compatible with both Spark 1.x 
and 2.x. For convenience, we introduced spark1 & spark2 modules/artifacts 
containing just a pom.xml to define the dependencies set.


--
Spark 2 Only Branch

https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY

This branch is an upgrade to Spark 2.x and "drop" support of Spark 1.x.

As I'm ready to merge one of the other in the PR, I would like to complete the 
vote/discussion pretty soon.


Correct me if I'm wrong, but it seems that the preference is to drop Spark 1.x 
to focus only on Spark 2.x (for the Spark 2 Only Branch).


I would like to call a final vote to act the merge I will do:

[ ] Use Spark 1 & Spark 2 Support Branch
[ ] Use Spark 2 Only Branch

This informal vote is open for 48 hours.

Please, let me know what your preference is.

Thanks !
Regards
JB

On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts 
(common, spark1, spark2). You, as users, pick up spark1 or spark2 in your 
dependencies set depending the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then, you will be stuck up to Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré 
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).


Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread Jean-Baptiste Onofré
I think it's the most elegant approach: the user should be able to decide the id 
field he wants to use.


Regards
JB

On 11/16/2017 09:24 AM, Etienne Chauchot wrote:
+1, that is what I had in mind, if I recall correctly this is what es_hadoop 
connector does.



Le 15/11/2017 à 20:22, Tim Robertson a écrit :

Hi Chet,

I'll be a user of this, so thank you.

It seems reasonable although - did you consider letting folk name the document 
ID field explicitly?  It would avoid an unnecessary transformation and might 
be simpler:

   // instruct the writer to use a provided document ID
   
ElasticsearchIO.write().withConnectionConfiguration(conn).withMaxBatchSize(BATCH_SIZE).withDocumentIdField("myID");

On Wed, Nov 15, 2017 at 8:08 PM, Chet Aldrich > wrote:


Given that this seems like a change that should probably happen, and I’d
like to help contribute if possible, a few questions and my current opinion:

So I’m leaning towards approach B here, which is:


b. (a bit less user friendly) PCollection with K as an id. But forces
the user to do a Pardo before writing to ES to output KV pairs of 


I think that the reduction in user-friendliness may be outweighed by the
fact that this obviates some of the issues surrounding a failure when
finishing a bundle. Additionally, this /forces/ the user to provide a
document id, which I think is probably better practice. This will also
probably lead to fewer frustrations around “magic” code that just pulls
something in if it happens to be there, and doesn’t if not. We’ll need to
rely on the user catching this functionality in the docs or the code
itself to take advantage of it.

IMHO it’d be generally better to enforce this at compile time because it
does have an effect on whether the pipeline produces duplicates on
failure. Additionally, we get the benefit of relatively intuitive behavior
where if the user passes in the same Key value, it’ll update a record in
ES, and if the key is different then it will create a new record.

Curious to hear thoughts on this. If this seems reasonable I’ll go ahead
and create a JIRA for tracking and start working on a PR for this. Also,
if it’d be good to loop in the dev mailing list before starting let me
know, I’m pretty new to this.

Chet


On Nov 15, 2017, at 12:53 AM, Etienne Chauchot > wrote:

Hi Chet,

What you say is totally true, docs written using ElasticSearchIO will
always have an ES generated id. But it might change in the future, indeed
it might be a good thing to allow the user to pass an id. Just in 5
seconds thinking, I see 3 possible designs for that.

a.(simplest) use a json special field for the id, if it is provided by
the user in the input json then it is used, auto-generated id otherwise.

b. (a bit less user friendly) PCollection with K as an id. But forces
the user to do a Pardo before writing to ES to output KV pairs of 

c. (a lot more complex) Allow the IO to serialize/deserialize java beans
and have an String id field. Matching java types to ES types is quite
tricky, so, for now we just relied on the user to serialize his beans
into json and let ES match the types automatically.

Related to the problems you raise bellow:

1. Well, the bundle is the commit entity of beam. Consider the case of
ESIO.batchSize being < to bundle size. While processing records, when the
number of elements reaches batchSize, an ES bulk insert will be issued
but no finishBundle. If there is a problem later on in the bundle
processing before the finishBundle, the checkpoint will still be at the
beginning of the bundle, so all the bundle will be retried leading to
duplicate documents. Thanks for raising that! I'm CCing the dev list so
that someone could correct me on the checkpointing mecanism if I'm
missing something. Besides I'm thinking about forcing the user to provide
an id in all cases to workaround this issue.

2. Correct.

Best,
Etienne

Le 15/11/2017 à 02:16, Chet Aldrich a écrit :

Hello all!

So I’ve been using the ElasticSearchIO sink for a project (unfortunately
it’s Elasticsearch 5.x, and so I’ve been messing around with the latest
RC) and I’m finding that it doesn’t allow for changing the document ID,
but only lets you pass in a record, which means that the document ID is
auto-generated. See this line for what specifically is happening:


https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838



Essentially the