[Call for items] ❄️ December Beam Newsletter

2018-12-07 Thread Rose Nguyen
Hi folks:

Time for the last newsletter of the year!

*Add to [1] the highlights from November to now (or planned events and
talks) that you want to share by 12/12 11:59 p.m. PST.*

We will collect the notes via Google Docs but send out the final version
directly to the user mailing list. If you do not know how to format
something, it is OK to just put down the info and I will edit. I'll ship
out the newsletter on 12/13.

[1]
https://docs.google.com/document/d/1q4KBkcLR7orr6n_QUHMpAVBKYzKYahlRZw_K_PaIuDo

Cheers,
-- 
Rose Thị Nguyễn


Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Tim
Great. Thanks for sharing, Evan.

Tim

> On 7 Dec 2018, at 20:06, Evan Galpin  wrote:
> 
> I've actually found that this was just a matter of pipeline processing speed. 
> I removed many layers of transforms such that entities flowed through the 
> pipeline faster, and saw the batch sizes increase. I think I may make a 
> separate pipeline to take full advantage of batch indexing.
> 
> Thanks!
> 
>> On 2018/12/07 14:36:44, Evan Galpin  wrote: 
>> I forgot to reiterate that the PCollection on which EsIO operates is of type 
>> String, where each element is a valid JSON document serialized _without_ 
>> pretty printing (i.e. without line breaks). If the PCollection should be of 
>> a different type, please let me know. From the EsIO source code, I believe 
>> it is correct to have a PCollection of String.
>> 
>>> On 2018/12/07 14:33:17, Evan Galpin  wrote: 
>>> Thanks for confirming that this is unexpected behaviour Tim; certainly the 
>>> EsIO code looks to handle bundling. For the record, I've also confirmed via 
>>> debugger that `flushBatch()` is not being triggered by large document size.
>>> 
>>> I'm sourcing records from Google's BigQuery.  I have 2 queries which each 
>>> create a PCollection. I use a JacksonFactory to convert BigQuery results to 
>>> a valid JSON string (confirmed valid via debugger + linter). I have a few 
>>> Transforms to group the records from the 2 queries together, and then 
>>> convert again to JSON string via Jackson. I do know that the system creates 
>>> valid requests to the bulk API, it's just that it's only 1 document per 
>>> request.
>>> 
>>> Thanks for starting the process with this. If there are other specific 
>>> details that I can provide to be helpful, please let me know. Here are the 
>>> versions of modules I'm using now:
>>> 
>>> Beam SDK (beam-sdks-java-core): 2.8.0
>>> EsIO (beam-sdks-java-io-elasticsearch): 2.8.0
>>> BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0
>>> DirectRunner (beam-runners-direct-java): 2.8.0
>>> DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0
>>> 
>>> 
>>>> On 2018/12/07 06:36:58, Tim  wrote: 
>>>> Hi Evan
>>>> 
>>>> That is definitely not the expected behaviour and I believe is covered in 
>>>> tests which use DirectRunner. Are you able to share your pipeline code, or 
>>>> describe how you source your records please? It could be that something 
>>>> else is causing EsIO to see bundles sized at only one record.
>>>> 
>>>> I’ll verify ES IO behaviour when I get to a computer too.
>>>> 
>>>> Tim (on phone)
>>>> 
>>>>> On 6 Dec 2018, at 22:00, e...@calabs.ca  wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I’m having a bit of trouble with ElasticsearchIO Write transform. I’m 
>>>>> able to successfully index documents into my elasticsearch cluster, but 
>>>>> batching does not seem to work. There ends up being a 1:1 ratio between 
>>>>> HTTP requests sent to `/my-index/_doc/_bulk` and the number of documents 
>>>>> in my PCollection to which I apply the ElasticsearchIO PTransform. I’ve 
>>>>> noticed this specifically under the DirectRunner by utilizing a debugger.
>>>>> 
>>>>> Am I missing something? Is this possibly a difference between execution 
>>>>> environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure 
>>>>> my program is taking advantage of batching/bulk indexing?
>>>>> 
>>>>> Thanks,
>>>>> Evan
>>>>
>>> 
>> 

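[Editor's note: for readers who land on this thread later, bulk sizing in EsIO
is also directly tunable. A minimal sketch, assuming Beam 2.8.0's
ElasticsearchIO API; the host, index, and type names are placeholders, and
jsonDocs stands for the PCollection of serialized documents:

  import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
  import org.apache.beam.sdk.values.PCollection;

  ElasticsearchIO.ConnectionConfiguration conn =
      ElasticsearchIO.ConnectionConfiguration.create(
          new String[] {"http://localhost:9200"}, "my-index", "_doc");

  // Flush a bulk request after at most 1000 documents or ~5 MB, whichever
  // comes first. EsIO also flushes at the end of every bundle, so small
  // bundles still produce small bulk requests, consistent with what Evan
  // observed under the DirectRunner.
  PCollection<String> jsonDocs = ...;
  jsonDocs.apply(
      ElasticsearchIO.write()
          .withConnectionConfiguration(conn)
          .withMaxBatchSize(1000L)
          .withMaxBatchSizeBytes(5L * 1024 * 1024));
]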

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
I've actually found that this was just a matter of pipeline processing speed. I 
removed many layers of transforms such that entities flowed through the 
pipeline faster, and saw the batch sizes increase. I think I may make a 
separate pipeline to take full advantage of batch indexing.

Thanks!

On 2018/12/07 14:36:44, Evan Galpin  wrote: 
> I forgot to reiterate that the PCollection on which EsIO operates is of type 
> String, where each element is a valid JSON document serialized _without_ 
> pretty printing (i.e. without line breaks). If the PCollection should be of a 
> different type, please let me know. From the EsIO source code, I believe it 
> is correct to have a PCollection of String.
> 
> On 2018/12/07 14:33:17, Evan Galpin  wrote: 
> > Thanks for confirming that this is unexpected behaviour Tim; certainly the 
> > EsIO code looks to handle bundling. For the record, I've also confirmed via 
> > debugger that `flushBatch()` is not being triggered by large document size.
> > 
> > I'm sourcing records from Google's BigQuery.  I have 2 queries which each 
> > create a PCollection. I use a JacksonFactory to convert BigQuery results to 
> > a valid JSON string (confirmed valid via debugger + linter). I have a few 
> > Transforms to group the records from the 2 queries together, and then 
> > convert again to JSON string via Jackson. I do know that the system creates 
> > valid requests to the bulk API, it's just that it's only 1 document per 
> > request.
> > 
> > Thanks for starting the process with this. If there are other specific 
> > details that I can provide to be helpful, please let me know. Here are the 
> > versions of modules I'm using now:
> > 
> > Beam SDK (beam-sdks-java-core): 2.8.0
> > EsIO (beam-sdks-java-io-elasticsearch): 2.8.0
> > BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0
> > DirectRunner (beam-runners-direct-java): 2.8.0
> > DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0
> > 
> > 
> > On 2018/12/07 06:36:58, Tim  wrote: 
> > > Hi Evan
> > > 
> > > That is definitely not the expected behaviour and I believe is covered in 
> > > tests which use DirectRunner. Are you able to share your pipeline code, 
> > > or describe how you source your records please? It could be that 
> > > something else is causing EsIO to see bundles sized at only one record.
> > > 
> > > I’ll verify ES IO behaviour when I get to a computer too.
> > > 
> > > Tim (on phone)
> > > 
> > > > On 6 Dec 2018, at 22:00, e...@calabs.ca  wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m 
> > > > able to successfully index documents into my elasticsearch cluster, but 
> > > > batching does not seem to work. There ends up being a 1:1 ratio between 
> > > > HTTP requests sent to `/my-index/_doc/_bulk` and the number of 
> > > > documents in my PCollection to which I apply the ElasticsearchIO 
> > > > PTransform. I’ve noticed this specifically under the DirectRunner by 
> > > > utilizing a debugger.
> > > > 
> > > > Am I missing something? Is this possibly a difference between execution 
> > > > environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure 
> > > > my program is taking advantage of batching/bulk indexing?
> > > > 
> > > > Thanks,
> > > > Evan
> > > 
> > 
> 


Re: Moving to spark 2.4

2018-12-07 Thread Tim Robertson
To clarify Ismaël's comment

Cloudera repo indicates Cloudera 6.1 will have Spark 2.4, but CDH is
currently still on 6.0.

https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/2.4.0-cdh6.1.0/

With the HWX / Cloudera merger, the release cycle is not announced, but 6.1
will likely not be there until early 2019. Also note that many folks will
still be on CDH 5, as CDH 6 is relatively new.

On Fri, Dec 7, 2018 at 5:06 PM Ismaël Mejía  wrote:

> It seems that Cloudera has it now. Not sure if it is worth waiting for
> Hortonworks; maybe it is worth waiting for EMR.
>
> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/
>
> One argument in favor of moving to Spark 2.4.0 is for the future-oriented
> (non-Hadoop) folks: Kubernetes support has improved a lot in this release.
>
> On Fri, Dec 7, 2018 at 4:56 PM David Morávek 
> wrote:
> >
> > +1 for waiting for HDP and CDH adoption
> >
> > Sent from my iPhone
> >
> > On 7 Dec 2018, at 16:38, Alexey Romanenko 
> wrote:
> >
> > I agree with Ismael and I’d wait until the new Spark version is
> supported by major Big Data distributors.
> >
> > On 7 Dec 2018, at 14:57, Vishwas Bm  wrote:
> >
> > Hi Ismael,
> >
> > We have upgraded Spark to 2.4.
> > In our setup we ran a few basic tests and found it to be pretty stable.
> >
> >
> > Thanks & Regards,
> > Vishwas
> >
> >
> > On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:
> >>
> >> Hello Vishwas,
> >>
> >> The Spark dependency in the Spark runner is provided, so you can
> >> already pass the Spark 2.4 dependencies and it should work out of
> >> the box.
> >>
> >> JB did a PR to upgrade the version of Spark in the runner, but maybe
> >> it is worth waiting a bit before merging it, at least until some of
> >> the Big Data distributions have Spark 2.4.x support available; so far
> >> nobody has shipped it (well, apart from Databricks).
> >>
> >> What do others think: should we move ahead, or are you aware of any
> >> issues introduced by version 2.4.0? (Notice that the PR just updates
> >> the version, so code compatibility should be OK.)
> >>
> >> Ismaël
> >>
> >> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré 
> wrote:
> >> >
> >> > Hi Vishwas
> >> >
> >> > Yes, I already started the update.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> > On 06/12/2018 07:39, Vishwas Bm wrote:
> >> > > Hi,
> >> > >
> >> > > Currently I see that the Spark version dependency used in Beam is
> >> > > "2.3.2".
> >> > > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
> >> > > dependency?
> >> > >
> >> > >
> >> > > Thanks & Regards,
> >> > > Vishwas
> >> > > Mob : 9164886653
> >> >
> >> > --
> >> > Jean-Baptiste Onofré
> >> > jbono...@apache.org
> >> > http://blog.nanthrax.net
> >> > Talend - http://www.talend.com
> >
> >
>
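
[Editor's note: a side note for anyone who wants to try Spark 2.4 before the
distributions ship it. Since the runner declares Spark as provided, the
application supplies its own Spark artifacts. A minimal sketch, assuming Maven
with the Beam 2.8.0 and Scala 2.11 artifact names; versions are illustrative:

  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-spark</artifactId>
    <version>2.8.0</version>
  </dependency>
  <!-- The runner only declares Spark as provided, so bring your own. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
]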


Re: Moving to spark 2.4

2018-12-07 Thread Ismaël Mejía
It seems that Cloudera has it now. Not sure if it is worth waiting for
Hortonworks; maybe it is worth waiting for EMR.
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/

One argument in favor of moving to Spark 2.4.0 is for the future-oriented
(non-Hadoop) folks: Kubernetes support has improved a lot in this release.

On Fri, Dec 7, 2018 at 4:56 PM David Morávek  wrote:
>
> +1 for waiting for HDP and CDH adoption
>
> Sent from my iPhone
>
> On 7 Dec 2018, at 16:38, Alexey Romanenko  wrote:
>
> I agree with Ismael and I’d wait until the new Spark version is
> supported by major Big Data distributors.
>
> On 7 Dec 2018, at 14:57, Vishwas Bm  wrote:
>
> Hi Ismael,
>
> We have upgraded Spark to 2.4.
> In our setup we ran a few basic tests and found it to be pretty stable.
>
>
> Thanks & Regards,
> Vishwas
>
>
> On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:
>>
>> Hello Vishwas,
>>
>> The Spark dependency in the Spark runner is provided, so you can
>> already pass the Spark 2.4 dependencies and it should work out of
>> the box.
>>
>> JB did a PR to upgrade the version of Spark in the runner, but maybe
>> it is worth waiting a bit before merging it, at least until some of
>> the Big Data distributions have Spark 2.4.x support available; so far
>> nobody has shipped it (well, apart from Databricks).
>>
>> What do others think: should we move ahead, or are you aware of any
>> issues introduced by version 2.4.0? (Notice that the PR just updates
>> the version, so code compatibility should be OK.)
>>
>> Ismaël
>>
>> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  
>> wrote:
>> >
>> > Hi Vishwas
>> >
>> > Yes, I already started the update.
>> >
>> > Regards
>> > JB
>> >
>> > On 06/12/2018 07:39, Vishwas Bm wrote:
>> > > Hi,
>> > >
>> > > Currently I see that the Spark version dependency used in Beam is
>> > > "2.3.2".
>> > > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
>> > > dependency?
>> > >
>> > >
>> > > Thanks & Regards,
>> > > Vishwas
>> > > Mob : 9164886653
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>
>


Re: Moving to spark 2.4

2018-12-07 Thread David Morávek
+1 for waiting for HDP and CDH adoption

Sent from my iPhone

> On 7 Dec 2018, at 16:38, Alexey Romanenko  wrote:
> 
> I agree with Ismael and I’d wait until the new Spark version is
> supported by major Big Data distributors.
> 
>> On 7 Dec 2018, at 14:57, Vishwas Bm  wrote:
>> 
>> Hi Ismael,
>> 
>> We have upgraded Spark to 2.4.
>> In our setup we ran a few basic tests and found it to be pretty stable.
>> 
>> 
>> Thanks & Regards,
>> Vishwas 
>> 
>> 
>>> On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:
>>> Hello Vishwas,
>>> 
>>> The Spark dependency in the Spark runner is provided, so you can
>>> already pass the Spark 2.4 dependencies and it should work out of
>>> the box.
>>> 
>>> JB did a PR to upgrade the version of Spark in the runner, but maybe
>>> it is worth waiting a bit before merging it, at least until some of
>>> the Big Data distributions have Spark 2.4.x support available; so far
>>> nobody has shipped it (well, apart from Databricks).
>>> 
>>> What do others think: should we move ahead, or are you aware of any
>>> issues introduced by version 2.4.0? (Notice that the PR just updates
>>> the version, so code compatibility should be OK.)
>>> 
>>> Ismaël
>>> 
>>> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  
>>> wrote:
>>> >
>>> > Hi Vishwas
>>> >
>>> > Yes, I already started the update.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 06/12/2018 07:39, Vishwas Bm wrote:
>>> > > Hi,
>>> > >
>>> > > Currently I see that the Spark version dependency used in Beam is
>>> > > "2.3.2".
>>> > > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
>>> > > dependency?
>>> > >
>>> > >
>>> > > Thanks & Regards,
>>> > > Vishwas
>>> > > Mob : 9164886653
>>> >
>>> > --
>>> > Jean-Baptiste Onofré
>>> > jbono...@apache.org
>>> > http://blog.nanthrax.net
>>> > Talend - http://www.talend.com
> 


Re: Moving to spark 2.4

2018-12-07 Thread Alexey Romanenko
I agree with Ismael and I’d wait until the new Spark version is supported
by major Big Data distributors.

> On 7 Dec 2018, at 14:57, Vishwas Bm  wrote:
> 
> Hi Ismael,
> 
> We have upgraded Spark to 2.4.
> In our setup we ran a few basic tests and found it to be pretty stable.
> 
> 
> Thanks & Regards,
> Vishwas 
> 
> 
> On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:
> Hello Vishwas,
> 
> The Spark dependency in the Spark runner is provided, so you can
> already pass the Spark 2.4 dependencies and it should work out of
> the box.
> 
> JB did a PR to upgrade the version of Spark in the runner, but maybe
> it is worth waiting a bit before merging it, at least until some of
> the Big Data distributions have Spark 2.4.x support available; so far
> nobody has shipped it (well, apart from Databricks).
> 
> What do others think: should we move ahead, or are you aware of any
> issues introduced by version 2.4.0? (Notice that the PR just updates
> the version, so code compatibility should be OK.)
> 
> Ismaël
> 
> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  wrote:
> >
> > Hi Vishwas
> >
> > Yes, I already started the update.
> >
> > Regards
> > JB
> >
> > On 06/12/2018 07:39, Vishwas Bm wrote:
> > > Hi,
> > >
> > > Currently I see that the Spark version dependency used in Beam is
> > > "2.3.2".
> > > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
> > > dependency?
> > >
> > >
> > > Thanks & Regards,
> > > Vishwas
> > > Mob : 9164886653
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org 
> > http://blog.nanthrax.net 
> > Talend - http://www.talend.com 



Problem updating pipeline on Dataflow

2018-12-07 Thread Leonardo Miguel
Hi guys,

I'd like to check if some of you have experienced the following issue:

We have a pipeline with several branches, like:

> PCollection inputData = p.apply(KafkaIO.read())
> .apply(Reshuffle.viaRandomKey())
> /* some transform */
>
> /* branch1 */
> inputData.apply(Filter.by( ... ))
> /* some transform */
> .apply("Write1", BigQueryIO.write())
>
> /* branch2 */
> inputData.apply(Filter.by( ... ))
> /* some transform */
> .apply("Write2", BigQueryIO.write())
>
> /* branch3 */
> inputData.apply(Filter.by( ... ))
> /* some transform */
> .apply("Write3", BigQueryIO.write())
>
> ...


We need to remove some of the branches, so the result would be something
like:

> PCollection inputData = p.apply(KafkaIO.read())
> .apply(Reshuffle.viaRandomKey())
> /* some transform */
>
> /* branch1 */
> inputData.apply(Filter.by( ... ))
> /* some transform */
> .apply("Write1", BigQueryIO.write())
>
> /* branch3 */
> inputData.apply(Filter.by( ... ))
> /* some transform */
> .apply("Write3", BigQueryIO.write())
>
> ...


But when doing so and using Dataflow's update feature to update production
pipeline, we get the following error message:

> Workflow failed. Causes: The new job is not compatible with <job id>. The
> original job has not been aborted. The stage that begins with step
> Reshuffle.ViaRandomKey/Reshuffle/GroupByKey no longer produces data to the
> steps Write3/StreamingInserts/StreamingWriteTables/Reshuffle/GroupByKey. If
> these steps have been renamed or deleted, please specify them with the
> update command.


This error doesn't make sense, since step Write3 is still receiving data
from Reshuffle.

I tried using the option transformNameMapping to inform Dataflow that some
steps were deleted. I also tried passing as an argument that Write3 is still
called Write3, but the error persists.

Would you have any ideas of what may be happening?

I noticed that some steps don't have unique names (so the runner assigns
them), like the filter steps in the snippet above, but none of them are
directly related to the problem I mentioned; at least they are not mentioned
in the error.
I tried using transformNameMapping to map these names too, without success.

Thanks in advance.
-- 
[]s

Leonardo Alves Miguel
Data Engineer
(16) 3509-5515 | www.arquivei.com.br


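[Editor's note: for reference, Dataflow's update flow takes a JSON map from
old step names to new ones, and my understanding of the update documentation
is that a transform removed from the pipeline is mapped to an empty string
(worth verifying against the current docs). A sketch using the names from
this thread; the job name is a placeholder:

  --update \
  --jobName=my-streaming-job \
  --transformNameMapping='{"Write2": ""}'

One thing worth checking in Leonardo's case: runner-assigned names for
unnamed transforms (like the filter steps above) can shift when a branch is
removed, which can make an update look incompatible even though the surviving
steps are unchanged. Giving every transform an explicit, stable name via
.apply("Name", ...) avoids that.]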


Re: Beam Metrics using FlinkRunner

2018-12-07 Thread Etienne Chauchot
Hi Phil,
MetricsPusher is tested on all the runners in both batch and streaming mode. I 
just ran this test in Flink in streaming
mode and it works.
What is the command line you are using and which version of Beam?

Please also remember that, as discussed, metrics (like other Flink features) do
not work if Flink is used in detached mode.

Etienne


Le mardi 04 décembre 2018 à 12:49 -0600, Phil Franklin a écrit :
> I’m having difficulty accessing Beam metrics when using FlinkRunner in 
> streaming mode. I don’t get any metrics from MetricsPusher, though the same 
> setup delivered metrics from SparkRunner.  Probably for the same reason that 
> MetricsPusher doesn’t work, I also don’t get any output when I call an 
> instance of MetricsHttpSink directly.  The problem seems to be that Flink 
> never returns from pipeline.run(), an issue that others have referred to as 
> FlinkRunner hanging.  
> 
> Is there a solution for getting metrics in this case that I’m missing?
> 
> Thanks!
> -Phil

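[Editor's note: for completeness, a minimal sketch of declaring and querying a
Beam metric with the standard metrics API; the namespace and metric names are
illustrative. Note that the query path needs the PipelineResult, which is
exactly what a hanging pipeline.run() or detached mode takes away:

  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.metrics.Counter;
  import org.apache.beam.sdk.metrics.MetricNameFilter;
  import org.apache.beam.sdk.metrics.MetricQueryResults;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.metrics.MetricsFilter;
  import org.apache.beam.sdk.transforms.DoFn;

  class CountingFn extends DoFn<String, String> {
    // Counters are reported through the runner and are what MetricsPusher pushes.
    private final Counter processed = Metrics.counter("my-namespace", "processed");

    @ProcessElement
    public void processElement(ProcessContext c) {
      processed.inc();
      c.output(c.element());
    }
  }

  // After the pipeline finishes (not reachable in detached mode):
  PipelineResult result = pipeline.run();
  result.waitUntilFinish();
  MetricQueryResults metrics =
      result.metrics().queryMetrics(
          MetricsFilter.builder()
              .addNameFilter(MetricNameFilter.named("my-namespace", "processed"))
              .build());
]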

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
I forgot to reiterate that the PCollection on which EsIO operates is of type 
String, where each element is a valid JSON document serialized _without_ pretty 
printing (i.e. without line breaks). If the PCollection should be of a 
different type, please let me know. From the EsIO source code, I believe it is 
correct to have a PCollection of String.

On 2018/12/07 14:33:17, Evan Galpin  wrote: 
> Thanks for confirming that this is unexpected behaviour Tim; certainly the 
> EsIO code looks to handle bundling. For the record, I've also confirmed via 
> debugger that `flushBatch()` is not being triggered by large document size.
> 
> I'm sourcing records from Google's BigQuery.  I have 2 queries which each 
> create a PCollection. I use a JacksonFactory to convert BigQuery results to a 
> valid JSON string (confirmed valid via debugger + linter). I have a few 
> Transforms to group the records from the 2 queries together, and then convert 
> again to JSON string via Jackson. I do know that the system creates valid 
> requests to the bulk API, it's just that it's only 1 document per request.
> 
> Thanks for starting the process with this. If there are other specific 
> details that I can provide to be helpful, please let me know. Here are the 
> versions of modules I'm using now:
> 
> Beam SDK (beam-sdks-java-core): 2.8.0
> EsIO (beam-sdks-java-io-elasticsearch): 2.8.0
> BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0
> DirectRunner (beam-runners-direct-java): 2.8.0
> DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0
> 
> 
> On 2018/12/07 06:36:58, Tim  wrote: 
> > Hi Evan
> > 
> > That is definitely not the expected behaviour and I believe is covered in 
> > tests which use DirectRunner. Are you able to share your pipeline code, or 
> > describe how you source your records please? It could be that something 
> > else is causing EsIO to see bundles sized at only one record.
> > 
> > I’ll verify ES IO behaviour when I get to a computer too.
> > 
> > Tim (on phone)
> > 
> > > On 6 Dec 2018, at 22:00, e...@calabs.ca  wrote:
> > > 
> > > Hi all,
> > > 
> > > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m 
> > > able to successfully index documents into my elasticsearch cluster, but 
> > > batching does not seem to work. There ends up being a 1:1 ratio between 
> > > HTTP requests sent to `/my-index/_doc/_bulk` and the number of documents 
> > > in my PCollection to which I apply the ElasticsearchIO PTransform. I’ve 
> > > noticed this specifically under the DirectRunner by utilizing a debugger.
> > > 
> > > Am I missing something? Is this possibly a difference between execution 
> > > environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure 
> > > my program is taking advantage of batching/bulk indexing?
> > > 
> > > Thanks,
> > > Evan
> > 
> 

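[Editor's note: on the serialization detail above, Jackson's ObjectMapper
emits compact, single-line JSON by default, so no extra handling is needed
unless pretty printing was explicitly enabled. A small sketch with the
standard Jackson API; record is a placeholder object:

  import com.fasterxml.jackson.databind.ObjectMapper;

  ObjectMapper mapper = new ObjectMapper();
  // Compact output (no line breaks) unless SerializationFeature.INDENT_OUTPUT
  // is enabled; throws JsonProcessingException on failure.
  String json = mapper.writeValueAsString(record);
]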

Re: ElasticsearchIO Write Batching Problems

2018-12-07 Thread Evan Galpin
Thanks for confirming that this is unexpected behaviour, Tim; certainly the EsIO 
code looks to handle bundling. For the record, I've also confirmed via debugger 
that `flushBatch()` is not being triggered by large document size.

I'm sourcing records from Google's BigQuery.  I have 2 queries which each 
create a PCollection. I use a JacksonFactory to convert BigQuery results to a 
valid JSON string (confirmed valid via debugger + linter). I have a few 
Transforms to group the records from the 2 queries together, and then convert 
again to JSON string via Jackson. I do know that the system creates valid 
requests to the bulk API; it's just that it's only 1 document per request.

Thanks for starting the process with this. If there are other specific details 
that I can provide to be helpful, please let me know. Here are the versions of 
modules I'm using now:

Beam SDK (beam-sdks-java-core): 2.8.0
EsIO (beam-sdks-java-io-elasticsearch): 2.8.0
BigQuery IO (beam-sdks-java-io-google-cloud-platform): 2.8.0
DirectRunner (beam-runners-direct-java): 2.8.0
DataflowRunner (beam-runners-google-cloud-dataflow-java): 2.8.0


On 2018/12/07 06:36:58, Tim  wrote: 
> Hi Evan
> 
> That is definitely not the expected behaviour and I believe is covered in 
> tests which use DirectRunner. Are you able to share your pipeline code, or 
> describe how you source your records please? It could be that something else 
> is causing EsIO to see bundles sized at only one record.
> 
> I’ll verify ES IO behaviour when I get to a computer too.
> 
> Tim (on phone)
> 
> > On 6 Dec 2018, at 22:00, e...@calabs.ca  wrote:
> > 
> > Hi all,
> > 
> > I’m having a bit of trouble with ElasticsearchIO Write transform. I’m able 
> > to successfully index documents into my elasticsearch cluster, but batching 
> > does not seem to work. There ends up being a 1:1 ratio between HTTP 
> > requests sent to `/my-index/_doc/_bulk` and the number of documents in my 
> > PCollection to which I apply the ElasticsearchIO PTransform. I’ve noticed 
> > this specifically under the DirectRunner by utilizing a debugger.
> > 
> > Am I missing something? Is this possibly a difference between execution 
> > environments (Ex. DirectRunner Vs. DataflowRunner)? How can I make sure my 
> > program is taking advantage of batching/bulk indexing?
> > 
> > Thanks,
> > Evan
> 


Re: Moving to spark 2.4

2018-12-07 Thread Vishwas Bm
Hi Ismael,

We have upgraded Spark to 2.4.
In our setup we ran a few basic tests and found it to be pretty stable.


Thanks & Regards,

Vishwas


On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:

> Hello Vishwas,
>
> The Spark dependency in the Spark runner is provided, so you can
> already pass the Spark 2.4 dependencies and it should work out of
> the box.
>
> JB did a PR to upgrade the version of Spark in the runner, but maybe
> it is worth waiting a bit before merging it, at least until some of
> the Big Data distributions have Spark 2.4.x support available; so far
> nobody has shipped it (well, apart from Databricks).
>
> What do others think: should we move ahead, or are you aware of any
> issues introduced by version 2.4.0? (Notice that the PR just updates
> the version, so code compatibility should be OK.)
>
> Ismaël
>
> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi Vishwas
> >
> > Yes, I already started the update.
> >
> > Regards
> > JB
> >
> > On 06/12/2018 07:39, Vishwas Bm wrote:
> > > Hi,
> > >
> > > Currently I see that the Spark version dependency used in Beam is
> > > "2.3.2".
> > > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
> > > dependency?
> > >
> > >
> > > Thanks & Regards,
> > > Vishwas
> > > Mob : 9164886653
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>


Re: Latin America Community

2018-12-07 Thread Leonardo Miguel
Sure thing, Ismael.

I know of some users from Sao Paulo, but there are few who use Beam in
production.
I'll keep looking for them to create a list and start doing some meetups if
possible.

On Fri, Dec 7, 2018 at 07:35, Ismaël Mejía 
wrote:

> Hello,
>
> It is a great idea to try to grow the community in the region. Notice that
> there are already multiple Latino members in the dev community (e.g. Pablo,
> Gris, and me). However, no Brazilians so far, so I'm glad that you want to
> be part.
>
> I suppose that, given Sao Paulo's size, it is probably the 'easiest' place
> to find people interested. Wondering if there are others on the mailing
> list.
> In any case don't hesitate to contact us for questions / support.
>
> Regards,
> Ismaël
>
>
>
>
>
> On Tue, Dec 4, 2018 at 3:42 PM Eryx  wrote:
>
>> Hi Leonardo,
>>
>> I'm Héctor Eryx from Guadalajara, México. I'm currently using Beam for
>> personal projects, plus giving some training/mentoring on how to use it
>> to local communities.
>>
>> Also, I'm in touch with some friends at IBM Mexico who are using Beam to
>> run data storage events analysis.
>>
>> We are few, but we are strong 💪, hehe.
>>
>> Kind regards,
>> Héctor Eryx Paredes Camacho
>>
>> On Tue, Dec 4, 2018 at 6:20 a.m., Leonardo Miguel <
>> leonardo.mig...@arquivei.com.br> wrote:
>>
>>> Hi guys,
>>>
>>> Just want to check if there is someone using Beam and/or Scio at this
>>> side of the globe.
>>> I'd like to know also if there is any event near or some related
>>> community.
>>> If you are using Beam and/or Scio, please let me know.
>>>
>>> Let me start first:
>>> I'm located at Sao Carlos, Sao Paulo, Brazil.
>>> We use Beam and Scio running on Google Dataflow to serve data products
>>> (streaming and batch) over fiscal documents.
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> []s
>>>
>>> Leonardo Alves Miguel
>>> Data Engineer
>>> (16) 3509-5515 | www.arquivei.com.br
>>>
>>

-- 
[]s

Leonardo Alves Miguel
Data Engineer
(16) 3509-5515 | www.arquivei.com.br



Re: Beam Metrics using FlinkRunner

2018-12-07 Thread Kaymak, Tobias
I am using the Flink Prometheus connector and I get metrics in Prometheus
for my running pipelines (I am not looking at metrics in the Flink
dashboard directly).
Have you tried that? (Talk with background and example:
https://github.com/mbode/flink-prometheus-example)

On Tue, Dec 4, 2018 at 7:49 PM Phil Franklin 
wrote:

> I’m having difficulty accessing Beam metrics when using FlinkRunner in
> streaming mode. I don’t get any metrics from MetricsPusher, though the same
> setup delivered metrics from SparkRunner.  Probably for the same reason
> that MetricsPusher doesn’t work, I also don’t get any output when I call an
> instance of MetricsHttpSink directly.  The problem seems to be that Flink
> never returns from pipeline.run(), an issue that others have referred to as
> FlinkRunner hanging.
>
> Is there a solution for getting metrics in this case that I’m missing?
>
> Thanks!
> -Phil



-- 
Tobias Kaymak
Data Engineer

tobias.kay...@ricardo.ch
www.ricardo.ch

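[Editor's note: for anyone reproducing Tobias's setup, the Flink side is
configured in flink-conf.yaml. A sketch, assuming the flink-metrics-prometheus
jar has been placed in Flink's lib/ directory; the port range is illustrative:

  metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
  # Each JobManager/TaskManager exposes /metrics on a free port from this range.
  metrics.reporter.prom.port: 9250-9260

Prometheus then scrapes those endpoints, and metrics from the Beam job surface
through Flink's metric system.]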

Re: Latin America Community

2018-12-07 Thread Ismaël Mejía
Hello,

It is a great idea to try to grow the community in the region. Notice that
there are already multiple Latino members in the dev community (e.g. Pablo,
Gris, and me). However, no Brazilians so far, so I'm glad that you want to
be part.

I suppose that, given Sao Paulo's size, it is probably the 'easiest' place to
find people interested. Wondering if there are others on the mailing list.
In any case, don't hesitate to contact us for questions / support.

Regards,
Ismaël


On Tue, Dec 4, 2018 at 3:42 PM Eryx  wrote:

> Hi Leonardo,
>
> I'm Héctor Eryx from Guadalajara, México. I'm currently using Beam for
> personal projects, plus giving some training/mentoring on how to use it
> to local communities.
>
> Also, I'm in touch with some friends at IBM Mexico who are using Beam to
> run data storage events analysis.
>
> We are few, but we are strong 💪, hehe.
>
> Kind regards,
> Héctor Eryx Paredes Camacho
>
> On Tue, Dec 4, 2018 at 6:20 a.m., Leonardo Miguel <
> leonardo.mig...@arquivei.com.br> wrote:
>
>> Hi guys,
>>
>> Just want to check if there is someone using Beam and/or Scio at this
>> side of the globe.
>> I'd like to know also if there is any event near or some related
>> community.
>> If you are using Beam and/or Scio, please let me know.
>>
>> Let me start first:
>> I'm located at Sao Carlos, Sao Paulo, Brazil.
>> We use Beam and Scio running on Google Dataflow to serve data products
>> (streaming and batch) over fiscal documents.
>>
>> Thanks!
>>
>>
>> --
>> []s
>>
>> Leonardo Alves Miguel
>> Data Engineer
>> (16) 3509-5515 | www.arquivei.com.br
>>
>


Re: Moving to spark 2.4

2018-12-07 Thread Ismaël Mejía
Hello Vishwas,

The Spark dependency in the Spark runner is provided, so you can
already pass the Spark 2.4 dependencies and it should work out of
the box.

JB did a PR to upgrade the version of Spark in the runner, but maybe
it is worth waiting a bit before merging it, at least until some of
the Big Data distributions have Spark 2.4.x support available; so far
nobody has shipped it (well, apart from Databricks).

What do others think: should we move ahead, or are you aware of any
issues introduced by version 2.4.0? (Notice that the PR just updates
the version, so code compatibility should be OK.)

Ismaël

On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  wrote:
>
> Hi Vishwas
>
> Yes, I already started the update.
>
> Regards
> JB
>
> On 06/12/2018 07:39, Vishwas Bm wrote:
> > Hi,
> >
> > Currently I see that the Spark version dependency used in Beam is
> > "2.3.2".
> > As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
> > dependency?
> >
> >
> > Thanks & Regards,
> > Vishwas
> > Mob : 9164886653
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com