Re: Create IO connector for HTTP or ParDO

2023-06-23 Thread Jean-Baptiste Onofré
Hi,

A while ago (at the very early stage of Beam :)), I proposed to create an
HTTP/REST source/sink (we should still have the Jira :)).
However, we didn't reach a consensus on the features (I proposed
something very simple). Splittable DoFn didn't exist at that time.

So, if we want to move forward on HTTP/REST, we have to list the
features and expected behavior.

Regards
JB

On Sat, Jun 24, 2023 at 1:47 AM Juan Romero  wrote:
>
> Hi guys. I have a doubt about whether it makes sense to create an HTTP 
> connector in Apache Beam, or whether I can simply create a ParDo function that 
> makes the HTTP request. I want to know what advantages I would have in creating 
> an HTTP IO connector.
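
For context, the ParDo approach Juan mentions can be as small as a single
DoFn. A minimal sketch, assuming the plain JDK HTTP classes (names are
illustrative; retries, timeouts, batching and error handling are exactly the
concerns an HTTP IO connector would standardize):

  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.nio.charset.StandardCharsets;
  import org.apache.beam.sdk.transforms.DoFn;

  // Each input element is a URL; the output is the response body.
  public class HttpRequestFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(c.element()).openConnection();
      conn.setConnectTimeout(5_000);
      conn.setReadTimeout(10_000);
      try (InputStream in = conn.getInputStream()) {
        c.output(new String(in.readAllBytes(), StandardCharsets.UTF_8));
      } finally {
        conn.disconnect();
      }
    }
  }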


Re: [PROPOSAL] Stop Spark 2 support in Spark Runner

2022-04-29 Thread Jean-Baptiste Onofré
+1, it makes sense to me. Users wanting an "old" Spark version can use
previous Beam releases.

Regards
JB

On Fri, Apr 29, 2022 at 12:39 PM Alexey Romanenko wrote:
>
> Any objections or comments from Spark 2 users on this topic?
>
> —
> Alexey
>
>
> On 20 Apr 2022, at 19:17, Alexey Romanenko  wrote:
>
> Hi everyone,
>
> A while ago, we already discussed on dev@ that there are several reasons to 
> stop providing support for Spark 2 in the Spark Runner (in all the variants that 
> we have for now - RDD, Dataset, Portable) [1]. In short, it adds a maintenance 
> burden to the Spark runner that we would like to avoid in the future.
>
> From the devs' perspective, I don't see any objections to this. So, I'd like 
> to know if there are users that still use Spark 2 for their Beam pipelines 
> and for whom it would be critical to keep using it.
>
> Please share your opinion on this!
>
> —
> Alexey
>
> [1] https://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco
>
> > On 31 Mar 2022, at 17:51, Alexey Romanenko  wrote:
> >
> > Hi everyone,
> >
> > For the moment, the Beam Spark Runner supports two versions of Spark - 2.x and 
> > 3.x.
> >
> > Taking into account several things:
> > - almost all cloud providers have already moved to Spark 3.x as the main 
> > supported version;
> > - the latest Spark 2.x release (Spark 2.4.8, a maintenance release) was done 
> > almost a year ago;
> > - Spark 3 is considered the mainstream Spark version for development and 
> > bug fixing;
> > - it is better to avoid the maintenance burden (there are some 
> > incompatibilities between Spark 2 and 3) of two versions;
> >
> > I'd suggest stopping support for Spark 2 in the Spark Runner in one of the 
> > next Beam releases.
> >
> > What are your thoughts on this? Are there any principal objections or 
> > reasons for not doing this that I probably missed?
> >
> > —
> > Alexey
> >
> >


Re: Beam AmqpIO

2019-09-18 Thread Jean-Baptiste Onofré
Hi,

As the original author of AmqpIO, I can update it, but that requires some
internal changes to the IO (especially the way we are dealing with
checkpoints).

I will create a Jira and work on an update.

Regards
JB

On 18/09/2019 12:50, Alexey Romanenko wrote:
> It seems that we use a pretty outdated version of proton-j; the current version is 
> 0.33.2. 
> 
> At the same time, the Messenger API was deprecated and removed a while ago. So, 
> updating to the new version won't be so easy. What could be an alternative for 
> this? 
> 
>> On 18 Sep 2019, at 10:47, Jean-Baptiste Onofré  wrote:
>>
>> Hi,
>>
>> It's provided by Apache QPid proton-j: org.apache.qpid:proton-j:0.13.1
>>
>> Regards
>> JB
>>
>> On 18/09/2019 10:34, kasper wrote:
>>> Hello,
>>>
>>> AmqpIO seems to have a dependency on the
>>> org.apache.qpid.proton.messenger package at runtime, but I do not seem to
>>> find that package anywhere in a public Nexus repository. The net effect
>>> is a NoClassDefFoundError concerning
>>> org.apache.qpid.proton.messenger.Messenger$Factory. Any hints?
>>>
>>> Kasper
>>>
>>
>> -- 
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam AmqpIO

2019-09-18 Thread Jean-Baptiste Onofré
Hi,

It's provided by Apache QPid proton-j: org.apache.qpid:proton-j:0.13.1

Regards
JB

On 18/09/2019 10:34, kasper wrote:
> Hello,
> 
> AmqpIO seems to have a dependency on the
> org.apache.qpid.proton.messenger package at runtime, but I do not seem to
> find that package anywhere in a public Nexus repository. The net effect
> is a NoClassDefFoundError concerning
> org.apache.qpid.proton.messenger.Messenger$Factory. Any hints?
> 
> Kasper
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Using RedisIO for hashes and selecting a database

2019-08-25 Thread Jean-Baptiste Onofré
Hi,

RedisIO currently doesn't support an option to select a database.

Can you please create a Jira? I will do the improvement.

Thanks,
Regards
JB

On 25/08/2019 03:43, Steve973 wrote:
> I am new to Beam and I am familiarizing myself with various aspects of
> this cool framework.  I need to be able to get key/value pairs from a
> Redis hash, and not just the key/value store in the default database
> (database 0).  I have only seen the ability to get key/value pairs that
> have been set in Redis, but the source code does not make it seem like I
> can look up keys/values in a hash, and I have not seen a way to select a
> database number.  Have I missed something, or is this feature not (yet)
> implemented?  If I want to use this, am I going to have to extend the
> RedisIO class?  Thanks in advance!

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Query about JdbcIO.readRows()

2019-08-02 Thread Jean-Baptiste Onofré
Agree. I will fix that.

Regards
JB

On 2 Aug 2019 at 17:15, Vishwas Bm wrote:
>Hi Kishor,
>
>+ dev (d...@beam.apache.org)
>
>This looks like a bug. The attribute statementPreparator is nullable.
>It should have been handled in the same way as in the expand method of
>the Read class.
>
>
>Thanks & Regards,
>
>Vishwas
>
>
>On Fri, Aug 2, 2019 at 2:48 PM Kishor Joshi  wrote:
>
>> Hi,
>>
>> I am using the just-released 2.14 version of JdbcIO with the newly
>> added "readRows" functionality.
>>
>> I want to read table data with a query without parameters (select *
>> from table_name).
>> As per my understanding, this should not require a
>> "StatementPreparator".
>> However, if I use the newly added "readRows" function, I get an
>> exception that seems to force me to use the "StatementPreparator".
>> Stacktrace below.
>>
>> java.lang.IllegalArgumentException: statementPreparator can not be null
>> at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
>> at org.apache.beam.sdk.io.jdbc.JdbcIO$Read.withStatementPreparator(JdbcIO.java:600)
>> at org.apache.beam.sdk.io.jdbc.JdbcIO$ReadRows.expand(JdbcIO.java:499)
>> at org.apache.beam.sdk.io.jdbc.JdbcIO$ReadRows.expand(JdbcIO.java:410)
>> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
>> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:471)
>> at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
>> at com.nokia.csf.dfle.transforms.DfleRdbmsSource.expand(DfleRdbmsSource.java:34)
>> at com.nokia.csf.dfle.transforms.DfleRdbmsSource.expand(DfleRdbmsSource.java:10)
>> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
>> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
>> at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:56)
>> at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:182)
>> at com.nokia.csf.dfle.dsl.DFLEBeamMain.dagWireUp(DFLEBeamMain.java:49)
>> at com.nokia.csf.dfle.dsl.DFLEBeamMain.main(DFLEBeamMain.java:120)
>>
>>
>>
>> The test added in JdbcIOTest.java for this functionality only tests
>> queries with parameters.
>> Is this new function supported only in the above case and not for
>> normal "withQuery" (without parameters)?
>>
>>
>> Thanks & regards,
>> Kishor
>>
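
Until the bug is fixed, a hedged workaround is to fall back to the plain
JdbcIO.read() transform, which accepts a parameterless query without a
StatementPreparator. A sketch only - the driver, URL, columns and coders are
placeholders:

  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarIntCoder;
  import org.apache.beam.sdk.io.jdbc.JdbcIO;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;

  PCollection<KV<Integer, String>> rows =
      pipeline.apply(
          JdbcIO.<KV<Integer, String>>read()
              .withDataSourceConfiguration(
                  JdbcIO.DataSourceConfiguration.create(
                      "org.postgresql.Driver", "jdbc:postgresql://host:5432/db"))
              .withQuery("select id, name from table_name")
              // One RowMapper call per ResultSet row; no prepared parameters needed.
              .withRowMapper(rs -> KV.of(rs.getInt("id"), rs.getString("name")))
              .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of())));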


Re: a fix to send RabbitMq messages

2019-05-24 Thread Jean-Baptiste Onofré
Hi,

You can create a pull request; I will do the review.

The coder is set on the RabbitMqIO PTransform, so it should work.

AFAIR, we have a Jira about that and I already started to check. Not yet
completed.

Regards
JB

On 24/05/2019 11:01, Nicolas Delsaux wrote:
> Hi all
> 
> I'm currently evaluating Apache Beam to transfer messages from RabbitMq
> to Kafka with some transforms in between.
> 
> Doing so, I've discovered some differences between the direct runner's
> behaviour and the Google Dataflow runner's.
> 
> But first, a small introduction to what I know.
> 
> From what I understand, elements transmitted between two different
> transforms are serialized/deserialized.
> 
> This (de)serialization process is normally managed by a Coder, of which
> the most used is obviously the SerializableCoder, which takes a
> serializable object and (de)serializes it using classical Java mechanisms.
> 
> On the direct runner, I had issues with RabbitMq messages, as they contain
> in their headers objects that are LongString, an interface implemented
> solely in a private static class of RabbitMq, and used for large text
> messages.
> 
> So I wrote a RabbitMqMessageCoder, and installed it in my pipeline
> (using
> pipeline.getCoderRegistry().registerCoderForClass(RabbitMqMessage.class,
> new MyCoder())).
> 
> And it worked! Well, not in the Dataflow runner.
> 
> Indeed, it seems like the Dataflow runner doesn't use this coder registry
> mechanism (for reasons I absolutely don't understand).
> 
> So my fix didn't work.
> 
> After various tries, I finally gave up and directly modified the
> RabbitMqIO class (from Apache Beam) to handle my case.
> 
> This fix is available on my Beam fork on GitHub, and I would like to
> have it integrated.
> 
> What is the procedure to do so?
> 
> Thanks!
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
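
A sketch of the two ways to attach such a coder - the registry call Nicolas
used, and pinning the coder on the PCollection itself, which does not depend
on runner-side registry lookups. RabbitMqMessageCoder is the user's own class
and the URI/queue values are placeholders:

  // Option 1: register the coder globally (worked on the direct runner).
  pipeline
      .getCoderRegistry()
      .registerCoderForClass(RabbitMqMessage.class, new RabbitMqMessageCoder());

  // Option 2: set the coder explicitly on the PCollection, bypassing the registry.
  PCollection<RabbitMqMessage> messages =
      pipeline
          .apply(RabbitMqIO.read()
              .withUri("amqp://guest:guest@localhost:5672")
              .withQueue("events"))
          .setCoder(new RabbitMqMessageCoder());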


Re: kafka 0.9 support

2019-04-03 Thread Jean-Baptiste Onofré
 we should fail with a clear error
> message telling users to use the Kafka 0.9 IO.
> 
> On Tue, Apr 2, 2019 at 9:34 AM Alexey Romanenko wrote:
> 
> > How are multiple versions of Kafka supported? Are
> they all in one client, or is there a case for forks
> like ElasticSearchIO?
> 
> They are supported in one client, but we have an
> additional "ConsumerSpEL" adapter which unifies
> interface differences among different Kafka client
> versions (mostly to support the old ones, 0.9-0.10.0).
> 
> On the other hand, we warn users in the Javadoc of
> KafkaIO (which is Unstable, btw) with the following:
> "KafkaIO relies on kafka-clients for all its
> interactions with the Kafka cluster. kafka-clients
> versions 0.10.1 and newer are supported at runtime.
> The older versions 0.9.x - 0.10.0.0 are also
> supported, but are deprecated and likely to be
> removed in the near future."
> 
> Personally, I'd prefer to have only one unified
> client interface but, since people still use Beam
> with old Kafka instances, we likely should stick
> with it till Beam 3.0.
> 
> WDYT?
> 
>> On 2 Apr 2019, at 02:27, Austin Bennett wrote:
>>
>> FWIW -- 
>>
>> On my (desired, not explicitly job-function)
>> roadmap is to tap into a bunch of our corporate
>> Kafka queues to ingest that data to places I can
>> use.  Those are 'stuck' on 0.9, with no upgrade in
>> sight (am told the upgrade path isn't trivial, it
>> carries very critical flows, and they are scared for it
>> to break, so it just sits behind firewalls, etc). 
>> But, I wouldn't begin that for probably at least
>> another quarter.  
>>
>> I don't contribute to nor understand the burden of
>> maintaining the support for the older version, so
>> can't reasonably lobby for that continued pain.  
>>
>> Anecdotally, this could be a place many
>> enterprises are at (though I also wonder whether
>> many of the people that would be 'stuck' on such
>> versions would also have Beam on their current
>> radar).  
>>
>>
>> On Mon, Apr 1, 2019 at 2:29 PM Kenneth Knowles wrote:
>>
>> This could be a backward-incompatible change,
>> though that notion has many interpretations.
>> What matters is user pain. Technically if we
>> don't break the core SDK, users should be able
>> to use Java SDK >=2.11.0 with KafkaIO 2.11.0
>> forever.
>>
>> How are multiple versions of Kafka supported?
>> Are they all in one client, or is there a case
>> for forks like ElasticSearchIO?
>>
>> Kenn
>>
>> On Mon, Apr 1, 2019 at 10:37 AM Jean-Baptiste Onofré wrote:
>>
>> +1 to remove 0.9 support.
>>
>> I think it's more interesting to test and
>> verify Kafka 2.2.0 than 0.9 ;)
>>
>> Regards
>> JB
>>
>> On 01/04/2019 19:36, David Morávek wrote:
>> > Hello,
>> >
>> > is there still a reason to keep Kafka
>> 0.9

Re: JDBCIO Connection Pooling

2019-04-01 Thread Jean-Baptiste Onofré
Hi,

You can directly provide the datasource in the datasource configuration;
the IO should then use that datasource directly (not rewrap it). It's the way
I wrote the IO initially.

IMHO, the buildDatasource() method should not systematically wrap the
datasource as a poolable one. We should add an option in the datasource
configuration to let the user control whether or not to wrap it as a
poolable datasource.

If you don't mind I will create a Jira about that and work on it.

Thoughts ?

Regards
JB

On 29/03/2019 02:53, hgu2hw+2g0aed6fdo...@guerrillamail.com wrote:
> Hello, I have recently created a streaming Google Dataflow program with 
> Apache Beam using the Java SDK. When files land in cloud storage they fire 
> off pubsub messages with the filename, which I consume and then write to a 
> Cloud SQL database. Everything works great for the most part. However I've 
> been testing it more thoroughly recently and noticed that if I start reading 
> in multiple files, the database connections slowly grow and grow until they 
> hit the default limit of 100 connections. Strangely the idle connections 
> never seem to disappear, and the program might run for hours watching for 
> pubsub messages, so this creates a problem. 
> 
> My initial idea was to create a c3p0 connection pool and pass that in as the 
> datasource through the JdbcIO.DataSourceConfiguration.create method. I 
> noticed this didn't seem to make a difference, which perplexed me even with my 
> aggressive pool settings. After some debugging I noticed that the 
> datasource was still being wrapped in a pooling datasource... even though it 
> already is a pooled datasource. I was wondering what strangeness this caused, 
> so locally I hacked JdbcIO to just return my c3p0 datasource and do nothing 
> else in the buildDatasource method ( 
> https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
>  - line 331). It seemed to alleviate the connection problems and now I see 
> the idle connections slowly start disappearing in Cloud SQL. Everything 
> appears to be working smoothly. Obviously this isn't the solution I want 
> moving forward. Is there some other way to achieve this? What grave mistakes 
> have I made by bypassing the standard way of doing it?
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
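
A sketch of JB's suggestion: build the pool once and hand it straight to the
configuration. This assumes c3p0, DataSourceConfiguration.create(DataSource)
expects a serializable datasource, and all values below are illustrative:

  import com.mchange.v2.c3p0.ComboPooledDataSource;
  import org.apache.beam.sdk.io.jdbc.JdbcIO;

  static JdbcIO.DataSourceConfiguration pooledConfig() throws Exception {
    ComboPooledDataSource pool = new ComboPooledDataSource();
    pool.setDriverClass("org.postgresql.Driver");
    pool.setJdbcUrl("jdbc:postgresql://host:5432/db");
    pool.setUser("user");
    pool.setPassword("password");
    pool.setMaxPoolSize(10);  // keep well below the server's connection limit
    pool.setMaxIdleTime(60);  // reclaim idle connections after 60 seconds
    // Hand the pooled datasource to JdbcIO directly, instead of letting
    // buildDatasource() re-wrap a plain datasource in its own pooling layer.
    return JdbcIO.DataSourceConfiguration.create(pool);
  }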


Re: Moving to spark 2.4

2018-12-06 Thread Jean-Baptiste Onofré
Hi Vishwas

Yes, I already started the update.

Regards
JB

On 06/12/2018 07:39, Vishwas Bm wrote:
> Hi,
> 
> Currently I see that the Spark version dependency used in Beam is
> "2.3.2".
> As Spark 2.4 is released now, is there a plan to upgrade the Beam Spark
> dependency?
> 
> 
> Thanks & Regards,
> Vishwas
> Mob : 9164886653

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: ElasticIO retry configuration exception

2018-10-10 Thread Jean-Baptiste Onofré
Hi Wout,

What's the Elasticsearch version? (Just to try to reproduce.)

Thanks,
Regards
JB

On 10/10/2018 15:31, Wout Scheepers wrote:
> Hey all,
> 
>  
> 
> When using .withRetryConfiguration() for ElasticsearchIO, I get the
> following stacktrace:
> 
>  
> 
> Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException:
> No content to map due to end-of-input
> at [Source: (org.apache.http.nio.entity.ContentInputStream); line: 1, column: 0]
> at com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
> at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:4133)
> at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3988)
> at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3058)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.parseResponse(ElasticsearchIO.java:167)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO.checkForErrors(ElasticsearchIO.java:171)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.flushBatch(ElasticsearchIO.java:1213)
> at org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO$Write$WriteFn.finishBundle(ElasticsearchIO.java:1183)
> 
>  
> 
> I’ve been breaking my head on this one.
> 
> Apparently the elastic Response object can’t be parsed anymore in the
> checkForErrors() method.
> 
> However, it is parsed successfully in the default RetryPredicate’s test
> method, which is called in flushBatch() in the if clause related to the
> retryConfig (ElasticsearchIO:1201).
> 
> As far as I know, the Response object is not altered.
> 
>  
> 
> Any clues why this doesn’t work for me?
> 
> I really need this feature, as inserting 40M documents into elastic
> results in too many retry timeouts ☺.
> 
>  
> 
> Thanks!
> Wout
> 
>  
> 
>  
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
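
For context, a sketch of the retry configuration under discussion; the
address, index, type and retry values are illustrative:

  import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
  import org.joda.time.Duration;

  // Retry throttled bulk writes up to 5 times, within 2 minutes overall.
  docs.apply(
      ElasticsearchIO.write()
          .withConnectionConfiguration(
              ElasticsearchIO.ConnectionConfiguration.create(
                  new String[] {"http://localhost:9200"}, "my-index", "my-type"))
          .withRetryConfiguration(
              ElasticsearchIO.RetryConfiguration.create(5, Duration.standardMinutes(2))));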


Re: Issue with GroupByKey in BeamSql using SparkRunner

2018-10-10 Thread Jean-Baptiste Onofré
It's maybe related: I have a pipeline (streaming with sliding windows)
that works fine with the Direct and Flink runners, but I don't get any
results when using the Spark runner.

I'm going to investigate this using my beam-samples.

Regards
JB

On 10/10/2018 11:16, Ismaël Mejía wrote:
> Are you trying this in a particular Spark distribution or just locally?
> I ask this because there was a data corruption issue with Spark 2.3.1
> (the previous version used by Beam):
> https://issues.apache.org/jira/browse/SPARK-23243
> 
> Current Beam master (and next release) moves Spark to version 2.3.2
> and that should fix some of the data correctness issues (maybe yours
> too).
> Can you give it a try and report back if this fixes your issue.
> 
> 
> On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm  wrote:
>>
>> Hi Kenn,
>>
>> We are using Beam 2.6 and using Spark_submit to submit jobs to Spark 2.2 
>> cluster on Kubernetes.
>>
>>
>> On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles  wrote:
>>>
>>> Thanks for the report! I filed 
>>> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>>>
>>> Can you share what version of Beam you are using?
>>>
>>> Kenn
>>>
>>> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm  wrote:
>>>>
>>>> We are trying to set up a pipeline using BeamSql, and the trigger used 
>>>> is the default (AfterWatermark crosses the window).
>>>> Below is the pipeline:
>>>>
>>>>KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql 
>>>> ---> KafkaSink (KafkaIO)
>>>>
>>>> We are using Spark Runner for this.
>>>> The BeamSql query is:
>>>>  select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY 
>>>> Col3
>>>>
>>>> We are grouping by Col3 which is a string. It can hold values string[0-9].
>>>>
>>>> The records are getting emitted out at 1 min to kafka sink, but the output 
>>>> record in kafka is not as expected.
>>>> Below is the output observed: (WST and WET are indicators for window start 
>>>> time and window end time)
>>>>
>>>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>>> +","WET":"2018-10-09  09-56-00   +"}
>>>>
>>>> We ran the same pipeline using the direct and flink runners and we don't see 0 
>>>> entries for count_col1.
>>>>
>>>> As per the beam capability matrix page 
>>>> (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
>>>> GroupBy is not fully supported; is this one of those cases?
>>>> Thanks & Regards,
>>>> Vishwas
>>>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam UI job creator

2018-10-09 Thread Jean-Baptiste Onofré
Hi,

I  don't know any open source UI right now.

Regards
JB

On 09/10/2018 04:09, Karan Kumar wrote:
> Hi Juan and Jean,
> Thanks for the reply. We were looking to adopt an open source codebase.
> Any pointers in that direction?
> 
> 
On Mon, Oct 8, 2018 at 9:05 PM Jean-Baptiste Onofré wrote:
> 
> Hi
> 
> We have such a tool at Talend (named datastreams), already available
> (beta) as an Amazon AMI.
> 
> Regards
> JB
> On 8 Oct 2018, at 12:24, Karan Kumar wrote:
> 
> Hello
> 
> We want to expose a GUI for our engineers/business analysts to
> create real time pipelines using drag and drop constructs.
> Projects such as https://github.com/TouK/nussknacker for flink
> and https://github.com/hortonworks/streamline for storm match
> our requirements.
> 
> We wanted to understand if a UI job creator is on the road map
> for the beam community or 
> if there are any projects which have taken a stab at solving
> this problem.
> 
> -- 
> Thanks
> Karan 
> 
> 
> 
> -- 
> Thanks
> Karan

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam UI job creator

2018-10-08 Thread Jean-Baptiste Onofré
Hi

We have such a tool at Talend (named datastreams), already available (beta) as 
an Amazon AMI.

Regards
JB

On 8 Oct 2018 at 12:24, Karan Kumar wrote:
>Hello
>
>We want to expose a GUI for our engineers/business analysts to create
>real
>time pipelines using drag and drop constructs. Projects such as
>https://github.com/TouK/nussknacker for flink and
>https://github.com/hortonworks/streamline for storm match our
>requirements.
>
>We wanted to understand if a UI job creator is on the road map for the
>beam
>community or
>if there are any projects which have taken a stab at solving this
>problem.
>
>-- 
>Thanks
>Karan


Re: FYI on Slack Channels

2018-10-08 Thread Jean-Baptiste Onofré
Hi

If you don't have an apache.org e-mail address, you have to be invited.

What's your email (to use for slack) ?

Regards
JB

On 8 Oct 2018 at 17:42, "Filip Popić" wrote:
>How can we join the Slack workspace?
>
>https://the-asf.slack.com/signup is accepting only apache emails, and the
>join link on
>
>https://beam.apache.org/community/contact-us/
>
>is not working anymore.
>
>On Wed, 27 Jun 2018 at 01:01, Jean-Baptiste Onofré wrote:
>
>> Great idea. Thanks !
>>
>> Regards
>> JB
>> On 27 Jun 2018, at 06:51, Griselda Cuevas wrote:
>>>
>>> This is awesome, thanks Rafael!
>>>
>>>
>>>
>>>
>>> On Tue, 26 Jun 2018 at 13:46, Scott Wegner 
>wrote:
>>>
>>>> This is great, thanks Rafael!
>>>>
>>>> On Tue, Jun 26, 2018 at 6:45 AM Rafael Fernandez
>
>>>> wrote:
>>>>
>>>>> Ah! Didn't know -- thanks Romain!
>>>>>
>>>>> Done for all channels I could find. Also, here is a list of
>channels:
>>>>>
>>>>> #beam
>>>>> #beam-events-meetups
>>>>> #beam-go
>>>>> #beam-java
>>>>> #beam-portability
>>>>> #beam-python
>>>>> #beam-sql
>>>>> #beam-testing
>>>>>
>>>>>
>>>>> On Tue, Jun 26, 2018 at 1:18 AM Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> +1 sounds very good
>>>>>>
>>>>>> side note: any channel must invite @asfarchivebot, I did it for
>the
>>>>>> ones before "etc" but if you add others please ensure it is done
>>>>>>
>>>>>> Romain Manni-Bucau
>>>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>> <http://rmannibucau.wordpress.com> | Github
>>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>>>>
><https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>
>>>>>>
>>>>>> Le mar. 26 juin 2018 à 01:05, Lukasz Cwik  a
>écrit :
>>>>>>
>>>>>>> +user@beam.apache.org 
>>>>>>>
>>>>>>> On Mon, Jun 25, 2018 at 4:04 PM Rafael Fernandez
>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I took the liberty to create area-specific channels (such as
>>>>>>>> #beam-java, #beam-python, #beam-go, etc.) As our project and
>community
>>>>>>>> grows, I am seeing more and more "organic" interest groups
>forming -- this
>>>>>>>> may help us chat more online. If they don't, we can delete
>later.
>>>>>>>>
>>>>>>>> Any thoughts? (I am having second thoughts... #beam-go should
>>>>>>>> probably be #beam-burrow ;p )
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> r
>>>>>>>>
>>>>>>>


Re: Beam SparkRunner and Spark KryoSerializer problem

2018-10-03 Thread Jean-Baptiste Onofré
Hi Juan

I'm on it.

Regards
JB

On 4 Oct 2018 at 07:19, Juan Carlos Garcia wrote:
>Bump,
>
>can someone from the core-dev provide a feedback about:
>   https://issues.apache.org/jira/browse/BEAM-4597
>
>Thanks
>
>On Mon, Jul 30, 2018 at 3:15 PM Juan Carlos Garcia
>
>wrote:
>
>> Hi Jean,
>>
>> Thanks for taking a look.
>>
>>
>> On Mon, Jul 30, 2018 at 2:49 PM, Jean-Baptiste Onofré wrote:
>>
>>> Hi Juan,
>>>
>>> it seems that this has been introduced by the metrics layer in the core
>>> runner API.
>>>
>>> Let me check.
>>>
>>> Regards
>>> JB
>>>
>>> On 30/07/2018 14:47, Juan Carlos Garcia wrote:
>>> > Bump!
>>> >
>>> > Does any of the core-dev roam around here?
>>> >
>>> > Can someone provide feedback about BEAM-4597
>>> > (https://issues.apache.org/jira/browse/BEAM-4597)?
>>> >
>>> > Thanks and regards,
>>> >
>>> > On Thu, Jul 19, 2018 at 3:41 PM, Juan Carlos Garcia <jcgarc...@gmail.com> wrote:
>>> >
>>> > Folks,
>>> >
>>> > Is someone out there using the SparkRunner with the Spark
>>> > KryoSerializer?
>>> >
>>> > We are being forced to use the not-so-efficient 'JavaSerializer' with
>>> > Spark because we face the following exception:
>>> >
>>> > 
>>> > Exception in thread "main" java.lang.RuntimeException:
>>> > org.apache.spark.SparkException: Job aborted due to stage failure:
>>> > Exception while getting task result:
>>> > com.esotericsoftware.kryo.KryoException: Unable to find class:
>>> > org.apache.beam.runners.core.metrics.MetricsContainerImpl$$Lambda$31/1875283985
>>> > Serialization trace:
>>> > factory (org.apache.beam.runners.core.metrics.MetricsMap)
>>> > counters (org.apache.beam.runners.core.metrics.MetricsContainerImpl)
>>> > metricsContainers (org.apache.beam.runners.core.metrics.MetricsContainerStepMap)
>>> > metricsContainers (org.apache.beam.runners.spark.io.SparkUnboundedSource$Metadata)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.runtimeExceptionFrom(SparkPipelineResult.java:55)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:72)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.access$000(SparkPipelineResult.java:41)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult$StreamingMode.stop(SparkPipelineResult.java:163)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.offerNewState(SparkPipelineResult.java:198)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:101)
>>> > at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
>>> > at org.apache.beam.examples.BugWithKryoOnSpark.main(BugWithKryoOnSpark.java:75)
>>> > 
>>> > I created a jira ticket and attached a project example on it:
>>> > https://issues.apache.org/jira/browse/BEAM-4597
>>> >
>>> > Any feedback is appreciated.
>>> >
>>> > --
>>> >
>>> > JC
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > JC
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>>
>> --
>>
>> JC
>>
>>
>
>--
>
>JC


Re: Agenda for the Beam Summit London 2018

2018-09-27 Thread Jean-Baptiste Onofré
Great !! Thanks Gris.

Looking forward to seeing you all next Monday in London.

Regards
JB

On 27 Sept 2018 at 18:03, Griselda Cuevas wrote:
>Hi Beam Community,
>
>We have finalized the agenda for the Beam Summit London 2018, it's
>here:
>https://www.linkedin.com/feed/update/urn:li:activity:6450125487321735168/
>
>
>We had a great number of talk proposals - thank you so much to everyone who
>submitted one! We also sold out the event, so we're very excited to see the
>community growing.
>
>
>See you around,
>
>Gris on behalf of the Organizing Committee


Re: Filtering data using external source.

2018-07-30 Thread Jean-Baptiste Onofré
Hi Jose,

so basically, you create two PCollections with the same keys and then
you join/filter/flatten?

Regards
JB

On 30/07/2018 15:09, Jose Bermeo wrote:
> Hi, question guys.
> 
> I have to filter an unbounded collection based on data from a Redshift
> DB. I cannot use a side input as the Redshift data could change. One way to
> do it would be to group common elements, make a query to filter each
> group, and finally flatten the pipe again. Do you know if this is the best
> way to do it? And what would be the way to run the query against Redshift?
> 
> Thanks.

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
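
A hedged sketch of the batching approach Jose describes - key the events,
batch them, then issue one Redshift query per batch inside a DoFn. Event,
getFilterKey, getId and queryRedshift are placeholders for the user's own
types and JDBC lookup:

  import java.util.Set;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.GroupIntoBatches;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  PCollection<Event> filtered =
      events
          .apply(WithKeys.of((Event e) -> e.getFilterKey())
              .withKeyType(TypeDescriptors.strings()))
          .apply(GroupIntoBatches.ofSize(500))
          .apply(ParDo.of(new DoFn<KV<String, Iterable<Event>>, Event>() {
            @ProcessElement
            public void process(ProcessContext c) {
              // One JDBC round trip per batch instead of per element.
              Set<String> allowed = queryRedshift(c.element().getKey());
              for (Event e : c.element().getValue()) {
                if (allowed.contains(e.getId())) {
                  c.output(e);
                }
              }
            }
          }));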


Re: Beam SparkRunner and Spark KryoSerializer problem

2018-07-30 Thread Jean-Baptiste Onofré
Hi Juan,

it seems that this has been introduced by the metrics layer in the core runner
API.

Let me check.

Regards
JB

On 30/07/2018 14:47, Juan Carlos Garcia wrote:
> Bump!
> 
> Does any of the core-dev roam around here?
> 
> Can someone provide feedback about BEAM-4597
> (https://issues.apache.org/jira/browse/BEAM-4597)?
> 
> Thanks and regards,
> 
> On Thu, Jul 19, 2018 at 3:41 PM, Juan Carlos Garcia  <mailto:jcgarc...@gmail.com>> wrote:
> 
> Folks,
> 
> Is someone out there using the SparkRunner with the Spark
> KryoSerializer?
> 
> We are being forced to use the not-so-efficient 'JavaSerializer' with
> Spark because we face the following exception:
> 
> 
> Exception in thread "main" java.lang.RuntimeException:
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> com.esotericsoftware.kryo.KryoException: Unable to find class:
> org.apache.beam.runners.core.metrics.MetricsContainerImpl$$Lambda$31/1875283985
> Serialization trace:
> factory (org.apache.beam.runners.core.metrics.MetricsMap)
> counters (org.apache.beam.runners.core.metrics.MetricsContainerImpl)
> metricsContainers (org.apache.beam.runners.core.metrics.MetricsContainerStepMap)
> metricsContainers (org.apache.beam.runners.spark.io.SparkUnboundedSource$Metadata)
> at org.apache.beam.runners.spark.SparkPipelineResult.runtimeExceptionFrom(SparkPipelineResult.java:55)
> at org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:72)
> at org.apache.beam.runners.spark.SparkPipelineResult.access$000(SparkPipelineResult.java:41)
> at org.apache.beam.runners.spark.SparkPipelineResult$StreamingMode.stop(SparkPipelineResult.java:163)
> at org.apache.beam.runners.spark.SparkPipelineResult.offerNewState(SparkPipelineResult.java:198)
> at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:101)
> at org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at org.apache.beam.examples.BugWithKryoOnSpark.main(BugWithKryoOnSpark.java:75)
> 
> 
> I created a jira ticket and attached a project example on it:
> https://issues.apache.org/jira/browse/BEAM-4597
> 
> Any feedback is appreciated.
> 
> -- 
> 
> JC 
> 
> 
> 
> 
> -- 
> 
> JC 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Routing events by key

2018-07-07 Thread Jean-Baptiste Onofré
Hi Raghu,

AFAIR, Reshuffle is considered deprecated. Maybe it would be better
to avoid using it, no?

Regards
JB

On 06/07/2018 18:28, Raghu Angadi wrote:
> I would use Reshuffle()[1] with entity id as the key. It internally does
> a GroupByKey and sets up windowing such that it does not buffer anything.
> 
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L51
>  
> 
> On Fri, Jul 6, 2018 at 8:01 AM Niels Basjes <ni...@basjes.nl> wrote:
> 
> Hi,
> 
> I have an unbounded stream of change events each of which has the id
> of the entity that is changed.
> To avoid the need for locking in the persistence layer that is
> needed in part of my processing I want to route all events based on
> this entity id.
> That way I know for sure that all events around a single entity go
> through the same instance of my processing sequentially, hence no
> need for locking or other synchronization regarding this persistence.
> 
> At this point my best guess is that I need to use the GroupByKey but
> that seems to need a Window. 
> I think I don't want a window because the stream is unbounded and I
> want the lowest possible latency (i.e. a Window of 1 second would be
> ok for this usecase).
> Also I want to be 100% sure that all events for a specific id go to
> only a single instance because I do not want any race conditions.
> 
> My simple question is: What does that look like in the Beam Java API?
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
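
A sketch of the keyed Reshuffle Raghu describes (with JB's caveat above that
Reshuffle is nominally deprecated). Event and getEntityId are placeholders,
and Event is assumed serializable:

  import org.apache.beam.sdk.transforms.Reshuffle;
  import org.apache.beam.sdk.transforms.WithKeys;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TypeDescriptors;

  // All events for one entity id land on the same keyed shard downstream,
  // so per-entity processing happens sequentially without explicit locking.
  PCollection<KV<String, Event>> routed =
      events
          .apply(WithKeys.of((Event e) -> e.getEntityId())
              .withKeyType(TypeDescriptors.strings()))
          .apply(Reshuffle.of());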


Re: Routing events by key

2018-07-06 Thread Jean-Baptiste Onofré
Hi Niels,

as you have an Unbounded PCollection, you need a Window to GroupByKey
and then "forward" the data.

Another option would be to use a DoFn working element by element and
eventually batching them. It's what most of the IOs do for the
Write part.

Regards
JB

On 06/07/2018 17:01, Niels Basjes wrote:
> Hi,
> 
> I have an unbounded stream of change events each of which has the id of
> the entity that is changed.
> To avoid the need for locking in the persistence layer that is needed in
> part of my processing I want to route all events based on this entity id.
> That way I know for sure that all events around a single entity go
> through the same instance of my processing sequentially, hence no need
> for locking or other synchronization regarding this persistence.
> 
> At this point my best guess is that I need to use the GroupByKey but
> that seems to need a Window. 
> I think I don't want a window because the stream is unbounded and I want
> the lowest possible latency (i.e. a Window of 1 second would be ok for
> this usecase).
> Also I want to be 100% sure that all events for a specific id go to only
> a single instance because I do not want any race conditions.
> 
> My simple question is: What does that look like in the Beam Java API?
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[ANN] Apache Beam 2.5.0 has been released!

2018-06-30 Thread Jean-Baptiste Onofré
The Apache Beam team is pleased to announce the release of 2.5.0 version!

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes the following major new features & improvements:
- Go SDK support
- new ParquetIO
- Build migrated to Gradle (including for the release)
- Improvements on Nexmark, such as Kafka support
- Improvements on the Beam SQL DSL
- Improvements on Portability
- New generic metrics support, pushed to all runners

You can take a look at the following blog post and Release Notes for
details:

https://beam.apache.org/blog/2018/06/26/beam-2.5.0.html

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847

Enjoy !!

--
JB on behalf of The Apache Beam team


Re: FYI on Slack Channels

2018-06-26 Thread Jean-Baptiste Onofré
Great idea. Thanks !

Regards
JB

On 27 Jun 2018 at 06:51, Griselda Cuevas wrote:
>This is awesome, thanks Rafael!
>
>
>
>
>On Tue, 26 Jun 2018 at 13:46, Scott Wegner  wrote:
>
>> This is great, thanks Rafael!
>>
>> On Tue, Jun 26, 2018 at 6:45 AM Rafael Fernandez
>
>> wrote:
>>
>>> Ah! Didn't know -- thanks Romain!
>>>
>>> Done for all channels I could find. Also, here is a list of
>channels:
>>>
>>> #beam
>>> #beam-events-meetups
>>> #beam-go
>>> #beam-java
>>> #beam-portability
>>> #beam-python
>>> #beam-sql
>>> #beam-testing
>>>
>>>
>>> On Tue, Jun 26, 2018 at 1:18 AM Romain Manni-Bucau wrote:
>>>
 +1 sounds very good

 side note: any channel must invite @asfarchivebot, I did it for the
>ones
 before "etc" but if you add others please ensure it is done

 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
  | Book

>


 On Tue, 26 Jun 2018 at 01:05, Lukasz Cwik wrote:

> +user@beam.apache.org 
>
> On Mon, Jun 25, 2018 at 4:04 PM Rafael Fernandez
>
> wrote:
>
>> Hello!
>>
>> I took the liberty to create area-specific channels (such as
>> #beam-java, #beam-python, #beam-go, etc.) As our project and
>community
>> grows, I am seeing more and more "organic" interest groups
>forming -- this
>> may help us chat more online. If they don't, we can delete later.
>>
>> Any thoughts? (I am having second thoughts... #beam-go should
>probably
>> be #beam-burrow ;p )
>>
>> Cheers,
>> r
>>
>


Re: Dose beam-2.3.0 support spark 1.6.x?

2018-06-26 Thread Jean-Baptiste Onofré
Hi

Since Beam 2.3.0, only Spark 2.x is supported.

If you want to use Spark 1.x, you have to use Beam 2.2.0.

Regards
JB

On 26 Jun 2018 at 13:06, Reminia Scarlet wrote:
>In the release notes of 2.3.0, it seems that Spark 2.x is supported.
>
>How about compatibility with Spark 1.6.x?


Re: Where is the mapreduce runner?

2018-06-20 Thread Jean-Baptiste Onofré
Hi,

as said, the M/R runner should be updated to leverage the portability
layer (especially Job API).

I will create the Jira and I will take look to evaluate the effort.

Regards
JB

On 20/06/2018 12:37, Reminia Scarlet wrote:
> I'd be very glad to help if there's anything I can do to push the release of
> the MR runner. 
> 
> 
> 
> 
> On Wed, Jun 20, 2018 at 4:53 PM, Jean-Baptiste Onofré wrote:
> 
> Hi,
> 
> Pei and I worked on the MapReduce runner. There's some work to be done
> (like leveraging the new Job API, etc).
> 
> No short-term plan to include on master for now (so no release soon).
> I'm working on this runner only on spare time.
> 
> If you are interested to contribute, please let me know.
> 
> Thanks,
> Regards
> JB
> 
> On 20/06/2018 08:23, Reminia Scarlet wrote:
> > @Kenneth  When will the runner be released?  Does the community have a
> > plan on that ?
> > 
> > I've read the second link. The MR runner artifact cant be found in maven
> > central. Maybe it 
> > 
> > has not beed deployed to maven.
> > 
> > On Wed, Jun 20, 2018 at 11:49 AM, Kenneth Knowles wrote:
> > 
> >     The MapReduce runner is still on a feature branch.
> >     See https://beam.apache.org/contribute/#mapreduce-runner
> > 
> >     We should remove unreleased runners from the capability matrix.
> > 
> >     May I ask how you arrived at
> >     https://beam.apache.org/documentation/runners/mapreduce/ ?
> > 
> >     Kenn
> > 
> >     On Tue, Jun 19, 2018 at 8:36 PM Reminia Scarlet wrote:
> > 
> >         I've found that the mapreduce runner is supported in beam from
> >         the links below:
> > 
> >         https://beam.apache.org/documentation/runners/mapreduce/
> >         https://beam.apache.org/documentation/runners/capability-matrix/
> > 
> >         But I can't find the runner on maven.
> >
> >
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org <mailto:jbono...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Open Source Challenge, Guadalajara on August 2nd.

2018-06-20 Thread Jean-Baptiste Onofré
Hi,

You can count on me to help new contributors!

I'm pretty sure that others will jump on too.

Thanks !

I'm looking forward new contributions !

Regards
JB

On 20/06/2018 15:10, Arianne Lisset Navarro Lepe wrote:
> Hi Beam Community,
>  
> Here in Mexico we are starting a consortium to foster an
> open-source-first culture; we have named it OSOM (Open Source Mexico).
>  
> As part of our activities, we are hosting the Open Source Challenge
> <https://docs.google.com/document/d/1jvwcpzn1mp5xCqcDnqcxne-1_RS5zY-DH35-Wr5eTng/edit?usp=sharing>
> on August 2nd at the IBM Guadalajara offices, with the intention to
> bring on board new contributors to the open community.
>  
> We are looking for Beam coaches who can provide (remote) guidance to
> the teams that choose Beam for the challenge.
>  
> Take a look at the event description; we have some blank spaces on
> page 4 for the coaches and speakers. Feel free to write your name =)
> and then we can coordinate times.
>  
> https://docs.google.com/document/d/1jvwcpzn1mp5xCqcDnqcxne-1_RS5zY-DH35-Wr5eTng/edit?usp=sharing
> 
> Thanks
>  
>  
> - Arianne Navarro
>  
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: ParquetIO javadocs

2018-06-20 Thread Jean-Baptiste Onofré
Hi,

ParquetIO will be included in the 2.5.0 release (currently in vote). The
website PR updating the Javadoc has been created, but it's on hold awaiting
the completion of the 2.5.0 release.

Regards
JB

On 20/06/2018 16:38, Akanksha Sharma B wrote:
> Hi,
> 
> 
> From the built-in I/O transforms list
> (https://beam.apache.org/documentation/io/built-in/), I can see that Parquet
> is supported. However, I could not find its javadocs.
> 
> 
> Can you please help and direct me to the javadocs for ParquetIO?
> 
> 
> Regards,
> 
> Akanksha
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Where is the mapreduce runner?

2018-06-20 Thread Jean-Baptiste Onofré
Hi,

Pei and I worked on the MapReduce runner. There's some work to be done
(like leveraging the new Job API, etc).

No short-term plan to include on master for now (so no release soon).
I'm working on this runner only on spare time.

If you are interested to contribute, please let me know.

Thanks,
Regards
JB

On 20/06/2018 08:23, Reminia Scarlet wrote:
> @Kenneth  When will the runner be released? Does the community have a
> plan on that?
> 
> I've read the second link. The MR runner artifact can't be found in Maven
> Central. Maybe it has not been deployed to Maven.
> 
> On Wed, Jun 20, 2018 at 11:49 AM, Kenneth Knowles wrote:
> 
> The MapReduce runner is still on a feature branch.
> See https://beam.apache.org/contribute/#mapreduce-runner
> 
> We should remove unreleased runners from the capability matrix.
> 
> May I ask how you arrived at
> https://beam.apache.org/documentation/runners/mapreduce/ ?
> 
> Kenn
> 
> On Tue, Jun 19, 2018 at 8:36 PM Reminia Scarlet wrote:
> 
> I've found that the mapreduce runner is supported in beam from
> the links below:
> 
> https://beam.apache.org/documentation/runners/mapreduce/
> https://beam.apache.org/documentation/runners/capability-matrix/
> 
> But I can't find the runner on maven.
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Custom Unbounded Sources for 2.4.0 SDK

2018-06-10 Thread Jean-Baptiste Onofré
Ok, so you should not use a native source, as it would break portability.

Instead, just use KafkaIO!

Regards
JB

On 09/06/2018 22:12, Abdul Qadeer wrote:
> Alternate way while using FlinkKafkaConsumer* (:
> Will the Beam Flink runner code need several changes for this?
> 
> On Sat, 9 Jun 2018 at 12:42, Abdul Qadeer  <mailto:quadeer@gmail.com>> wrote:
> 
> I want to use 'FlinkKafkaConsumer' with 'UnboundedFlinkSource' as
> present in the 0.6.0 SDK and its examples (
> 
> https://github.com/dataArtisans/flink-dataflow/blob/master/examples/src/main/java/com/dataartisans/flink/dataflow/examples/streaming/KafkaWindowedWordCountExample.java
> ). 
> 
> As I understand you are saying this support is not there now,
> correct? How else could I use a 'FlinkKafkaConsumer' at source
> level? Is there an alternate you would suggest if not possible?
> 
> On Sat, 9 Jun 2018 at 07:23, Jean-Baptiste Onofré  <mailto:j...@nanthrax.net>> wrote:
> 
> By the way, if you mean that your custom source is implemented for
> Flink, it's not supported. I meant the Beam source.
> 
> On 09/06/2018 09:31, Abdul Qadeer wrote:
> > Hi!
> >
> > I would like to know if there is any way I can use 2.4.0
> Beam's Source
> > API for Flink 1.4.0 runner? I have a custom unbounded source
> implemented
> > for Flink runner but I can not find the documentation to use
> it for 2.x
>     > Beam SDK. Looks like it was only supported in 0.x SDK? Any
> help appreciated.
> 
> -- 
> Jean-Baptiste Onofré
>     jbono...@apache.org <mailto:jbono...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
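
A sketch of the KafkaIO alternative JB points to; the broker, topic and
deserializers are illustrative:

  import org.apache.beam.sdk.io.kafka.KafkaIO;
  import org.apache.kafka.common.serialization.StringDeserializer;

  // Portable replacement for a runner-native FlinkKafkaConsumer source.
  pipeline.apply(
      KafkaIO.<String, String>read()
          .withBootstrapServers("broker-1:9092")
          .withTopic("events")
          .withKeyDeserializer(StringDeserializer.class)
          .withValueDeserializer(StringDeserializer.class)
          .withoutMetadata());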


Re: Custom Unbounded Sources for 2.4.0 SDK

2018-06-09 Thread Jean-Baptiste Onofré
By the way, if you mean that your custom source is implemented for
Flink, it's not supported. I meant the Beam source.

On 09/06/2018 09:31, Abdul Qadeer wrote:
> Hi!
> 
> I would like to know if there is any way I can use 2.4.0 Beam's Source
> API for Flink 1.4.0 runner? I have a custom unbounded source implemented
> for Flink runner but I can not find the documentation to use it for 2.x
> Beam SDK. Looks like it was only supported in 0.x SDK? Any help appreciated.

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Custom Unbounded Sources for 2.4.0 SDK

2018-06-09 Thread Jean-Baptiste Onofré
Hi,

Not sure I understand your question.

You can use any source with Flink runner, all in Beam 2.4.0
(IOs/SDK/runner).

You can see the usage of different IOs (sources) with different runners in
the beam-samples:

https://github.com/jbonofre/beam-samples

Regards
JB

On 09/06/2018 09:31, Abdul Qadeer wrote:
> Hi!
> 
> I would like to know if there is any way I can use 2.4.0 Beam's Source
> API for Flink 1.4.0 runner? I have a custom unbounded source implemented
> for Flink runner but I can not find the documentation to use it for 2.x
> Beam SDK. Looks like it was only supported in 0.x SDK? Any help appreciated.

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: How to process data

2018-05-28 Thread Jean-Baptiste Onofré
Hi,

I guess you want to process based on the watermark or the timestamp
(event time/processing time).

You can take a look at the beam-samples to see how to create a window with
an after-watermark trigger.

What kind of use case you want to implement ?

Regards
JB

On 28/05/2018 17:29, maxime.dej...@orange.com wrote:
> Hello, I want to process data coming from Kafka in "event time", but I
> haven't found the code for doing this in the documentation.
> 
> Could you give me some information?
> 
>  
> 
> Thank you in advance,
> 
> Maxime Dejeux
> 
> 

-- 
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
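
For the event-time case Maxime asks about, a sketch of a fixed window that
fires when the watermark passes the end of the window; the element type and
durations are illustrative:

  import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
  import org.apache.beam.sdk.transforms.windowing.FixedWindows;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.joda.time.Duration;

  // Elements are grouped by their event timestamps; each one-minute window
  // emits a single on-time pane once the watermark passes its end.
  PCollection<KV<String, Long>> windowed =
      input.apply(
          Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1)))
              .triggering(AfterWatermark.pastEndOfWindow())
              .withAllowedLateness(Duration.standardSeconds(30))
              .discardingFiredPanes());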


Re: Generate Beam Pipeline from JSON

2018-05-25 Thread Jean-Baptiste Onofré
Hi,

That's in the scope of the Declarative DSLs: XML and JSON.

It's not yet ready but it's a work in progress.

You can follow:

https://issues.apache.org/jira/browse/BEAM-14

Regards
JB

On 25/05/2018 17:15, S. Sahayaraj wrote:
> Hello,
> 
>     I would like to create a Beam pipeline (in Java) from
> the definitions given in a JSON file. Is there any Beam compiler
> available? Is there any specification that guides me on how to do it? I don't
> want to write the driver program for every pipeline. Please suggest.
> 
>  
> 
> Cheers,
> 
> S. Sahayaraj
> 

-- 
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hiring Data Engineers - $100/hr - LA

2018-05-03 Thread Jean-Baptiste Onofré
At least, we (you and I) are "active" moderators ;) I didn't see the approval
message, so you are right, maybe Michael already sent a message. We can check
with ezml-help.

Anyway, my comment is also for Michael: promoting/hiring messages should be
approved before being sent.

My €0.01 ;)

Regards
JB

On 05/04/2018 05:54 AM, Davor Bonaci wrote:
> @moderator, please be careful when you accept message.
> 
> 
> I don't think we have any active moderators. Also, moderation doesn't catches
> "content", just messages from non-subscribers. Michael is probably a 
> subscriber,
> and the message went directly, without any moderation.
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hiring Data Engineers - $100/hr - LA

2018-05-03 Thread Jean-Baptiste Onofré
Hi,

this kind of message should not go on the mailing list. @moderator, please be
careful when you accept messages. Thanks.

Regards
JB

On 05/04/2018 12:52 AM, Michael Johns wrote:
> I am looking for Data Engineers / Data Architects for a contract or contract 
> to
> hire opportunity _onsite_ in the Los Angeles area. The role requires strong
> Google Cloud Platform (Cloud DataFlow, BigQuery and BigTable - Airflow is a
> plus) and strong Java development skills. 
> 
>  The pay rate on this is actually open.  Please contact me if you are 
> interested
> or can recommend anyone.  I can also offer a referral fee!
> 
> Thanks,
> *
> *
> *Michael Johns
> *TechLink Resources, Inc
> 310-566-7153  general
> 310-717-0476  cell
> 310-566-7158  fax
> www.techlink.org
> 13101 West Washington Boulevard, Suite 246, Culver City, CA 90066

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scio 0.5.2 released

2018-04-05 Thread Jean-Baptiste Onofré
Hi Neville,

thanks for the announcement!

Any progress on the donation of Scio to Beam? Or is it out of scope on your
side?

Thanks
Regards
JB

On 04/05/2018 09:33 PM, Neville Li wrote:
> Hi all,
> 
> We just released Scio 0.5.2 with a few enhancements and bug fixes.
> 
> Cheers,
> Neville
> 
> https://github.com/spotify/scio/releases/tag/v0.5.2
> 
> /"Kobus kob"/
> 
> 
>   Features
> 
>   * Add Java Converters (#1013, #1076)
>   * Add unionAll on ScioContext to support empty lists (#1092, #1095)
>   * Support custom data validation in type-safe BigQuery (#1075)
> 
> 
>   Bug fixes
> 
>   * Fix scio-repl on Windows (#1093)
>   * Fix missing schema on AvroIO TypedWrite (#1088, #1089)
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Advice on parallelizing network calls in DoFn

2018-03-15 Thread Jean-Baptiste Onofré
By the way, you can take a look at JdbcIO, which does a reshuffle transform to 
avoid the "fusion" issue.

Regards
JB

On 15 Mar 2018 at 10:44, Raghu Angadi wrote:
>In streaming, a simple way is to add a reshuffle to increase parallelism.
>When you are external-call bound, the extra cost of the reshuffle is
>negligible. E.g.
>https://stackoverflow.com/questions/46116443/dataflow-streaming-job-not-scaleing-past-1-worker
>
>Note that by default Dataflow workers use a couple of hundred threads
>as
>required. This can be increased with a pipeline option if you prefer. I
>am
>not sure of other runners.
>
>On Thu, Mar 15, 2018 at 8:25 AM Falcon Taylor-Carter <
>fal...@bounceexchange.com> wrote:
>
>> Hello Pablo,
>>
>> Thanks for checking up (I'm working with Josh on this problem). It
>seems
>> there isn't a built-in process for this kind of use case currently,
>and
>> that the best process right now is to handle our own bundling and
>threading
>> in the DoFn. If you had any other suggestions, or anything to keep in
>mind
>> in doing this, let us know!
>>
>> Falcon
>>
>> On Tue, Mar 13, 2018 at 4:52 PM, Pablo Estrada 
>wrote:
>>
>>> I'd just like to close the loop. Josh, did you get an
>answer/guidance on
>>> how to proceed with your pipeline?
>>> Or maybe we'll need a new thread to figure that out : )
>>> Best
>>> -P.
>>>
>>>
>>> On Fri, Mar 9, 2018 at 1:39 PM Josh Ferge
>
>>> wrote:
>>>
 Hello all:

 Our team has a pipeline that makes external network calls. These
 pipelines are currently super slow, and the hypothesis is that they
>are
 slow because we are not threading for our network calls. The github
>issue
 below provides some discussion around this:

 https://github.com/apache/beam/pull/957

 In beam 1.0, there was IntraBundleParallelization, which helped
>with
 this. However, this was removed because it didn't comply with a few
>BEAM
 paradigms.

 Questions going forward:

 What is advised for jobs that make blocking network calls? It seems
 bundling the elements into groups of size X prior to passing to the
>DoFn,
 and managing the threading within the function might work.
>Thoughts?
 Are these types of jobs even suitable for Beam?
 Are there any plans to develop features that help with this?

 Thanks

>>> --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>
>>
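
To make the reshuffle suggestion above concrete, here is a minimal sketch
using the Beam Java SDK (the uppercasing stands in for a blocking network
call; class and step names are illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class ReshuffleBeforeNetworkCalls {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("a", "b", "c"))
        // Break fusion: after the reshuffle, the runner is free to spread
        // the downstream ParDo over many workers instead of fusing it with
        // the (cheap) producing step.
        .apply("BreakFusion", Reshuffle.viaRandomKey())
        .apply("CallService", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Stand-in for a blocking HTTP/RPC call.
            c.output(c.element().toUpperCase());
          }
        }));
    p.run().waitUntilFinish();
  }
}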


Re: Reducing database connection with JdbcIO

2018-03-14 Thread Jean-Baptiste Onofré
Mar 14, 2018 at 12:11 PM Aleksandr
<aleksandr...@gmail.com> wrote:

Hello, we had a similar problem. The current JdbcIO
will cause a lot of connection errors.

Typically you have more than one prepared
statement. JdbcIO will create a new connection for each
prepared statement (and close it only in
teardown). So it is possible that a connection will
time out, or, in case of auto scaling,
you will get too many connections to SQL.
Our solution was to create a connection pool in
setup, and get a connection and return it back to the pool
in processElement.

Best Regards,
    Aleksandr Gortujev.

On Mar 14, 2018 8:52 PM, "Jean-Baptiste Onofré"
<j...@nanthrax.net> wrote:

Agreed, especially using the current JdbcIO
impl that creates the connection in @Setup.
Or does it mean that @Teardown is never called ?

Regards
JB
On 14 March 2018 at 11:40, Eugene Kirpichov
<kirpic...@google.com> wrote:

Hi Derek - could you explain where
the "3000 connections" number comes from,
i.e. how did you measure it? It's weird
that 5-6 workers would use 3000 connections.

On Wed, Mar 14, 2018 at 3:50 AM Derek
Chan <derek...@gmail.com
<mailto:derek...@gmail.com>> wrote:

Hi,

We are new to Beam and need some help.

We are working on a flow that ingests
events and writes the aggregated
counts to a database. The input rate
is rather low (~2000 messages per
sec), but the processing is
relatively heavy, so we need to
scale out
to 5~6 nodes. The output (via JDBC)
is aggregated, so the volume is also
low. But because of the number of
workers, it keeps 3000 connections to
the database and keeps hitting
the database connection limits.

Is there a way that we can reduce
the concurrency only at the output
stage? (In Spark we would have done
a repartition/coalesce.)

And, if it matters, we are using
Apache Beam 2.2 via Scio, on Google
Dataflow.

Thank you in advance!









Re: Reducing database connection with JdbcIO

2018-03-14 Thread Jean-Baptiste Onofré
Agreed, especially using the current JdbcIO impl that creates the connection in
@Setup. Or does it mean that @Teardown is never called ?

Regards
JB

On 14 March 2018 at 11:40, Eugene Kirpichov wrote:
>Hi Derek - could you explain where the "3000 connections" number comes
>from, i.e. how did you measure it? It's weird that 5-6 workers would
>use
>3000 connections.
>
>On Wed, Mar 14, 2018 at 3:50 AM Derek Chan  wrote:
>
>> Hi,
>>
>> We are new to Beam and need some help.
>>
>> We are working on a flow that ingests events and writes the aggregated
>> counts to a database. The input rate is rather low (~2000 messages per
>> sec), but the processing is relatively heavy, so we need to scale
>out
>> to 5~6 nodes. The output (via JDBC) is aggregated, so the volume is
>also
>> low. But because of the number of workers, it keeps 3000 connections
>to
>> the database and it keeps hitting the database connection limits.
>>
>> Is there a way that we can reduce the concurrency only at the output
>> stage? (In Spark we would have done a repartition/coalesce).
>>
>> And, if it matters, we are using Apache Beam 2.2 via Scio, on Google
>> Dataflow.
>>
>> Thank you in advance!
>>
>>
>>
>>
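
A rough sketch of the "repartition/coalesce" effect Derek asks about, using
the Beam Java SDK: key each element into a small fixed range and group, so
that at most NUM_SHARDS bundles (and hence roughly that many JDBC
connections) are active at the write stage. The shard count and names are
illustrative, and a streaming pipeline would need a Window before the
GroupByKey:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CoalesceBeforeJdbcWrite {
  private static final int NUM_SHARDS = 10;

  // Caps the number of parallel write bundles at roughly NUM_SHARDS,
  // and therefore the number of simultaneous database connections.
  public static PCollection<KV<Integer, Iterable<String>>> coalesce(
      PCollection<String> rows) {
    return rows
        .apply("AssignShard",
            WithKeys.of((String row) -> ThreadLocalRandom.current().nextInt(NUM_SHARDS))
                .withKeyType(TypeDescriptors.integers()))
        .apply("GroupByShard", GroupByKey.<Integer, String>create());
  }
}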


Re: Reducing database connection with JdbcIO

2018-03-14 Thread Jean-Baptiste Onofré

Hi Derek,

I think you could be interested in:

https://github.com/apache/beam/pull/4461

related to BEAM-3500.

I introduced an internal poolable datasource.

I hope it could help.

Regards
JB

On 14/03/2018 11:49, Derek Chan wrote:

Hi,

We are new to Beam and need some help.

We are working on a flow that ingests events and writes the aggregated
counts to a database. The input rate is rather low (~2000 messages per
sec), but the processing is relatively heavy, so we need to scale out
to 5~6 nodes. The output (via JDBC) is aggregated, so the volume is also 
low. But because of the number of workers, it keeps 3000 connections to 
the database and it keeps hitting the database connection limits.


Is there a way that we can reduce the concurrency only at the output 
stage? (In Spark we would have done a repartition/coalesce).


And, if it matters, we are using Apache Beam 2.2 via Scio, on Google 
Dataflow.


Thank you in advance!
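
A minimal sketch of the pooling approach described in this thread (open a
pool in @Setup, borrow a connection per element), assuming Apache Commons
DBCP2 as the pool implementation; the driver, URL, credentials and SQL
statement are all illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.commons.dbcp2.BasicDataSource;

class PooledJdbcWriteFn extends DoFn<String, Void> {
  private transient BasicDataSource dataSource;

  @Setup
  public void setup() {
    // One pool per DoFn instance; connections are shared and reused
    // instead of one long-lived connection per reader/thread.
    dataSource = new BasicDataSource();
    dataSource.setDriverClassName("org.postgresql.Driver");
    dataSource.setUrl("jdbc:postgresql://db-host:5432/mydb");
    dataSource.setUsername("user");
    dataSource.setPassword("password");
    dataSource.setMaxTotal(8); // hard cap on connections from this worker
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Borrow a connection for this element only; try-with-resources
    // returns it to the pool instead of really closing it.
    try (Connection conn = dataSource.getConnection();
        PreparedStatement stmt =
            conn.prepareStatement("INSERT INTO counts (value) VALUES (?)")) {
      stmt.setString(1, c.element());
      stmt.executeUpdate();
    }
  }

  @Teardown
  public void teardown() throws Exception {
    if (dataSource != null) {
      dataSource.close();
    }
  }
}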





Re: Is Mapreduce runner available?

2018-03-08 Thread Jean-Baptiste Onofré
Hi

As Kenn said, I'm going to update the branch and add a README.

I will do that today or tomorrow.

Regards
JB

On 03/09/2018 02:28 AM, Dummy Core wrote:
> Which version of Beam can the Mapreduce runner run against? It looks like the
> development of the Mapreduce runner is suspended. I'd like to check the current
> support of the Mapreduce runner but the branch is broken. Does anyone have
> experience building it?
> 
> Thank you!
> 
> On Mar 9, 2018 06:17, "Kenneth Knowles" <k...@google.com 
> <mailto:k...@google.com>>
> wrote:
> 
> I've filed https://issues.apache.org/jira/browse/BEAM-3814
> <https://issues.apache.org/jira/browse/BEAM-3814> to update both of those
> links to make it more clear that this is just a prototype checked into a 
> dev
> branch that is not available in any release.
> 
> Kenn
> 
> On Thu, Mar 8, 2018 at 8:15 AM Jean-Baptiste Onofré <j...@nanthrax.net
> <mailto:j...@nanthrax.net>> wrote:
> 
> Hi,
> 
> Yes it has been merged on a feature branch:
> 
> https://github.com/apache/beam/pull/3705
> <https://github.com/apache/beam/pull/3705>
> 
> https://github.com/apache/beam/tree/mr-runner
> <https://github.com/apache/beam/tree/mr-runner>
> 
> It's not yet merged on master (main upstream).
> 
> Regards
> JB
> 
> On 03/08/2018 05:10 PM, Dummy Core wrote:
> > Hi all,
> >
> > In https://beam.apache.org/documentation/runners/capability-matrix/
> <https://beam.apache.org/documentation/runners/capability-matrix/>
> > <https://beam.apache.org/documentation/runners/capability-matrix/
> <https://beam.apache.org/documentation/runners/capability-matrix/>>, 
> it
> looks
> > like Mapreduce runner is already supported for batch processing. But
> there's
> > little helpful information
> > in https://beam.apache.org/documentation/runners/mapreduce/
> <https://beam.apache.org/documentation/runners/mapreduce/>
> > <https://beam.apache.org/documentation/runners/mapreduce/
> <https://beam.apache.org/documentation/runners/mapreduce/>> to explain
> how to
> > deploy Mapreduce runner.
> >
> > Could anyone give me some help? Thank you!
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org <mailto:jbono...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Is Mapreduce runner available?

2018-03-08 Thread Jean-Baptiste Onofré
Hi,

Yes it has been merged on a feature branch:

https://github.com/apache/beam/pull/3705

https://github.com/apache/beam/tree/mr-runner

It's not yet merged on master (main upstream).

Regards
JB

On 03/08/2018 05:10 PM, Dummy Core wrote:
> Hi all,
> 
> In https://beam.apache.org/documentation/runners/capability-matrix/
> <https://beam.apache.org/documentation/runners/capability-matrix/>, it looks
> like Mapreduce runner is already supported for batch processing. But there's
> little helpful information
> in https://beam.apache.org/documentation/runners/mapreduce/
> <https://beam.apache.org/documentation/runners/mapreduce/> to explain how to
> deploy Mapreduce runner.
> 
> Could anyone give me some help? Thank you!

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: spark runner for Scala 2.10

2018-02-22 Thread Jean-Baptiste Onofré
Let me create a branch on my GitHub and share it with you. Let's continue the
discussion by direct message (to avoid flooding the mailing list).

Regards
JB

On 02/22/2018 06:51 PM, Gary Dusbabek wrote:
> Yes, that would be very helpful. If possible, I'd like to understand how it is
> constructed so that I can maintain it. A link to a git repo would be great.
> 
> I've spent some time trying to understand how the Beam project is 
> built/managed.
> It looks like the poms are intended primarily for developers and packaging,
> while the gradle components are intended for CI, etc.
> 
> Kind Regards,
> 
> Gary.
> 
> On Thu, Feb 22, 2018 at 5:45 PM, Jean-Baptiste Onofré <j...@nanthrax.net
> <mailto:j...@nanthrax.net>> wrote:
> 
> OK, do you want me to provide a Scala 2.10 build for you ?
> 
> Regards
> JB
> 
> On 02/22/2018 06:44 PM, Gary Dusbabek wrote:
> > Jean-Baptiste,
> >
> > Thanks for responding. I agree--it would be better to use Scala 2.11. 
> I'm in the
> > process of creating a Beam POC with an existing platform and upgrading
> > everything in that platform to Scala 2.11 as a prerequisite is out of 
> scope.
> >
> > It would be helpful to know if Beam in its current state is backward
> > incompatible with Scala 2.10 for reasons other than the dependencies.
> >
> > But if there is a way to make it work to enable a POC, I would 
> appreciate some
> > pointers, as it doesn't seem to be as simple as changing the "*_2.11" 
> references
> > in the poms.
> >
> > Cheers,
> >
> > Gary.
> >
> >
> >
> > On Thu, Feb 22, 2018 at 5:34 PM, Jean-Baptiste Onofré 
> <j...@nanthrax.net <mailto:j...@nanthrax.net>
> > <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote:
> >
> >     Hi Gary,
> >
> >     Beam 2.3.0 and the Spark runner use Scala 2.11.
> >
> >     I can help you to have a smooth transition by creating a local 
> branch
> using
> >     Scala 2.10. However, I strongly advise upgrading to 2.11, as some
> other parts of
> >     Beam (other runners and IOs) use 2.11 already.
> >
> >     Regards
> >     JB
> >
> >     On 02/22/2018 05:55 PM, Gary Dusbabek wrote:
> >     > Hi,
> >     >
> >     > My apologies if this belongs on the dev list. If it does, let me
> know and I'll
> >     > shoot things over that way...
> >     >
> >     > For the last day or so, I've been trying to create a Spark Runner 
> that
> >     will work
> >     > on older deployments using Scala 2.10. I've taken a few 
> approaches:
> >     >
> >     > 1. selectively changing a few dependencies in 
> beam-runners-spark.pom
> (and
> >     a few
> >     > other places in the parent)
> >     > 2. updating every dependency that references *_2.11 to be *_2.10
> >     >
> >     > In the former case the sticking point is that there 
> is
> a library
> >     > incompatibility with jackson-module-scala_2.xx. In the latter case
> there is a
> >     > problem with SourceRDD.SourcePartitioning not [correctly] 
> implementing
> >     > `equals(...)` from the parent trait.
> >     >
> >     > Posts on the mailing list made me think that the move to Scala 
> 2.11
> >     started only
> >     > last fall, so I figured it should be easy to make the switch back.
> >     >
> >     > However, I have a feeling that it could be the case that I just 
> don't
> >     understand
> >     > the Beam build system well enough to produce the right outcome (a 
> custom
> >     version
> >     > that can be used with older Scala).
> >     >
> >     > Is there a correct or better way of achieving this?
> >     >
> >     > Kind Regards,
> >     >
> >     > Gary Dusbabek
> >
> >     --
> >     Jean-Baptiste Onofré
> >     jbono...@apache.org <mailto:jbono...@apache.org>
> <mailto:jbono...@apache.org <mailto:jbono...@apache.org>>
> >     http://blog.nanthrax.net
> >     Talend - http://www.talend.com
> >
> >
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org <mailto:jbono...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: spark runner for Scala 2.10

2018-02-22 Thread Jean-Baptiste Onofré
OK, do you want me to provide a Scala 2.10 build for you ?

Regards
JB

On 02/22/2018 06:44 PM, Gary Dusbabek wrote:
> Jean-Baptiste,
> 
> Thanks for responding. I agree--it would be better to use Scala 2.11. I'm in 
> the
> process of creating a Beam POC with an existing platform and upgrading
> everything in that platform to Scala 2.11 as a prerequisite is out of scope.
> 
> It would be helpful to know if Beam in its current state is backward
> incompatible with Scala 2.10 for reasons other than the dependencies.
> 
> But if there is a way to make it work to enable a POC, I would appreciate some
> pointers, as it doesn't seem to be as simple as changing the "*_2.11" 
> references
> in the poms.
> 
> Cheers,
> 
> Gary.
> 
> 
> 
> On Thu, Feb 22, 2018 at 5:34 PM, Jean-Baptiste Onofré <j...@nanthrax.net
> <mailto:j...@nanthrax.net>> wrote:
> 
> Hi Gary,
> 
> Beam 2.3.0 and the Spark runner use Scala 2.11.
> 
> I can help you to have a smooth transition by creating a local branch 
> using
> Scala 2.10. However, I strongly advise upgrading to 2.11, as some other
> parts of
> Beam (other runners and IOs) use 2.11 already.
> 
> Regards
> JB
> 
> On 02/22/2018 05:55 PM, Gary Dusbabek wrote:
> > Hi,
> >
> > My apologies if this belongs on the dev list. If it does, let me know 
> and I'll
> > shoot things over that way...
> >
> > For the last day or so, I've been trying to create a Spark Runner that
> will work
> > on older deployments using Scala 2.10. I've taken a few approaches:
> >
> > 1. selectively changing a few dependencies in beam-runners-spark.pom 
> (and
> a few
> > other places in the parent)
> > 2. updating every dependency that references *_2.11 to be *_2.10
> >
> > In the former case the sticking point is that there is a 
> library
> > incompatibility with jackson-module-scala_2.xx. In the latter case 
> there is a
> > problem with SourceRDD.SourcePartitioning not [correctly] implementing
> > `equals(...)` from the parent trait.
> >
> > Posts on the mailing list made me think that the move to Scala 2.11
> started only
> > last fall, so I figured it should be easy to make the switch back.
> >
> > However, I have a feeling that it could be the case that I just don't
> understand
> > the Beam build system well enough to produce the right outcome (a custom
> version
> > that can be used with older Scala).
> >
>     > Is there a correct or better way of achieving this?
> >
> > Kind Regards,
> >
> > Gary Dusbabek
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org <mailto:jbono...@apache.org>
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: spark runner for Scala 2.10

2018-02-22 Thread Jean-Baptiste Onofré
Hi Gary,

Beam 2.3.0 and the Spark runner use Scala 2.11.

I can help you to have a smooth transition by creating a local branch using
Scala 2.10. However, I strongly advise upgrading to 2.11, as some other parts of
Beam (other runners and IOs) use 2.11 already.

Regards
JB

On 02/22/2018 05:55 PM, Gary Dusbabek wrote:
> Hi,
> 
> My apologies if this belongs on the dev list. If it does, let me know and I'll
> shoot things over that way...
> 
> For the last day or so, I've been trying to create a Spark Runner that will 
> work
> on older deployments using Scala 2.10. I've taken a few approaches:
> 
> 1. selectively changing a few dependencies in beam-runners-spark.pom (and a 
> few
> other places in the parent)
> 2. updating every dependency that references *_2.11 to be *_2.10
> 
> In the former case the sticking point is that there is a library
> incompatibility with jackson-module-scala_2.xx. In the latter case there is a
> problem with SourceRDD.SourcePartitioning not [correctly] implementing
> `equals(...)` from the parent trait.
> 
> Posts on the mailing list made me think that the move to Scala 2.11 started 
> only
> last fall, so I figured it should be easy to make the switch back.
> 
> However, I have a feeling that it could be the case that I just don't 
> understand
> the Beam build system well enough to produce the right outcome (a custom 
> version
> that can be used with older Scala).
> 
> Is there a correct or better way of achieving this?
> 
> Kind Regards,
> 
> Gary Dusbabek

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[ANN] Apache Beam 2.3.0 has been released!

2018-02-19 Thread Jean-Baptiste Onofré
The Apache Beam team is pleased to announce the release of version 2.3.0!

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes the following major new features & improvements:
- full Java 8 support
- Spark 2.x support in Spark runner
- Amazon WS S3 filesystem support
- General-purpose writing to files (FileIO)
- Splittable DoFn support in Python SDK
- Improvements on Portability layer
- Improvements on SDKs & runners
- Improvements on several IOs

You can take a look at the following blog post and Release Notes for details:

https://beam.apache.org/blog/2018/02/19/beam-2.3.0.html

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341608

Enjoy !!

--
JB on behalf of The Apache Beam team


Travel Assistance applications open for ApacheCon NA '18

2018-02-14 Thread Jean-Baptiste Onofré
Hi all,

I inform you that the travel assistance applications
(http://www.apache.org/travel/) are now open.

If you plan to attend ApacheCon, but can't for financial reasons, you can
apply to the program.

Lots of committers, PMC members and ASF members will be present, so it's a great
opportunity to discuss and meet all together.

I should be there and I'm looking forward to meeting some of you in person !

Regards
JB

-
The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon NA 2018 are now open!

We will be supporting ApacheCon NA Montreal, Canada on 24th - 29th September 
2018

TAC exists to help those that would like to attend ApacheCon events, but are
unable to do so for financial reasons. For more info on this year's applications
and qualifying criteria, please visit the TAC website at
<http://www.apache.org/travel/>. Applications
are now open and will close 1st May.

Important: Applications close on May 1st, 2018. Applicants have until the
closing date above to submit their applications (which should contain as much
supporting material as required to efficiently and accurately process their
request); this will enable TAC to announce successful awards shortly afterwards.
As usual, TAC expects to deal with a range of applications from a diverse range
of backgrounds. We therefore encourage (as always) anyone thinking about sending
in an application to do so ASAP. We look forward to greeting many of you in
Montreal.

Kind Regards, Gavin - (On behalf of the Travel Assistance Committee)
—




Re: Handling errors in IOs

2018-02-12 Thread Jean-Baptiste Onofré
Hi Motty,

yes, you can configure reconnect, timeout, etc. on the ConnectionFactory
(especially when you use ActiveMqPooledConnectionFactory).

For the split, I didn't mean regarding the Spark cluster but more in terms of
workers.
When you use something like:

pipeline.apply(JmsIO.read().withQueue("foo"))

the runner can send a desired number of splits
(https://github.com/apache/beam/blob/master/sdks/java/io/jms/src/main/java/org/apache/beam/sdk/io/jms/JmsIO.java#L407).
Then the IO will create a consumer per split on the queue.

It scales in terms of consuming speed, as we have concurrent consumers on the
queue.

Regards
JB

On 02/12/2018 07:32 PM, Motty Gruda wrote:
> Hi,
> 
> I managed to set the automatic reconnect through the ConnectionFactory; I 
> didn't
> know it was possible. Thanks!
> 
> What do you mean by using "split"? Now when running on the spark runner, if 
> one
> of the brokers becomes unavailable the entire pipeline is stuck on the 
> following
> job:
> 
> org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:115)
> org.apache.beam.runners.spark.io.SparkUnboundedSource$ReadReportDStream.<init>(SparkUnboundedSource.java:171)
> org.apache.beam.runners.spark.io.SparkUnboundedSource.read(SparkUnboundedSource.java:113)
> org.apache.beam.runners.spark.translation.streaming.StreamingTransformTranslator$2.evaluate(StreamingTransformTranslator.java:125)
> org.apache.beam.runners.spark.translation.streaming.StreamingTransformTranslator$2.evaluate(StreamingTransformTranslator.java:119)
> org.apache.beam.runners.spark.SparkRunner$Evaluator.doVisitTransform(SparkRunner.java:413)
> org.apache.beam.runners.spark.SparkRunner$Evaluator.visitPrimitiveTransform(SparkRunner.java:399)
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:663)
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:655)
> org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
> org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
> org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:446)
> org.apache.beam.runners.spark.translation.streaming.SparkRunnerStreamingContextFactory.call(SparkRunnerStreamingContextFactory.java:88)
> org.apache.beam.runners.spark.translation.streaming.SparkRunnerStreamingContextFactory.call(SparkRunnerStreamingContextFactory.java:47)
> org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$10.apply(JavaStreamingContext.scala:776)
> org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$10.apply(JavaStreamingContext.scala:775)
> scala.Option.getOrElse(Option.scala:120)
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:864)
> org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:775)
> 
> 
> 
> The updated code:
> Pipeline p =
> Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
> 
> ConnectionFactory factory_a = new
> ActiveMQConnectionFactory("failover://(tcp://activemq-a:61616)?initialReconnectDelay=2000&maxReconnectAttempts=-1");
> ConnectionFactory factory_b = new
> ActiveMQConnectionFactory("failover://(tcp://activemq-b:61616)?initialReconnectDelay=2000&maxReconnectAttempts=-1");
> ConnectionFactory factory_c = new
> ActiveMQConnectionFactory("tcp://activemq-c:61616");
> 
> PCollection a = p.apply("ReadFromQueue",
> JmsIO.read().withConnectionFactory(factory_a).withUsername("admin").withPassword("admin")
> .withQueue("a"));
> PCollection b = p.apply("ReadFromQueue2",
> JmsIO.read().withConnectionFactory(factory_b).withUsername("admin").withPassword("admin")
> .withQueue("b"));
> 
> PCollection combined =
> PCollectionList.of(a).and(b).apply(Flatten.pCollections());
> 
> combined.apply(MapElements.into(TypeDescriptors.strings()).via((JmsRecord r) 
> ->
> r.getPayload()))
> .apply("WriteToQueue",
> JmsIO.write().withConnectionFactory(factory_c).withUsername("admin")
> .withPassword("admin").withQueue("c"));
> 
> p.run().waitUntilFinish();
> 
> Trying to run the code above on spark 2.2.0 and beam 2.3.0-rc3, most of the
> messages simply disappeared inside the system!!!
> The messages were read from queues "a" and "b" but most of them didn't arrive
> at queue "c".
> 
> On Mon, Feb 12, 2018 at 5:30 AM, Jean-Baptiste Onofré <j...@nanthrax.net
> <mailto:j...@nanthrax.net>> wrote:
> 
> Hi,
> 
> here you don't use split, but differ

Re: Handling errors in IOs

2018-02-11 Thread Jean-Baptiste Onofré
68)
... 148 more
Caused by: java.lang.NullPointerException
at 
org.apache.beam.sdk.io.jms.JmsIO$UnboundedJmsReader.advance(JmsIO.java:436)
... 151 more
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1661_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1657_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1653_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1649_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1645_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1641_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1637_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1633_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1629_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1625_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1621_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1617_1 not found, computing 
it
18/02/11 11:39:59 INFO CacheManager: Partition rdd_1613_1 not found, computing 
it



Trying to run the same code on spark 2.2.0 and beam 2.3.0-rc3, only half
of the messages sent ended up in queue c!




On Sun, Feb 11, 2018 at 8:56 AM, Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Hi Motty,

For JMS, it depends on whether you are using queues or topics.

Using queues, JmsIO creates several readers (concurrent consumers) on
the same
queue. The checkpoint used is based on the ACK (it's a client ACK,
and the
source sends the ACK when the checkpoint is finalized). If you close
a connection
for one source, the other sources should continue to consume.

Can you explain exactly your scenario (runner, pipeline, broker) ?

Regards
JB

On 02/11/2018 07:43 AM, Motty Gruda wrote:
 > Hey,
 >
 > How can errors in the IOs be treated (for example connection
errors)?
 >
 > I've tested a few scenarios with the JmsIO. When I read from two
different JMS
 > connections and I closed only one of them, the entire pipeline
failed/froze.
 > I would expect it to continue running with one source and try to
reconnect to
 > the second source until it's available again.
 >
 > Is this a bug in the IO itself? In the SDK? In the runner (I've
tested with the
 > direct runner and the spark runner)?

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com




--
  מוטי גרודה 
           Motty Gruda


Re: Handling errors in IOs

2018-02-10 Thread Jean-Baptiste Onofré
Hi Motty,

For JMS, it depends on whether you are using queues or topics.

Using queues, JmsIO creates several readers (concurrent consumers) on the same
queue. The checkpoint used is based on the ACK (it's a client ACK, and the
source sends the ACK when the checkpoint is finalized). If you close a connection
for one source, the other sources should continue to consume.

Can you explain exactly your scenario (runner, pipeline, broker) ?

Regards
JB

On 02/11/2018 07:43 AM, Motty Gruda wrote:
> Hey,
> 
> How can errors in the IOs be treated (for example connection errors)?
> 
> I've tested a few scenarios with the JmsIO. When I read from two different JMS
> connections and I closed only one of them, the entire pipeline failed/froze.
> I would expect it to continue running with one source and try to reconnect to
> the second source until it's available again.
> 
> Is this a bug in the IO itself? In the SDK? In the runner (I've tested with 
> the
> direct runner and the spark runner)?

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
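
A minimal sketch of the client-side reconnect configuration discussed in
this thread, using ActiveMQ's failover transport with JmsIO (the broker
address and queue name are illustrative; maxReconnectAttempts=-1 means
retry forever):

import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jms.JmsIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class JmsFailoverExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The failover: prefix makes the JMS client itself retry the broker
    // connection instead of surfacing a connection error to the source.
    ConnectionFactory factory = new ActiveMQConnectionFactory(
        "failover:(tcp://activemq-host:61616)"
            + "?initialReconnectDelay=2000&maxReconnectAttempts=-1");

    p.apply("ReadFromQueue",
        JmsIO.read()
            .withConnectionFactory(factory)
            .withQueue("events"));

    p.run().waitUntilFinish();
  }
}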


Re: Deprecate and remove support for Kafka 0.9.x and older version

2018-02-06 Thread Jean-Baptiste Onofré
+1 to flag as deprecated, but I would wait a bit before simply removing it.

Regards
JB

On 02/03/2018 01:12 AM, Raghu Angadi wrote:
> Is anyone using Apache Beam with Kafka 0.9.x and older?
> 
> I am thinking of deprecating 0.9.x and 0.10.0 in Beam 2.4 and remove support 
> in
> 2.5 or later. 0.10.1 and up will be supported. 0.10.1+ includes much better
> timestamp support. 
> 
> By deprecation I mean KafkaIO would continue to work with an older version at
> runtime, but would not build with it (e.g. `mvn -Dkafka.clients.version=0.9.1`
> fails).  We can print a deprecation warning at runtime. 
> 
> [1]: 
> https://github.com/apache/kafka/commit/23c69d62a0cabf06c4db8b338f0ca824dc6d81a7

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Slack invite please

2018-02-06 Thread Jean-Baptiste Onofré
Done, you should have received an invite.

Welcome aboard !

Regards
JB

On 02/06/2018 02:51 PM, Mark Theunissen wrote:
> Hi there, could I please get an invite to the Slack channel?
> 
> Thanks
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: TextIO operation not supported by Flink for Beam master

2018-01-31 Thread Jean-Baptiste Onofré
Hi Thomas,

Weird, as the ValidatesRunner tests worked and we use TextIO there.

Let me try to reproduce.

Thanks for the report !

Regards
JB

On 02/01/2018 02:36 AM, Thomas Pelletier wrote:
> Hi,
> 
> I'm trying to run a pipeline containing just a TextIO.read() step on a Flink
> cluster, using the latest Beam git revision (ff37337
> <https://github.com/apache/beam/commit/ff37337d85aa5af23418f3be4611b913395ccc88>).
> The job fails to start with the Exception:
> 
>   java.lang.UnsupportedOperationException: The transform  is currently not
> supported.
> 
> It does work with Beam 2.2.0 though. All code, logs, and reproduction steps on
> this repository <https://github.com/pelletier/beam-flink-example>.
> 
> Any idea what might be going on?
> 
> Thanks,
> Thomas

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Slack Request

2018-01-30 Thread Jean-Baptiste Onofré
Invite sent.

Welcome aboard !

Regards
JB

On 01/31/2018 01:56 AM, Luke Zhu wrote:
> Hello,
> 
> Could I get an invite to the Slack group?
> 
> Thanks!

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] State of the project: Feature roadmap for 2018

2018-01-22 Thread Jean-Baptiste Onofré
Hi Ben,

about the "technical roadmap", we have a thread about "Beam 3.x roadmap".

It already provides ideas for points 3 & 4.

Regards
JB

On 01/22/2018 09:15 PM, Ben Chambers wrote:
> Thanks Davor for starting the state of the project discussions [1].
> 
> 
> In this fork of the state of the project discussion, I’d like to start the
> discussion of the feature roadmap for 2018 (and beyond).
> 
> 
> To kick off the discussion, I think the features could be divided into several
> areas, as follows:
> 
>  1. Enabling Contributions: How do we make it easier to add new features to the
>     supported runners? Can we provide a common intermediate layer below the
>     existing functionality that features are translated to, so that runners only
>     need to support the intermediate layer and new features only need to target
>     it? What other ways can we make it easier to contribute to the development
>     of Beam?
> 
>  2. Realizing Portability: What gaps are there in the promise of portability?
>     For example, in [1] we discussed the fact that users must write per-runner
>     code to push system metrics from runners to their monitoring platform. This
>     limits their ability to actually change runners. Credential management for
>     different environments also falls into this category.
> 
>  3. Large Features: What major features (like Beam SQL, Beam Python, etc.) would
>     increase the Beam user base in 2018?
> 
>  4. Improvements: What small changes could make Beam more appealing to users?
>     Are there API improvements we could make or common mistakes we could detect
>     and/or prevent?
> 
> 
> Thanks in advance for participating in the discussion. I believe that 2018 
> could
> be a great year for Beam, providing easier, more complete runner portability 
> and
> features that make Beam easier to use for everyone.
> 
> 
> Ben
> 
> 
> [1]
> https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
> 
> [2]
> https://lists.apache.org/thread.html/01a80d62f2df6b84bfa41f05e15fda900178f882877c294fed8be91e@%3Cdev.beam.apache.org%3E

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Strata Conference this March 6-8

2018-01-18 Thread Jean-Baptiste Onofré
I think Matthias already has some plans for a London meetup later this year (that's
what he told me).


Stay tuned !

Regards
JB

On 01/18/2018 06:29 PM, Ismaël Mejía wrote:

My apologies, I somehow misread the dates and thought you referred to the
London conference, but in the end this becomes two good ideas :)

- A meetup in London for the week of May 21
- A meetup in San Jose if someone can organize it for the March dates.


On Thu, Jan 18, 2018 at 12:00 AM, Holden Karau <hol...@pigscanfly.ca> wrote:

So doing a streaming BoF join in would probably require meeting somewhere
other than a coffee shop so as not to be jerks in the coffee shop.

On Wed, Jan 17, 2018 at 2:53 PM, Matthias Baetens
<matthias.baet...@datatonic.com> wrote:


Sure, I'd be very happy to organise something. This is about Strata San
Jose though right? Maybe we can organise a remote session in which we can
join (depending on when you would organise the BoF) or have a channel set-up
if the talks would be broadcasted?

Also: will there be any Beam talks on Strata London or is this not known
yet? Keen to get involved and set things up around that date as well.

On Wed, Jan 17, 2018 at 8:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:


That's a great idea ! I'm sure that Matthias (organizer of the Beam
London Meetup) can help us to plan something.

Regards
JB


On 01/17/2018 08:57 AM, Ismaël Mejía wrote:


Maybe a good idea to try to organize a Beam meetup in london in the
same dates in case some of the people around can jump in and talk too.

On Wed, Jan 17, 2018 at 2:51 AM, Ron Gonzalez <zlgonza...@yahoo.com>
wrote:


Works for me...

On Tuesday, January 16, 2018, 5:45:33 PM PST, Holden Karau
<hol...@pigscanfly.ca> wrote:


How would folks feel about during the afternoon break (3:20-4:20) on
the
Wednesday (same day as Eugene's talk)? We could do the Philz which is a
bit
of a walk but gets us away from the big crowd and also lets folks not
attending the conference but in the area join us.

On Tue, Jan 16, 2018 at 5:29 PM, Ron Gonzalez <zlgonza...@yahoo.com>
wrote:

Cool, let me know if you guys finally schedule it. I will definitely
try to
make it to Eugene's talk but having an informal BoF in the area would
be
nice...

Thanks,
Ron

On Tuesday, January 16, 2018, 5:06:53 PM PST, Boris Lublinsky
<boris.lublin...@lightbend.com > wrote:


All for it

Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

On Jan 16, 2018, at 7:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:

+1 to BoF

On Tue, Jan 16, 2018 at 5:00 PM, Dmitry Demeshchuk
<dmi...@postmates.com>
wrote:

Probably won't be attending the conference, but totally down for a BoF.

On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau <hol...@pigscanfly.ca>
wrote:

Do interested folks have any timing constraints around a BoF?

On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson
<je...@bigdatainstitute.io>
wrote:

+1 to BoF. I don't know if any Beam talks will be on the schedule.


We could do an informal BoF at the Philz nearby or similar?






--
Twitter: https://twitter.com/holdenkarau




--
Best regards,
Dmitry Demeshchuk.






--
Twitter: https://twitter.com/holdenkarau



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--


Matthias Baetens


datatonic | data power unleashed

office +44 203 668 3680  |  mobile +44 74 918 20646

Level24 | 1 Canada Square | Canary Wharf | E14 5AB London


We've been announced as one of the top global Google Cloud Machine
Learning partners.





--
Twitter: https://twitter.com/holdenkarau


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Strata Conference this March 6-8

2018-01-17 Thread Jean-Baptiste Onofré
That's a great idea ! I'm sure that Matthias (organizer of the Beam London 
Meetup) can help us to plan something.


Regards
JB

On 01/17/2018 08:57 AM, Ismaël Mejía wrote:

Maybe a good idea to try to organize a Beam meetup in london in the
same dates in case some of the people around can jump in and talk too.

On Wed, Jan 17, 2018 at 2:51 AM, Ron Gonzalez <zlgonza...@yahoo.com> wrote:

Works for me...

On Tuesday, January 16, 2018, 5:45:33 PM PST, Holden Karau
<hol...@pigscanfly.ca> wrote:


How would folks feel about during the afternoon break (3:20-4:20) on the
Wednesday (same day as Eugene's talk)? We could do the Philz which is a bit
of a walk but gets us away from the big crowd and also lets folks not
attending the conference but in the area join us.

On Tue, Jan 16, 2018 at 5:29 PM, Ron Gonzalez <zlgonza...@yahoo.com> wrote:

Cool, let me know if you guys finally schedule it. I will definitely try to
make it to Eugene's talk but having an informal BoF in the area would be
nice...

Thanks,
Ron

On Tuesday, January 16, 2018, 5:06:53 PM PST, Boris Lublinsky
<boris.lublin...@lightbend.com > wrote:


All for it

Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

On Jan 16, 2018, at 7:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:

+1 to BoF

On Tue, Jan 16, 2018 at 5:00 PM, Dmitry Demeshchuk <dmi...@postmates.com>
wrote:

Probably won't be attending the conference, but totally down for a BoF.

On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

Do interested folks have any timing constraints around a BoF?

On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson <je...@bigdatainstitute.io>
wrote:

+1 to BoF. I don't know if any Beam talks will be on the schedule.


We could do an informal BoF at the Philz nearby or similar?





--
Twitter: https://twitter.com/holdenkarau




--
Best regards,
Dmitry Demeshchuk.






--
Twitter: https://twitter.com/holdenkarau


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Question on Beam and Talend

2018-01-16 Thread Jean-Baptiste Onofré

Hi Ron,

Actually, you are right, it's not the perfect mailing list for this question ;)

Don't hesitate to ping me directly for more details.

Thanks
Regards
JB

On 01/16/2018 08:59 AM, zlgonzalez wrote:

Hi,
   Not sure if this is the right place, but what Talend open source product 
supports beam with no licensing fees?
   I've been going through the documentation and searching and all I find are 
news items and blogs that it's already available in Talend.


Thanks,
Ron



Sent via the Samsung Galaxy S7 active, an AT&T 4G LTE smartphone


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Slack channel

2018-01-15 Thread Jean-Baptiste Onofré

Hi,

you should have received the invite.

Regards
JB

On 01/16/2018 06:52 AM, Ron's Yahoo! wrote:

Hi,
   I’m trying to get access to the Apache Beam Slack channel, but it’s asking 
me for an apache.org, google.com or radicalbit.io email address.
   Can you tell me how I can sign up for the channel if I don’t have one of 
these email addresses?

Thanks,
Ron



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Regarding Beam Slack Channel

2018-01-08 Thread Jean-Baptiste Onofré

Done, you should have received the invite.

Welcome !

Regards
JB

On 01/09/2018 06:31 AM, Shashank Prabhakara wrote:

I'd also like to be added, please.

Thanks.

On 2018-01-04 11:58, Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:

 > Hi,>
 >
 > you should have received the invite.>
 >
 > Welcome aboard !>
 >
 > Regards>
 > JB>
 >
 > On 01/04/2018 01:38 AM, David Sabater Dinter wrote:>
 > > Hi,>
 > > Can I join also, please?>
 > > >
 > > Thanks!>
 > > >
 > > On Thu, Jan 4, 2018 at 12:54 AM Maria Tjahjadi 
<maria.tjahj...@tokopedia.com <mailto:maria.tjahj...@tokopedia.com> >

 > > <ma...@tokopedia.com <mailto:ma...@tokopedia.com>>> wrote:>
 > > >
 > >     Hi,>
 > > >
 > >     Could I join also?>
 > > >
 > >     Thanks,>
 > >     Maria Tjahjadi>
 > > >
 > >     On 4 Jan 2018, at 06.15, Lukasz Cwik <lc...@google.com 
<mailto:lc...@google.com>>

 > >     <ma...@google.com <mailto:ma...@google.com>>> wrote:>
 > > >
 > >>     Invite sent, welcome.>
 > >>>
 > >>     On Wed, Jan 3, 2018 at 3:01 PM, Carlos Alonso <car...@mrcalonso.com 
<mailto:car...@mrcalonso.com>>

 > >>     <ma...@mrcalonso.com <mailto:ma...@mrcalonso.com>>> wrote:>
 > >>>
 > >>         Hi, I'd like to be added too, please!>
 > >>>
 > >>         Thanks!>
 > >>>
 > >>         On Tue, Dec 19, 2017 at 3:41 PM Jean-Baptiste Onofré 
<j...@nanthrax.net <mailto:j...@nanthrax.net>>

 > >>         <ma...@nanthrax.net <mailto:ma...@nanthrax.net>>> wrote:>
 > >>>
 > >>             Done,>
 > >>>
 > >>             you should have received an invite.>
 > >>>
 > >>             Regards>
 > >>             JB>
 > >>>
 > >>             On 12/19/2017 03:20 PM, Unais T wrote:>
 > >>             > Hello>
 > >>             >>
 > >>             > Can someone please add me to the Beam slack channel?>
 > >>             >>
 > >>             > Thanks.>
 > >>             >>
 > >>             >>
 > >>>
 > >>             -->
 > >>             Jean-Baptiste Onofré>
 > >> jbono...@apache.org <mailto:jbono...@apache.org> <ma...@apache.org 
<mailto:ma...@apache.org>>>

 > >> http://blog.nanthrax.net>
 > >>             Talend - http://www.talend.com>
 > >>>
 > >>>
 >
 > -- >
 > Jean-Baptiste Onofré>
 > jbono...@apache.org <mailto:jbono...@apache.org>>
 > http://blog.nanthrax.net>
 > Talend - http://www.talend.com>
 >


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [POLL] Dropping support for Java 7 in Beam 2.3

2018-01-07 Thread Jean-Baptiste Onofré

Agree, good one !

Thanks !
Regards
JB

On 01/08/2018 07:28 AM, Eugene Kirpichov wrote:
1 month has passed since the vote, with <5% votes on Twitter against the switch, 
and votes on this thread only in favor of the switch.


The vote is hence concluded. We can proceed with switching Beam to Java 8. Beam 
2.3 will be Java8-only. Woohoo!


On Thu, Jan 4, 2018 at 4:55 PM Ted Yu <yuzhih...@gmail.com 
<mailto:yuzhih...@gmail.com>> wrote:


+1 on dropping Java 7 support.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scio 0.4.7 released

2018-01-04 Thread Jean-Baptiste Onofré

Hi

I remember when working with Scio, there were some specific IOs in Scio not using 
the ones provided by Beam.


I didn't check recently, maybe it has changed.

Regards
JB

On 01/04/2018 09:51 PM, Neville Li wrote:
Not sure what you mean. We use Beam file IOs whenever possible which should 
already use the FileSystems API.


On Thu, Jan 4, 2018 at 3:48 PM Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Hi Neville,

Great! Thanks for the update.

By the way, any plan to leverage the Beam filesystems in Scio now ?

Regards
JB

On 01/04/2018 09:41 PM, Neville Li wrote:
 > Hi all,
 >
 > We just released Scio 0.4.7. This release fixed a join performance
regression and
 > introduced several improvements.
 >
 > https://github.com/spotify/scio/releases/tag/v0.4.7
 >
 > /"Hydrochoerus hydrochaeris"/
 >
 >
 >       Features
 >
 >   * Add support for TFRecordSpec #990
<https://github.com/spotify/scio/pull/990>
 >   * Add convenience methods to randomSplit #987
 >     <https://github.com/spotify/scio/pull/987>
 >   * Add BigtableDoFn #922 <https://github.com/spotify/scio/issues/922> 
#931
 >     <https://github.com/spotify/scio/pull/931>
 >   * Add optional arguments validation #979
 >     <https://github.com/spotify/scio/pull/979>
 >   * Performance improvement in Avro and BigQuery macro converters #989
 >     <https://github.com/spotify/scio/issues/989>
 >   * Update new bigquery datetime and timestamp specification #982
 >     <https://github.com/spotify/scio/pull/982>
 >
 >
 >       Bug fixes
 >
 >   * Fix join performance regression introduced in 0.4.4 #976
 >     <https://github.com/spotify/scio/issues/976> #983
 >     <https://github.com/spotify/scio/pull/983>
 >   * Fine tune dynamic sink API and integration tests #993
 >     <https://github.com/spotify/scio/issues/993> #995
 >     <https://github.com/spotify/scio/pull/995>
 >

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scio 0.4.7 released

2018-01-04 Thread Jean-Baptiste Onofré

Hi Neville,

Great! Thanks for the update.

By the way, any plan to leverage the Beam filesystems in Scio now ?

Regards
JB

On 01/04/2018 09:41 PM, Neville Li wrote:

Hi all,

We just released Scio 0.4.7. This release fixed a join performance regression and 
introduced several improvements.


https://github.com/spotify/scio/releases/tag/v0.4.7

/"Hydrochoerus hydrochaeris"/


  Features

  * Add support for TFRecordSpec #990 <https://github.com/spotify/scio/pull/990>
  * Add convenience methods to randomSplit #987
<https://github.com/spotify/scio/pull/987>
  * Add BigtableDoFn #922 <https://github.com/spotify/scio/issues/922> #931
<https://github.com/spotify/scio/pull/931>
  * Add optional arguments validation #979
<https://github.com/spotify/scio/pull/979>
  * Performance improvement in Avro and BigQuery macro converters #989
<https://github.com/spotify/scio/issues/989>
  * Update new bigquery datetime and timestamp specification #982
<https://github.com/spotify/scio/pull/982>


  Bug fixes

  * Fix join performance regression introduced in 0.4.4 #976
<https://github.com/spotify/scio/issues/976> #983
<https://github.com/spotify/scio/pull/983>
  * Fine tune dynamic sink API and integration tests #993
<https://github.com/spotify/scio/issues/993> #995
    <https://github.com/spotify/scio/pull/995>



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Regarding Beam Slack Channel

2018-01-03 Thread Jean-Baptiste Onofré

Hi,

you should have received the invite.

Welcome aboard !

Regards
JB

On 01/04/2018 12:54 AM, Maria Tjahjadi wrote:

Hi,

Could I join also?

Thanks,
Maria Tjahjadi

On 4 Jan 2018, at 06.15, Lukasz Cwik <lc...@google.com 
<mailto:lc...@google.com>> wrote:



Invite sent, welcome.

On Wed, Jan 3, 2018 at 3:01 PM, Carlos Alonso <car...@mrcalonso.com 
<mailto:car...@mrcalonso.com>> wrote:


Hi, I'd like to be added too, please!

Thanks!

On Tue, Dec 19, 2017 at 3:41 PM Jean-Baptiste Onofré <j...@nanthrax.net
<mailto:j...@nanthrax.net>> wrote:

Done,

you should have received an invite.

Regards
JB

On 12/19/2017 03:20 PM, Unais T wrote:
> Hello
>
> Can someone please add me to the Beam slack channel?
>
> Thanks.
    >
>

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
    Talend - http://www.talend.com




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Regarding Beam Slack Channel

2017-12-19 Thread Jean-Baptiste Onofré

Done,

you should have received an invite.

Regards
JB

On 12/19/2017 03:20 PM, Unais T wrote:

Hello

Can someone please add me to the Beam slack channel?

Thanks.




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [INFO] Spark runner updated to Spark 2.2.1

2017-12-18 Thread Jean-Baptiste Onofré
By the way, Flink has been updated to Flink 1.4.0 as well (as the artifacts 
already used Scala 2.11).


Regards
JB

On 12/18/2017 11:50 AM, Jean-Baptiste Onofré wrote:

Hi all,

We are pleased to announce that Spark 2.x support in Spark runner has been 
merged this morning. It supports Spark 2.2.1.


In the same PR, we did update to Scala 2.11, including Flink artifacts update to 
2.11 (it means it's already ready to upgrade to Flink 1.4 !).


It also means, as planned, that Spark 2.x support will be included in next Beam 
2.3.0 release.


Now, we are going to work on improvements in the Spark runner.

If you have any issue with the Spark runner, please let us know.

Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[INFO] Spark runner updated to Spark 2.2.1

2017-12-18 Thread Jean-Baptiste Onofré

Hi all,

We are pleased to announce that Spark 2.x support in Spark runner has been 
merged this morning. It supports Spark 2.2.1.


In the same PR, we did update to Scala 2.11, including Flink artifacts update to 
2.11 (it means it's already ready to upgrade to Flink 1.4 !).


It also means, as planned, that Spark 2.x support will be included in next Beam 
2.3.0 release.


Now, we are going to work on improvements in the Spark runner.

If you have any issue with the Spark runner, please let us know.

Thanks !
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [POLL] Dropping support for Java 7 in Beam 2.3

2017-12-11 Thread Jean-Baptiste Onofré
 blocker or hindrance to adopting the
new release for me [e.g. I expect that I'll have strong reasons to
update to Beam 2.3, but I expect that it will be difficult because
of lack of Java 7 support]



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré

My apologies, I thought we had a consensus already.

Regards
JB

On 12/04/2017 11:22 PM, Eugene Kirpichov wrote:
Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot of 
exciting things indeed.


Regarding Java 8: I thought our consensus was to have the release notes say that 
we're *considering* going Java8-only, and use that to get more opinions from the 
user community - but I can't find the emails that made me think so.


+Ismaël Mejía <mailto:ieme...@gmail.com> - do you think we should formally 
conclude the vote on the thread [VOTE] [DISCUSSION] Remove support for Java 7?
Or should we take more steps - e.g. perhaps tweet a link to that thread from the 
Beam twitter account, ask people to chime in, and wait for say 2 weeks before 
declaring a conclusion?


Let's also have a process JIRA for going Java8. I've filed one: 
https://issues.apache.org/jira/browse/BEAM-3285


On Mon, Dec 4, 2017 at 1:58 AM Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Just an important note that we forgot to mention.

!! The 2.2.0 release will be the last one supporting Spark 1.x and Java 7 !!

Starting from Beam 2.3.0, the Spark runner will work only with Spark 2.x 
and we
will focus only on Java 8.

Regards
JB

On 12/04/2017 10:15 AM, Jean-Baptiste Onofré wrote:
 > Thanks Reuven !
 >
 > I would like to emphasize on some highlights in 2.2.0 release:
 >
 > - New IOs have been introduced:
 >   * TikaIO leveraging Apache Tika, making it possible to deal with a lot of 
different
 > data formats
 >   * RedisIO to read and write key/value pairs from a Redis server. This
IO will
 > be soon extended to Redis PubSub.
 >   * FileIO provides transforms for working with files (raw). Especially, 
it
 > provides matching file patterns and read on patterns. It can be easily
extended
 > for a specific format (like we do in AvroIO or TextIO now).
 >   * SolrIO to interact with Apache Solr (Lucene)
 >
 > - On the other hand, improvements have been performed on existing IOs:
 >   * We started to introduce the readAll pattern in IOs (AvroIO, TextIO, 
JdbcIO,
 > ...), allowing "request" arguments to be passed via an input PCollection.
 >   * ElasticsearchIO has an improved support of different Elasticsearch
version
 > (including Elasticsearch 5.x). It also now supports SSL/TLS.
 >   * HBaseIO is now able to do dynamic work rebalancing
 >   * KinesisIO uses a more accurate watermark (based on
approximateArrivalTimestamp)
 >   * TextIO now supports custom delimiter and like AvroIO, supports the
readAll
 > pattern,
 >   * Performance improvements on JdbcIO when it has to read lots of rows
 >   * Kafka write supports the Exactly-Once pattern (introduced in Kafka 0.11.x)
 >
 > - A new DSL has been introduced: the SQL DSL !
 >
 > We are now focused on the 2.3.0 release with new improvements and features !
 >
 > Stay tuned !
 >
 > JB on behalf of the Apache Beam community.
 >
 > On 12/02/2017 11:40 PM, Reuven Lax wrote:
 >> The Apache Beam community is pleased to announce the availability of the
 >> 2.2.0 release.
 >>
 >> This release adds support for generic file sources and sinks (beyond 
TextIO
 >> and AvroIO) using FileIO, including support for dynamic filenames using
 >> readAll; this allows streaming pipelines to now read from files by
 >> continuously monitoring a directory for new files. Many other IOs are
improved,
 >> notably including exactly-once support for the Kafka sink. Initial
support for
 >> BEAM-SQL is also included in this release. For a more-complete list of 
major
 >> changes in the release, please refer to the release notes [2].
 >>
 >> The 2.2.0 release is now the recommended version; we encourage everyone 
to
 >> upgrade from any earlier releases.
 >>
 >> We’d like to invite everyone to try out Apache Beam today and consider
 >> joining our vibrant community. We welcome feedback, contribution and
 >> participation through our mailing lists, issue tracker, pull requests, 
and
 >> events.
 >>
 >> - Reuven Lax, on behalf of the Apache Beam community.
 >>
 >> [1] https://beam.apache.org/get-started/downloads/
 >> [2]
 >>
    
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341044
 >>
 >

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
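
For reference, the exactly-once Kafka sink mentioned in the highlights is
exposed through KafkaIO's withEOS option; a minimal sketch (bootstrap
servers, topic, shard count and sink group id are illustrative, and the
brokers must run Kafka 0.11+):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceKafkaSink {
  // 'records' is an existing keyed PCollection produced upstream.
  public static void writeExactlyOnce(PCollection<KV<String, String>> records) {
    records.apply("WriteToKafka",
        KafkaIO.<String, String>write()
            .withBootstrapServers("kafka-host:9092")
            .withTopic("output-topic")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class)
            // Shards the writes and uses Kafka transactions so that each
            // record is committed exactly once, even across retries.
            .withEOS(5, "beam-eos-sink-group"));
  }
}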


Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré

Just an important note that we forgot to mention.

!! The 2.2.0 release will be the last one supporting Spark 1.x and Java 7 !!

Starting from Beam 2.3.0, the Spark runner will work only with Spark 2.x and we 
will focus only on Java 8.


Regards
JB

On 12/04/2017 10:15 AM, Jean-Baptiste Onofré wrote:

Thanks Reuven !

I would like to emphasize on some highlights in 2.2.0 release:

- New IOs have been introduced:
  * TikaIO leveraging Apache Tika, making it possible to deal with a lot of different 
data formats
  * RedisIO to read and write key/value pairs from a Redis server. This IO will 
be soon extended to Redis PubSub.
  * FileIO provides transforms for working with files (raw). Especially, it 
provides matching file patterns and read on patterns. It can be easily extended 
for a specific format (like we do in AvroIO or TextIO now).

  * SolrIO to interact with Apache Solr (Lucene)

- On the other hand, improvements have been performed on existing IOs:
  * We started to introduce the readAll pattern in IOs (AvroIO, TextIO, JdbcIO, 
...), allowing to pass "request" arguments via an input PCollection.
  * ElasticsearchIO has improved support of different Elasticsearch versions 
(including Elasticsearch 5.x). It also now supports SSL/TLS.

  * HBaseIO is now able to do dynamic work rebalancing
  * KinesisIO uses a more accurate watermark (based on 
approximateArrivalTimestamp)
  * TextIO now supports custom delimiters and, like AvroIO, supports the readAll 
pattern,

  * Performance improvements on JdbcIO when it has to read lots of rows
  * Kafka write supports the Exactly-Once pattern (introduced in Kafka 0.11.x)

- A new DSL has been introduced: the SQL DSL !

We are now focused on the 2.3.0 release with new improvements and features !

Stay tuned !

JB on behalf of the Apache Beam community.

On 12/02/2017 11:40 PM, Reuven Lax wrote:

The Apache Beam community is pleased to announce the availability of the
2.2.0 release.

This release adds support for generic file sources and sinks (beyond TextIO 
and AvroIO) using FileIO, including support for dynamic filenames using 
readAll; this allows streaming pipelines to now read from files by 
continuously monitoring a directory for new files. Many other IOs are improved, 
notably including exactly-once support for the Kafka sink. Initial support for 
BEAM-SQL is also included in this release. For a more-complete list of major 
changes in the release, please refer to the release notes [2].


The 2.2.0 release is now the recommended version; we encourage everyone to
upgrade from any earlier releases.

We’d like to invite everyone to try out Apache Beam today and consider
joining our vibrant community. We welcome feedback, contribution and
participation through our mailing lists, issue tracker, pull requests, and
events.

- Reuven Lax, on behalf of the Apache Beam community.

[1] https://beam.apache.org/get-started/downloads/
[2]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341044 





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré

Thanks Reuven !

I would like to emphasize on some highlights in 2.2.0 release:

- New IOs have been introduced:
 * TikaIO leveraging Apache Tika, allowing to deal with a lot of different 
data formats
 * RedisIO to read and write key/value pairs from a Redis server. This IO will 
soon be extended to Redis PubSub.
 * FileIO provides transforms for working with files (raw). In particular, it 
provides file pattern matching and reading on patterns. It can be easily extended 
for a specific format (like we do in AvroIO or TextIO now).

 * SolrIO to interact with Apache Solr (Lucene)

- On the other hand, improvements have been performed on existing IOs:
 * We started to introduce the readAll pattern in IOs (AvroIO, TextIO, JdbcIO, 
...), allowing to pass "request" arguments via an input PCollection (see the 
sketch after this list).
 * ElasticsearchIO has improved support of different Elasticsearch versions 
(including Elasticsearch 5.x). It also now supports SSL/TLS.

 * HBaseIO is now able to do dynamic work rebalancing
 * KinesisIO uses a more accurate watermark (based on 
approximateArrivalTimestamp)
 * TextIO now supports custom delimiters and, like AvroIO, supports the readAll 
pattern,

 * Performance improvements on JdbcIO when it has to read lots of rows
 * Kafka write supports the Exactly-Once pattern (introduced in Kafka 0.11.x)
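
For illustration, a minimal sketch of the readAll pattern with TextIO (the file 
pattern below is a placeholder):

// A minimal sketch of the readAll pattern, assuming Beam 2.2's TextIO;
// the file pattern is a placeholder.
PCollection<String> patterns =
    pipeline.apply(Create.of("/tmp/input/*.txt"));
// Each incoming element is treated as a file pattern: it is matched, and the
// matched files are read line by line.
PCollection<String> lines = patterns.apply(TextIO.readAll());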

- A new DSL has been introduced: the SQL DSL !

We are now focused on the 2.3.0 release with new improvements and features !

Stay tuned !

JB on behalf of the Apache Beam community.

On 12/02/2017 11:40 PM, Reuven Lax wrote:

The Apache Beam community is pleased to announce the availability of the
2.2.0 release.

This release adds support for generic file sources and sinks (beyond TextIO and 
AvroIO) using FileIO, including support for dynamic filenames using readAll; 
this allows streaming pipelines to now read from files by continuously 
monitoring a directory for new files. Many other IOs are improved, notably 
including exactly-once support for the Kafka sink. Initial support for BEAM-SQL 
is also included in this release. For a more-complete list of major changes in 
the release, please refer to the release notes [2].


The 2.2.0 release is now the recommended version; we encourage everyone to
upgrade from any earlier releases.

We’d like to invite everyone to try out Apache Beam today and consider
joining our vibrant community. We welcome feedback, contribution and
participation through our mailing lists, issue tracker, pull requests, and
events.

- Reuven Lax, on behalf of the Apache Beam community.

[1] https://beam.apache.org/get-started/downloads/
[2]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12341044


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re:

2017-12-01 Thread Jean-Baptiste Onofré

Agree, I would prefer to do the callback in the IO rather than in the main method.

Regards
JB

On 12/01/2017 03:54 PM, Steve Niemitz wrote:
I do something almost exactly like this, but with BigtableIO instead.  I have a 
pull request open here [1] (which reminds me I need to finish this up...).  It 
would really be nice for most IOs to support something like this.


Essentially you do a GroupByKey (or some CombineFn) on the output from the 
BigtableIO, and then feed that into your function which will run when all writes 
finish.
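
A rough sketch of that pattern, assuming (as in the PR referenced here) a write 
step that emits one result element per written bundle instead of PDone; 
"writeThatEmitsResults" and "WriteResult" are hypothetical names:

// Hypothetical: stock BigtableIO returns PDone, so "writeThatEmitsResults"
// and "WriteResult" stand in for a write that emits a PCollection of results.
PCollection<WriteResult> results = rows.apply(writeThatEmitsResults);
results
    .apply(Combine.globally(Count.<WriteResult>combineFn()).withoutDefaults())
    .apply(ParDo.of(new DoFn<Long, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // The global count is only emitted once every write feeding it has
        // finished, so the follow-up work can run here.
      }
    }));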


You probably want to avoid doing something in the main method because there's no 
guarantee it'll actually run (maybe the driver will die, get killed, machine 
will explode, etc).


[1] https://github.com/apache/beam/pull/3997

On Fri, Dec 1, 2017 at 9:46 AM, NerdyNick <nerdyn...@gmail.com 
<mailto:nerdyn...@gmail.com>> wrote:


Assuming you're in Java, you could just follow on in your main method by 
checking the state of the result.

Example:
PipelineResult result = pipeline.run();
try {
  result.waitUntilFinish();
  if (result.getState() == PipelineResult.State.DONE) {
    // DO ES work
  }
} catch (Exception e) {
  result.cancel();
  throw e;
}
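
For the "DO ES work" part, the atomic alias swap from the original question is a 
single call to Elasticsearch's _aliases endpoint. A hedged sketch using the 
low-level ES REST client (host, index and alias names are placeholders):

import java.util.Collections;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

// Sketch only: host, index and alias names are placeholders.
RestClient rest = RestClient.builder(HttpHost.create("http://localhost:9200")).build();
String actions = "{ \"actions\": ["
    + "{ \"remove\": { \"index\": \"docs_v1\", \"alias\": \"docs\" } },"
    + "{ \"add\": { \"index\": \"docs_v2\", \"alias\": \"docs\" } } ] }";
rest.performRequest("POST", "/_aliases", Collections.emptyMap(),
    new NStringEntity(actions, ContentType.APPLICATION_JSON));
rest.close();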

Otherwise you could also use Oozie to construct a workflow.

On Fri, Dec 1, 2017 at 2:02 AM, Jean-Baptiste Onofré <j...@nanthrax.net
<mailto:j...@nanthrax.net>> wrote:

Hi,

yes, we had a similar question some days ago.

We can imagine having a user callback fn fired when the sink batch is
complete.

Let me think about that.

Regards
JB

On 12/01/2017 09:04 AM, Philip Chan wrote:

Hey JB,

Thanks for getting back so quickly.
I suppose in that case I would need a way of monitoring when the ES
transform completes successfully before I can proceed with doing the
swap.
The problem with this is that I can't think of a good way to
determine that termination state short of polling the new index to
check the document count compared to the size of input PCollection.
That, or maybe I'd need to use an external system like you mentioned
to poll on the state of the pipeline (I'm using Google Dataflow, so
maybe there's a way to do this with some API).
But I would have thought that there would be an easy way of simply
saying "do not process this transform until this other transform
completes".
Is there no established way of "signaling" between pipelines when
some pipeline completes, or have some way of declaring a dependency
of 1 transform on another transform?

Thanks again,
Philip

    On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste Onofré
<j...@nanthrax.net <mailto:j...@nanthrax.net> 
<mailto:j...@nanthrax.net
<mailto:j...@nanthrax.net>>> wrote:

     Hi Philip,

     You won't be able to do (3) in the same pipeline as the
Elasticsearch Sink
     PTransform ends the pipeline with PDone.

     So, (3) has to be done in another pipeline (using a DoFn) or in
another
     "system" (like Camel for instance). I would do a check of the
data in the
     index and then trigger the swap there.

     Regards
     JB

     On 12/01/2017 08:41 AM, Philip Chan wrote:

         Hi,

         I'm pretty new to Beam, and I've been trying to use the
ElasticSearchIO
         sink to write docs into ES.
         With this, I want to be able to
         1. ingest and transform rows from DB (done)
         2. write JSON docs/strings into a new ES index (done)
         3. After (2) is complete and all documents are written into
a new index,
         trigger an atomic index swap under an alias to replace the
current
         aliased index with the new index generated in step 2. This
is basically
         a single POST request to the ES cluster.

         The problem I'm facing is that I don't seem to be able to
find a way to
         have a way for (3) to happen after step (2) is complete.

         The ElasticSearchIO.Write transform returns a PDone, and
I'm not sure
         how to proceed from there because it doesn't seem to let me
do another
         apply on it to "define" a dependency.

https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html
  

Re:

2017-12-01 Thread Jean-Baptiste Onofré

Hi,

yes, we had a similar question some days ago.

We can imagine having a user callback fn fired when the sink batch is complete.
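
Purely as a sketch of the idea (this hook does not exist in Beam today; 
withBatchCompletedFn and the surrounding names are hypothetical):

// Hypothetical API, not in Beam: a callback fired once the sink has flushed
// a batch, so follow-up work could be chained there.
ElasticsearchIO.write()
    .withConnectionConfiguration(connectionConfiguration)
    .withBatchCompletedFn(batch -> LOG.info("Flushed {} docs", batch.size()));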

Let me think about that.

Regards
JB

On 12/01/2017 09:04 AM, Philip Chan wrote:

Hey JB,

Thanks for getting back so quickly.
I suppose in that case I would need a way of monitoring when the ES transform 
completes successfully before I can proceed with doing the swap.
The problem with this is that I can't think of a good way to determine that 
termination state short of polling the new index to check the document count 
compared to the size of input PCollection.
That, or maybe I'd need to use an external system like you mentioned to poll on 
the state of the pipeline (I'm using Google Dataflow, so maybe there's a way to 
do this with some API).
But I would have thought that there would be an easy way of simply saying "do 
not process this transform until this other transform completes".
Is there no established way of "signaling" between pipelines when some pipeline 
completes, or have some way of declaring a dependency of 1 transform on another 
transform?


Thanks again,
Philip

On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Hi Philip,

You won't be able to do (3) in the same pipeline as the Elasticsearch Sink
PTransform ends the pipeline with PDone.

So, (3) has to be done in another pipeline (using a DoFn) or in another
"system" (like Camel for instance). I would do a check of the data in the
index and then trigger the swap there.

Regards
JB

On 12/01/2017 08:41 AM, Philip Chan wrote:

Hi,

I'm pretty new to Beam, and I've been trying to use the ElasticSearchIO
sink to write docs into ES.
With this, I want to be able to
1. ingest and transform rows from DB (done)
2. write JSON docs/strings into a new ES index (done)
3. After (2) is complete and all documents are written into a new index,
trigger an atomic index swap under an alias to replace the current
aliased index with the new index generated in step 2. This is basically
a single POST request to the ES cluster.

The problem I'm facing is that I don't seem to be able to find a way to
have a way for (3) to happen after step (2) is complete.

The ElasticSearchIO.Write transform returns a PDone, and I'm not sure
how to proceed from there because it doesn't seem to let me do another
apply on it to "define" a dependency.

https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html

<https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html>

<https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html

<https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html>>

Is there a recommended way to construct pipeline workflows like this?

    Thanks in advance,
Philip


-- 
Jean-Baptiste Onofré

jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Regarding Beam Slack Channel

2017-11-30 Thread Jean-Baptiste Onofré

Invite sent as well.

Regards
JB

On 11/30/2017 07:19 PM, Yanael Barbier wrote:

Hello
Can I get an invite too?

Thanks,
Yanael

Le jeu. 30 nov. 2017 à 19:15, Wesley Tanaka <wtanaka+b...@wtanaka.com 
<mailto:wtanaka%2bb...@wtanaka.com>> a écrit :


Invite sent


On 11/30/2017 08:11 AM, Nalseez Duke wrote:

Hello

Can someone please add me to the Beam slack channel?

Thanks.



-- 
Wesley Tanaka

https://wtanaka.com/



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-26 Thread Jean-Baptiste Onofré

Hi all,

Quick update about the Spark 2.x runner: I updated the PR with Spark 2.x update 
only:


https://github.com/apache/beam/pull/3808

I will rebase and do new tests as soon as gitbox is back.

Don't hesitate to take a look and review.

Thanks !
Regards
JB

On 11/21/2017 08:32 AM, Jean-Baptiste Onofré wrote:

Hi Tim,

I will update the PR today for a new review round. Yes, you are correct: the 
target is 2.3.0 for end of this year (with announcement in the Release Notes).


Regards
JB

On 11/20/2017 10:09 PM, Tim wrote:

Thanks JB

 From which release will Spark 1.x be dropped please? Is this slated for 2.3.0 
at the end of the year?


Thanks,
Tim,
Sent from my iPhone


On 20 Nov 2017, at 21:21, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,
it seems we have a consensus to upgrade to Spark 2.x, dropping Spark 1.x. I 
will upgrade the PR accordingly.


Thanks all for your input and feedback.

Regards
JB


On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
Hi Beamers,
I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.
Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three 
artifacts (common, spark1, spark2). You, as users, pick up spark1 or spark2 
in your dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If 
you still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.

Thoughts ?
Thanks !
Regards
JB
 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org
Hi all,
as you might know, we are working on Spark 2.x support in the Spark runner.
I'm working on a PR about that:
https://github.com/apache/beam/pull/3808
Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's 
another topic on which Eugene, Reuven and I are discussing).
However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.
If we agree, I will update and cleanup the PR to only support and focus on 
Spark 2.x.

So, that's why I'm calling for a vote:
   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)
This vote is open for 48 hours (I have the commits ready, just waiting the 
end of the vote to push on the PR).

Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-20 Thread Jean-Baptiste Onofré

Hi Tim,

I will update the PR today for a new review round. Yes, you are correct: the 
target is 2.3.0 for end of this year (with announcement in the Release Notes).


Regards
JB

On 11/20/2017 10:09 PM, Tim wrote:

Thanks JB

 From which release will Spark 1.x be dropped please? Is this slated for 2.3.0 
at the end of the year?

Thanks,
Tim,
Sent from my iPhone


On 20 Nov 2017, at 21:21, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,
it seems we have a consensus to upgrade to Spark 2.x, dropping Spark 1.x. I 
will upgrade the PR accordingly.

Thanks all for your input and feedback.

Regards
JB


On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:
Hi Beamers,
I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.
The goal is to have your feedback as user.
Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three 
artifacts (common, spark1, spark2). You, as users, pick up spark1 or spark2 in 
your dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.
Thoughts ?
Thanks !
Regards
JB
 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org
Hi all,
as you might know, we are working on Spark 2.x support in the Spark runner.
I'm working on a PR about that:
https://github.com/apache/beam/pull/3808
Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).
However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.
If we agree, I will update and cleanup the PR to only support and focus on 
Spark 2.x.
So, that's why I'm calling for a vote:
   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)
This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).
Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-20 Thread Jean-Baptiste Onofré

Hi,

it seems we have a consensus to upgrade to Spark 2.x, dropping Spark 1.x. I will 
upgrade the PR accordingly.


Thanks all for your input and feedback.

Regards
JB

On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts 
(common, spark1, spark2). You, as users, pick up spark1 or spark2 in your 
dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).


Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: GroupByKey not happening on SparkRunner for multiple triggers.

2017-11-17 Thread Jean-Baptiste Onofré

Hi,

I think it's related to the identified issue:

https://issues.apache.org/jira/browse/BEAM-3193

I'm working on a fix.

To avoid mixing different changes in the runner, I'm holding the fix a bit due to 
the Spark 2 update.


Regards
JB

On 11/17/2017 11:50 AM, Sushil Ks wrote:

Hi,
              I have configured multiple triggers on an *UnboundedSource* for a 
Beam pipeline with *GroupByKey*, and it works as expected on DirectRunner. However, 
when I deploy on the Spark cluster it seems to not apply the 
*GroupByKey* transformation at all. Is this expected? If so, any help to get 
unblocked would be appreciated.


Regards,
Sushil Ks


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[VOTE] Choose the "new" Spark runner

2017-11-16 Thread Jean-Baptiste Onofré

Hi guys,

To illustrate the current discussion about Spark versions support, you can take 
a look at:


--
Spark 1 & Spark 2 Support Branch

https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES

This branch contains a Spark runner common module compatible with both Spark 1.x 
and 2.x. For convenience, we introduced spark1 & spark2 modules/artifacts 
containing just a pom.xml to define the dependencies set.


--
Spark 2 Only Branch

https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-ONLY

This branch is an upgrade to Spark 2.x that drops support of Spark 1.x.

As I'm ready to merge one or the other in the PR, I would like to complete the 
vote/discussion pretty soon.


Correct me if I'm wrong, but it seems that the preference is to drop Spark 1.x 
to focus only on Spark 2.x (for the Spark 2 Only Branch).


I would like to call a final vote to confirm the merge I will do:

[ ] Use Spark 1 & Spark 2 Support Branch
[ ] Use Spark 2 Only Branch

This informal vote is open for 48 hours.

Please, let me know what your preference is.

Thanks !
Regards
JB

On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts 
(common, spark1, spark2). You, as users, pick up spark1 or spark2 in your 
dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).


Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread Jean-Baptiste Onofré
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838

Essentially the data part of the document is being placed but it doesn’t
allow for other properties, such as the document ID, to be set.

This leads to two problems:

1. Beam doesn’t necessarily guarantee exactly-once execution for a given
item in a PCollection, as I understand it. This means that you may get
more than one record in Elastic for a given item in a PCollection that
you pass in.

2. You can’t do partial updates to an index. If you run a batch job
once, and then run the batch job again on the same index without
clearing it, you just double everything in there.

Is there any good way around this?

I’d be happy to try writing up a PR for this in theory, but not sure how
to best approach it. Also would like to figure out a way to get around
this in the meantime, if anyone has any ideas.

Best,

Chet

P.S. CCed echauc...@gmail.com <mailto:echauc...@gmail.com> because it
seems like he’s been doing work related to the elastic sink.











--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-15 Thread Jean-Baptiste Onofré
I think it's also related to the discussion Romain raised on the dev mailing 
list (gap between batch size, checkpointing & bundles).


Regards
JB

On 11/15/2017 09:53 AM, Etienne Chauchot wrote:

Hi Chet,

What you say is totally true, docs written using ElasticSearchIO will always 
have an ES-generated id. But it might change in the future; indeed it might be a 
good thing to allow the user to pass an id. Off the top of my head, I see 3 
possible designs for that.


a. (simplest) use a special json field for the id: if it is provided by the user 
in the input json then it is used, otherwise the id is auto-generated (see the 
sketch after this list).


b. (a bit less user friendly) a PCollection of KV with K as the id. But it forces 
the user to do a ParDo before writing to ES to output KV pairs of <id, json>


c. (a lot more complex) Allow the IO to serialize/deserialize java beans and 
have a String id field. Matching java types to ES types is quite tricky, so, 
for now, we just relied on the user to serialize his beans into json and let ES 
match the types automatically.
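
For option (a), the user side could look roughly like this (the reserved field 
name "_beam_doc_id" is hypothetical; the IO would have to recognize it, strip it 
from the source and use it as the document id):

// Sketch of option (a): inject a hypothetical reserved id field into each
// json document before handing it to the write transform.
PCollection<String> docsWithIds = docs.apply(
    ParDo.of(new DoFn<String, String>() {
      private final ObjectMapper mapper = new ObjectMapper();

      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        ObjectNode node = (ObjectNode) mapper.readTree(c.element());
        // "name" is a placeholder for whatever natural key the document has.
        node.put("_beam_doc_id", node.get("name").asText());
        c.output(mapper.writeValueAsString(node));
      }
    }));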


Related to the problems you raise below:

1. Well, the bundle is the commit entity of beam. Consider the case of 
ESIO.batchSize being smaller than the bundle size. While processing records, when 
the number of elements reaches batchSize, an ES bulk insert will be issued but no 
finishBundle. If there is a problem later on in the bundle processing before the 
finishBundle, the checkpoint will still be at the beginning of the bundle, so 
the whole bundle will be retried, leading to duplicate documents. Thanks for 
raising that! I'm CCing the dev list so that someone could correct me on the 
checkpointing mechanism if I'm missing something. Besides, I'm thinking about 
forcing the user to provide an id in all cases to work around this issue.


2. Correct.

Best,
Etienne

Le 15/11/2017 à 02:16, Chet Aldrich a écrit :

Hello all!

So I’ve been using the ElasticSearchIO sink for a project (unfortunately it’s 
Elasticsearch 5.x, and so I’ve been messing around with the latest RC) and I’m 
finding that it doesn’t allow for changing the document ID, but only lets you 
pass in a record, which means that the document ID is auto-generated. See this 
line for what specifically is happening:


https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838

Essentially the data part of the document is being placed but it doesn’t allow 
for other properties, such as the document ID, to be set.


This leads to two problems:

1. Beam doesn’t necessarily guarantee exactly-once execution for a given item 
in a PCollection, as I understand it. This means that you may get more than 
one record in Elastic for a given item in a PCollection that you pass in.


2. You can’t do partial updates to an index. If you run a batch job once, and 
then run the batch job again on the same index without clearing it, you just 
double everything in there.


Is there any good way around this?

I’d be happy to try writing up a PR for this in theory, but not sure how to 
best approach it. Also would like to figure out a way to get around this in 
the meantime, if anyone has any ideas.


Best,

Chet

P.S. CCed echauc...@gmail.com <mailto:echauc...@gmail.com> because it seems 
like he’s been doing work related to the elastic sink.







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-15 Thread Jean-Baptiste Onofré

Any additional feedback about that ?

I will update the thread with the two branches later today: the one with Spark 
1.x & 2.x support, and the one with the Spark 2.x upgrade.


Thanks
Regards
JB

On 11/13/2017 09:32 AM, Jean-Baptiste Onofré wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts 
(common, spark1, spark2). You, as users, pick up spark1 or spark2 in your 
dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).


Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Jean-Baptiste Onofré

Hi Tim,

Basically, if an user still wants to use Spark 1.x, he would just be "stuck" 
with Beam 2.2.0.


I would like to see a Beam 2.3.0 end of December/beginning of January with Spark 
2.x support (exclusive or with 1.x).


The goal of the discussion is just to know if it's worth maintaining Spark 1.x 
and 2.x or if I can do a cleanup to support only 2.x ;)


Regards
JB

On 11/13/2017 10:56 AM, Tim Robertson wrote:

Thanks JB

On "thoughts":

- Cloudera 5.13 will still default to 1.6 even though a 2.2 parcel is available 
(HWX provides both)
- Cloudera support for spark 2 has a list of exceptions 
(https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html)

   - I am not sure if the HBaseIO would be affected
   - I am not sure if structured streaming would have implications
   - it might stop customers from being able to run spark 2 at all due to 
support agreements

- Spark 2.3 (EOY) will drop Scala 2.10 support
- IBM's now defunct distro only has 1.6
- Oozie doesn't have a spark 2 action (need to use a shell action)
- There are a lot of folks with code running on 1.3,1.4 and 1.5 still
- Spark 2.2+ requires Java 8 too, while <2.2 was J7 like Beam (not sure if this 
has other implications for the cross-deployment nature of Beam)


My first impression of Beam was really favourable as it all just worked first 
time on a CDH Spark 1.6 cluster.  For us it is a lack of resources to refactor 
legacy code that delays the 2.2 push.


With that said I think it is very reasonable to have a clear cut-off in Beam, 
especially if it limits progress / causes headaches in packaging, robustness 
etc.  I'd recommend putting it in a 6-month timeframe which might align with 2.3?


Hope this helps,
Tim











On Mon, Nov 13, 2017 at 10:07 AM, Neville Dipale <nevilled...@gmail.com 
<mailto:nevilled...@gmail.com>> wrote:


Hi JB,


   [X ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of
both Spark 1.x and 2.x (please provide specific comment)

    On 13 November 2017 at 10:32, Jean-Baptiste Onofré <j...@nanthrax.net
<mailto:j...@nanthrax.net>> wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the
user mailing list.
The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three
artifacts (common, spark1, spark2). You, as users, pick up spark1 or
spark2 in your dependencies set depending on the Spark target version you 
want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0.
If you still want to use Spark 1.x, then you will be stuck at Beam
2.2.0.

Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>>
Reply-To: d...@beam.apache.org <mailto:d...@beam.apache.org>
To: d...@beam.apache.org <mailto:d...@beam.apache.org>

Hi all,

as you might know, we are working on Spark 2.x support in the Spark 
runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808
<https://github.com/apache/beam/pull/3808>

Today, we have something working with both Spark 1.x and 2.x from a code
standpoint, but I have to deal with dependencies. It's the first step of
the update as I'm still using RDD, the second step would be to support
dataframe (but for that, I would need PCollection elements with schemas,
that's another topic on which Eugene, Reuven and I are discussing).

However, as all major distributions now ship Spark 2.x, I don't think
it's required anymore to support Spark 1.x.

If we agree, I will update and cleanup the PR to only support and focus
on Spark 2.x.

So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having
support of both Spark 1.x and 2.x (please provide specific comment)

This vote is open for 48 hours (I have the commits ready, just waiting
the end of the vote to push on the PR).

    Thanks !
Regards
JB
-- 
Jean-Baptiste Onofré

jbono...@apache.org <mailto:jbono...@apache.org>
    http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam and Spark runner

2017-11-13 Thread Jean-Baptiste Onofré

See my answers on the dev mailing list.

NB: no need to "flood" both mailing lists ;)

Regards
JB

On 11/13/2017 10:56 AM, Nishu wrote:

Hi,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: To join multiple kafka streams using windowed collections.  I use 
GroupByKey to group the events based on a common business key, and that output is 
used as input for the Join operation. The pipeline runs on the direct runner as 
expected, but on a Spark cluster (v2.1) it throws the Accumulator error:
"Exception in thread "main" java.lang.AssertionError: assertion failed: 
copyAndReset must return a zero value copy"

I tried the same pipeline on a Spark cluster (v1.6); there it runs without any 
error but doesn't perform the join operations on the streams.


I got couple of questions.

1. Does spark runner support spark version 2.x?

2. Regarding the triggers, currently only ProcessingTimeTrigger is supported in 
the Capability Matrix 
<https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what>; 
can we expect support for more triggers in the near future? 
Also, are the GroupByKey and accumulating-panes features supported on Spark 
for streaming pipelines?


3. According to the documentation, Storage level 
<https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner> is 
set to IN_MEMORY for streaming pipelines. Can we configure it to disk as well?


4. Is the checkpointing feature supported for the Spark runner? In case the Beam 
pipeline fails unexpectedly, can we read the state from the last run?

It would be great if someone could help with the above.

--
Thanks & Regards,
Nishu Tayal


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Jean-Baptiste Onofré

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as user.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts 
(common, spark1, spark2). You, as users, pick up spark1 or spark2 in your 
dependencies set depending on the Spark target version you want.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you 
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

  [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
  [ ] 0 (I don't care ;))
  [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting the end 
of the vote to push on the PR).


Thanks !
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam support on Spark 2.x

2017-11-10 Thread Jean-Baptiste Onofré

Hi,

I guess you are not following the dev mailing list.

Spark runner supports almost all transforms and yes, you can fully use Spark 
runner to run your pipelines.


PCollection is represented with RDD and it's currently Spark 1.x.

I'm working on the Spark 2.x support (still using RDD): we have a vote in 
progress on the mailing list about whether we want to support both Spark 1.x & 
Spark 2.x or just upgrade to Spark 2.x and drop support for Spark 1.x.


You can take a look on the beam-samples: they all run using the Spark runner.

Regards
JB

On 11/10/2017 01:46 PM, Artur Mrozowski wrote:

Hi,
I have seen the compatibility matrix and I realize that Spark is not the most 
supported runner.
I am curious if it is possible to run a pipeline on Spark, say with global 
windows, after-processing triggers and group by key (CoGroupByKey, CombineByKey). 
We definitely have problems executing a pipeline that successfully runs on the 
direct runner.


Is that a known issue? Is Flink the best option?

Best Regards
Artur


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: London Apache Beam meetup 2: call for speakers

2017-11-09 Thread Jean-Baptiste Onofré

We don't, but it has to be voted by PMC and approved by legal.

Regards
JB

On 11/09/2017 10:24 PM, Griselda Cuevas wrote:
Question for the PMCs and Committers - do we have a YT channel already or do we 
need to create one? If we need to create one, do we need to ask someone in 
press@ or marketing@?





Gris Cuevas Zambrano

g...@google.com <mailto:g...@google.com>

Open Source Strategy

345 Spear Street, San Francisco, 94105



On 9 November 2017 at 14:07, AndrasNagy <start.and...@gmail.com 
<mailto:start.and...@gmail.com>> wrote:


Thx.
Great pls post the links here too ;)

On 9 Nov 2017 10:05 pm, "Griselda Cuevas" <g...@google.com
<mailto:g...@google.com>> wrote:

Great suggestion (about the Youtube channel). I have the recording of
the talks we gave here in SF this past 11/1 but I need to get it from
our tech team and upload it somewhere, will look into a youtube channel 
:)




Gris Cuevas Zambrano

g...@google.com <mailto:g...@google.com>

Open Source Strategy

345 Spear Street, San Francisco, 94105

<https://maps.google.com/?q=345+Spear+Street,+San+Francisco,+94105=gmail=g>



On 9 November 2017 at 14:01, AndrasNagy <start.and...@gmail.com
<mailto:start.and...@gmail.com>> wrote:

Hey.
Please make sure all talks are going to be recorded and later
uploaded to youtube or ustream.

We are all really interested but many people (include myself) do not
have the opportunity to attend.

Thx

On 9 Nov 2017 21:55, "Griselda Cuevas" <g...@google.com
<mailto:g...@google.com>> wrote:

Maybe send a call for speakers and see if we have anyone from
the community visiting London who could be a speaker?

It'd be great to have one London Meetup on 12/5 because we're
hosting one in SF and one in NY so it'd be awesome to have three
on the same day!




Gris Cuevas Zambrano

g...@google.com <mailto:g...@google.com>

Open Source Strategy

345 Spear Street, San Francisco, 94105

<https://maps.google.com/?q=345+Spear+Street,+San+Francisco,+94105=gmail=g>



On 8 November 2017 at 15:09, Matthias Baetens
<baetensmatth...@gmail.com <mailto:baetensmatth...@gmail.com>>
wrote:

No worries JB, I'll send you a message on how we can plan
around this (reschedule the meetup or postpone your 
session).
Thanks for the heads-up, have fun in Singapore!

    Best,
Matthias

Op di 7 nov. 2017 om 04:52 schreef Jean-Baptiste Onofré
<j...@nanthrax.net <mailto:j...@nanthrax.net>>:

Hi,

unfortunately, I have to decline the invite as I will be
at Strata Singapore in
the same time :(

I'm very sorry about that. You can count on me for the
3rd edition !

Regards
JB

On 11/07/2017 01:41 AM, Matthias Baetens wrote:
 > Hi all!
 >
 > Hope you are well.
 > We are back for a second edition of the London Apache
Beam meetup, aiming for
 > the 5th of December.
 >
         > We are pretty excited to announce that our first
speaker will be
 > Jean-Baptiste Onofré <https://github.com/jbonofre>
himself!
 >
 > If you have an interesting *use-case* to share and
are in London on the *5th of
 > December*, don't hesitate to reach out to me :)
 > Else: keep track of the meetup page
 > <https://www.meetup.com/London-Apache-Beam-Meetup/
<https://www.meetup.com/London-Apache-Beam-Meetup/>> to
be updated on our
 > activity in the space.
     >
 > Best,
 > Matthias
 > --

--
Jean-Baptiste Onofré
    jbono...@apache.org <mailto:jbono...@apache.org>
  

Re: London Apache Beam meetup 2: call for speakers

2017-11-06 Thread Jean-Baptiste Onofré

Hi,

unfortunately, I have to decline the invite as I will be at Strata Singapore in 
the same time :(


I'm very sorry about that. You can count on me for the 3rd edition !

Regards
JB

On 11/07/2017 01:41 AM, Matthias Baetens wrote:

Hi all!

Hope you are well.
We are back for a second edition of the London Apache Beam meetup, aiming for 
the 5th of December.


We are pretty excited to announce that our first speaker will be 
Jean-Baptiste Onofré <https://github.com/jbonofre> himself!


If you have an interesting *use-case* to share and are in London on the *5th of 
December*, don't hesitate to reach out to me :)
Else: keep track of the meetup page 
<https://www.meetup.com/London-Apache-Beam-Meetup/> to be updated on our 
activity in the space.


Best,
Matthias
--


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam Slack Channel Invitation Request

2017-11-06 Thread Jean-Baptiste Onofré

Done, you should have received an invite.

Welcome !

Regards
JB

On 11/06/2017 12:57 PM, Damien Hawes wrote:

I too would like an invite to the slack channel.

On 5 Nov 2017 05:11, "Mingmin Xu" <mingm...@gmail.com 
<mailto:mingm...@gmail.com>> wrote:


sent, welcome!

On Sat, Nov 4, 2017 at 4:42 PM, Tristan Shephard <tristanasheph...@gmail.com
<mailto:tristanasheph...@gmail.com>> wrote:

Hello,

Can someone please add me to the Beam slack channel?

Thanks in advance,
Tristan




    -- 
----

Mingmin



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Regarding Beam Slack Channel

2017-10-26 Thread Jean-Baptiste Onofré

Done,

you should have received an invite.

Welcome aboard !

Regards
JB

On 10/26/2017 02:20 PM, Lovis Dahl wrote:

I'd love an invite too!

/L

On Sun, 15 Oct 2017 at 08:09 Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Invitation sent,

Welcome !
Regards
JB

On 10/15/2017 07:44 AM, Sushil Ks wrote:
 > +1
 > Kindly add me to beam slack channel.
 >
 >
 > On Oct 14, 2017 5:02 AM, "Lukasz Cwik" <lc...@google.com
<mailto:lc...@google.com>
 > <mailto:lc...@google.com <mailto:lc...@google.com>>> wrote:
 >
 >     Invite sent, welcome.
 >
 >     On Fri, Oct 13, 2017 at 3:07 PM, NerdyNick <nerdyn...@gmail.com
<mailto:nerdyn...@gmail.com>
 >     <mailto:nerdyn...@gmail.com <mailto:nerdyn...@gmail.com>>> wrote:
 >
 >         Hello
 >
 >         Can someone please add me to the Beam slack channel?
 >
 >         Thanks.
 >
 >

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Lovis Dahl
Backend engineer, Greta.io


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: ElasticSearch with RestHighLevelClient

2017-10-23 Thread Jean-Baptiste Onofré

Hi Ryan,

the last version of ElasticsearchIO (that will be included in Beam 2.2.0) 
supports Elasticsearch 5.x.


The client should be created in the @Setup (or @StartBundle) and released cleanly 
in @Teardown (or @FinishBundle). Then, it's used in @ProcessElement to actually 
store the elements of the PCollection.
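
A minimal sketch of that lifecycle, assuming the ES 5.6 REST clients (index and 
type names are placeholders); closing the underlying low-level client in 
@Teardown is what releases its I/O threads:

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public static class WriteFn extends DoFn<String, Void> {

  private final String elasticUrl;
  private transient RestClient lowLevelClient;
  private transient RestHighLevelClient client;

  public WriteFn(String elasticUrl) {
    this.elasticUrl = elasticUrl;
  }

  @Setup
  public void setup() {
    lowLevelClient = RestClient.builder(HttpHost.create(elasticUrl)).build();
    client = new RestHighLevelClient(lowLevelClient);
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    // "docs"/"doc" are placeholder index/type names.
    client.index(new IndexRequest("docs", "doc")
        .source(c.element(), XContentType.JSON));
  }

  @Teardown
  public void teardown() throws IOException {
    // Closing the low-level client releases its I/O threads; without this,
    // the non-daemon client threads can keep the JVM alive after the
    // pipeline finishes.
    if (lowLevelClient != null) {
      lowLevelClient.close();
    }
  }
}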


Regards
JB

On 10/23/2017 08:53 PM, Ryan Bobko wrote:

Hi JB,
Thanks for your input. I'm trying to update ElasticsearchIO, and
hopefully learn a bit about Beam in the process. The documentation
says ElasticsearchIO only works with ES 2.X, and I'm using ES 5.6. I'd
prefer not to have two ES libs in my classpath if I can avoid it. I'm
just getting started, so my pipeline is quite simple:

pipeline.apply( "Raw Reader", reader ) // read raw files
 .apply( "Document Generator", ParDo.of( extractor ) ) //
create my document objects for ES insertion
 .apply( "Elastic Writer", new ElasticWriter( ... ); // upload to ES


public final class ElasticWriter extends
PTransform<PCollection, PDone> {

   private static final Logger log = LoggerFactory.getLogger(
ElasticWriter.class );
   private final String elasticurl;

   public ElasticWriter( String url ) {
 elasticurl = url;
   }

   @Override
   public PDone expand( PCollection input ) {
 input.apply( ParDo.of( new WriteFn( elasticurl ) ) );
 return PDone.in( input.getPipeline() );
   }

   public static class WriteFn extends DoFn<Document, Void> implements
Serializable {

 private transient RestHighLevelClient client;
 private final String elasticurl;

 public WriteFn( String elasticurl ) {
   this.elasticurl = elasticurl;
 }

 @Setup
 public void setup() {
   log.debug( " into WriteFn::setup" );
   HttpHost elastic = HttpHost.create( elasticurl );
   RestClientBuilder bldr = RestClient.builder( elastic );

   // if this is uncommented, the program never exits
   //client = new RestHighLevelClient( bldr.build() );
 }

 @Teardown
 public void teardown() {
   log.debug( " into WriteFn::teardown" );
   // there's nothing to tear down
 }

 @ProcessElement
 public void pe( ProcessContext c ) {
   Document doc = DocumentImpl.from( c.element() );
   log.debug( "writing {} to elastic", doc.getMetadata().first(
Metadata.NAME ) );

   // this is where I want to write to ES, but for now, just write
a text file

   ObjectMapper mpr = new ObjectMapper();

   try ( Writer fos = new BufferedWriter( new FileWriter( new File(
"/tmp/writers",
   doc.getMetadata().first( Metadata.NAME ).asString() ) ) ) ) {
 mpr.writeValue( fos, doc );
   }
   catch ( IOException ioe ) {
 log.error( ioe.getLocalizedMessage(), ioe );
   }
 }
   }
}


On Mon, Oct 23, 2017 at 2:28 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Ryan,

Why don't you use the ElasticsearchIO for that ?

Anyway, can you share your pipeline where you have the ParDo calling your
DoFn ?

Thanks,
Regards
JB


On 10/23/2017 07:50 PM, r...@ostrich-emulators.com wrote:


Hi List,
I'm trying to write an updated ElasticSearch client using the
newly-published RestHighLevelClient class (with ES 5.6.0). I'm only
interested in writes at this time, so I'm using the ElasticsearchIO.write()
function as a model. I have a transient member named client. Here's my setup
function for my DoFn:

@Setup
public void setup() {
HttpHost elastic = HttpHost.create( elasticurl );
RestClientBuilder bldr = RestClient.builder( elastic );
client = new RestHighLevelClient( bldr.build() );
}

If I run the code as shown, I eventually get the debug message: "Pipeline
has terminated. Shutting down." but the program never exits. If I comment
out the client assignment above, the pipeline behaves normally (but
obviously, I can't write anything to ES).

Any advice for a dev just getting started with Apache Beam (2.0.0)?

ry



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: ElasticSearch with RestHighLevelClient

2017-10-23 Thread Jean-Baptiste Onofré

Hi Ryan,

Why don't you use the ElasticsearchIO for that ?

Anyway, can you share your pipeline where you have the ParDo calling your DoFn ?

Thanks,
Regards
JB

On 10/23/2017 07:50 PM, r...@ostrich-emulators.com wrote:

Hi List,
I'm trying to write an updated ElasticSearch client using the newly-published 
RestHighLevelClient class (with ES 5.6.0). I'm only interested in writes at 
this time, so I'm using the ElasticsearchIO.write() function as a model. I have 
a transient member named client. Here's my setup function for my DoFn:

@Setup
public void setup() {
   HttpHost elastic = HttpHost.create( elasticurl );
   RestClientBuilder bldr = RestClient.builder( elastic );
   client = new RestHighLevelClient( bldr.build() );
}

If I run the code as shown, I eventually get the debug message: "Pipeline has 
terminated. Shutting down." but the program never exits. If I comment out the client 
assignment above, the pipeline behaves normally (but obviously, I can't write anything to 
ES).

Any advice for a dev just getting started with Apache Beam (2.0.0)?

ry



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: How to use ConsoleIO SDK

2017-10-19 Thread Jean-Baptiste Onofré

Hi,

I started to work on the ConsoleIO (and SocketIO too), but it's not yet merged.

I can provide a SNAPSHOT to you if you wanna try.

Regards
JB

On 10/20/2017 04:14 AM, linr...@itri.org.tw wrote:

Dear sir,

I have a question about how to use the Beam Java SDK ConsoleIO.

My objective is to write the PCollection "data" to the console, and then use it 
(type: RDD??) as another variable to do other work.


If any further information is needed, I am glad to be informed and will provide 
it to you as soon as possible.


I am looking forward to hearing from you.

My Java code is as follows:

“

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.runners.spark.io.ConsoleIO;
import org.apache.beam.runners.spark.SparkPipelineOptions;

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;

import javafx.util.Pair;

import org.joda.time.Duration;
import org.joda.time.Instant;
import org.joda.time.MutableDateTime;

public static void main(String[] args) throws IOException {

    MutableDateTime mutableNow = Instant.now().toMutableDateTime();
    mutableNow.setDateTime(2017, 7, 12, 14, 0, 0, 0);
    Instant starttime = mutableNow.toInstant().plus(8 * 60 * 60 * 1000);

    int min;
    int sec;
    int millsec;
    min = 2;
    sec = min * 60;
    millsec = sec * 1000;

    double[] value = new double[] {1.0, 2.0, 3.0, 4.0, 5.0};
    List<TimestampedValue<KV<String, Pair<Integer, Double>>>> dataList = new ArrayList<>();

    int n = value.length;
    int count = 0;
    for (int i = 0; i < n; i++) {
        count = count + 1;
        if (i <= 3) {
            Instant M1_time = starttime.plus(millsec * count);
            dataList.add(TimestampedValue.of(
                KV.of("M1", new Pair<Integer, Double>(i, value[i])), M1_time));
        } else if (4 <= i && i < 5) {
            Instant M2_time = starttime.plus(millsec * count);
            dataList.add(TimestampedValue.of(
                KV.of("M1", new Pair<Integer, Double>(i, value[i])), M2_time));
        } else {
            Instant M3_time = starttime.plus(millsec * count);
            dataList.add(TimestampedValue.of(
                KV.of("M1", new Pair<Integer, Double>(i, value[i])), M3_time));
        }
        System.out.println("raw_data=" + dataList.get(i));
    }

    SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("local[4]");

    Pipeline p = Pipeline.create(options);
    PCollection<KV<String, Pair<Integer, Double>>> data =
        p.apply("create data with time", Create.timestamped(dataList));

    data.apply("spark_write_on_console", ConsoleIO.Write.out());

    p.run().waitUntilFinish();
}

”

Thanks very much

Sincerely yours,

Liang-Sian Lin, Dr.

Oct 20 2017





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-18 Thread Jean-Baptiste Onofré
What happens for the users on Spark 1.5 who run with Java 7 only?

On Oct 18, 2017, at 12:06, "Ismaël Mejía" wrote:
>+1
>
>I forgot to vote yesterday, I don't really think this is a change
>worth requiring a major version of Beam. Just clear information in the
>site/release notes should make it. Also I am afraid that if we wait
>until we have enough changes to switch Beam to a new major version the
>switch to Java 8 will happen too late, probably after Java 8's end of
>life. And I am not exaggerating, Java 8 is planned to EOL next march
>2018! (of course Oracle usually changes this), in any case go go Java
>8 ASAP !
>
>
>On Wed, Oct 18, 2017 at 8:08 AM, Prabeesh K. 
>wrote:
>> +1
>>
>> On 18 October 2017 at 05:16, Griselda Cuevas  wrote:
>>>
>>> +1
>>>
>>> On 17 October 2017 at 16:36, Robert Bradshaw 
>wrote:

 +1 to removing Java 7 support, pending no major user outcry to the
 contrary.

 In terms of versioning, I fall into the camp that this isn't
 sufficiently incompatible to warrant a major version increase.
 Semantic versioning is all about messaging, and upgrading the major
 version so soon after GA for such a minor change would IMHO cause more
 confusion than clarity. Hitting 3.0 should signal a major improvement
 to Beam itself.

 On Tue, Oct 17, 2017 at 3:52 PM, Eugene Kirpichov
>
 wrote:
 > +1 to removing Java 7 support.
 >
 > In terms of release 3.0, we can handle this two ways:
 > - Wait until enough other potentially incompatible changes
>accumulate,
 > do
 > all of them, and call it a "3.0" release, so that 3.0 will truly
>differ
 > in a
 > lot of incompatible and hopefully nice ways from 2.x. This might
>well
 > take a
 > year or so.
 > - Make a release in which Java 7 support is removed, and call it
>a
 > "3.0"
 > release just to signal the incompatibility, and other potentially
 > incompatible changes will wait until "4.0" etc.
 >
 > I suppose the decision depends on whether we have a lot of other
 > incompatible changes we would like to do, and whether we have any
>other
 > truly great features enabled by those changes, or at least truly
>great
 > features justifying increasing the major version number. If we go
>with
 > #1,
 > I'd say, among the current work happening in Beam, portability
>comes to
 > mind
 > as a sufficiently huge milestone, so maybe drop Java 7 in the
>same
 > release
 > that offers a sufficient chunk of the portability work?
 >
 > (There's also a third path: declare that dropping Java7 support
>is not
 > sufficiently "incompatible" to warrant a major version increase,
 > because
 > people don't have to rewrite their code but only switch their
>compiler
 > version, and people who already use a Java8 compiler won't even
>notice.
 > This
 > path could perhaps be considered if we had evidence that
>switching to a
 > Beam
 > release without Java7 support would require 0 work for an
>overwhelming
 > majority of users)
 >
 >
 >
 > On Tue, Oct 17, 2017 at 3:34 PM Randal Moore
>
 > wrote:
 >>
 >> +1
 >>
 >> On Tue, Oct 17, 2017 at 5:21 PM Raghu Angadi
>
 >> wrote:
 >>>
 >>> +1.
 >>>
 >>> On Tue, Oct 17, 2017 at 2:11 PM, David McNeill
>
 >>> wrote:
 
  The final version of Beam that supports Java 7 should be
>clearly
  stated
  in the docs, so those stuck on old production infrastructure
>for
  other java
  app dependencies know where to stop upgrading.
 
  David McNeill
  021 721 015
 
 
 
  On 18 October 2017 at 05:16, Ismaël Mejía 
>wrote:
 >
 > We have discussed recently in the developer mailing list
>about the
 > idea of removing support for Java 7 on Beam. There are
>multiple
 > reasons for this:
 >
 > - Java 7 has not received public updates for almost two years
>and
 > most
 > companies are moving / have already moved to Java 8.
 > - A good amount of the systems Beam users rely on have
>decided to
 > drop
 > Java 7 support, e.g. Spark, Flink, Elasticsearch, even Hadoop
>plans
 > to
 > do it on version 3.
 > - Most Big data distributions and Cloud managed Spark/Hadoop
 > services
 > have already moved to Java 8.
 > - Recent versions of core libraries Beam uses are moving to
>be Java
 > 8
 > only (or mostly), e.g. Guava, Google Auto, etc.
 > - Java 8 has some nice features that can make Beam code nicer
>e.g.
 > lambdas, streams.
 >
 > 

Re: DoFn setup/teardown sequence

2017-10-16 Thread Jean-Baptiste Onofré

Yes, no problem at all. I meant that the DoFn is "attached" to a pipeline.

Regards
JB

On 10/16/2017 08:25 AM, Derek Hao Hu wrote:

I believe a worker can execute multiple instances (i.e. threads) of a DoFn.

Derek

On Sun, Oct 15, 2017 at 10:46 PM, Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Hi,

Correct, @Setup is used when bootstrapping the DoFn, @StartBundle is called
for a set of data (bundle), @ProcessElement is for each element in the
bundle/collection, @FinishBundle at the end of the dataset (bundle),
@Teardown is called when the DoFn is "removed".

A DoFn is per pipeline.

Regards
JB


On 10/16/2017 07:31 AM, Jacob Marble wrote:

(there might be documentation on this that I didn't find; if so a link
is sufficient)

Good evening, this is just a check on my understanding. It looks like an
instance of a given DoFn goes through this lifecycle. Am I correct?

- constructor
- @Setup (once)
    - @StartBundle (zero to many times)
      - @ProcessElement (zero to many times)
    - @FinishBundle
- @Teardown (once)

Can any of these steps be called concurrently? (I believe no)
Can one worker execute multiple instances of a DoFn? (I believe yes)

Thank you,

    Jacob


-- 
Jean-Baptiste Onofré

jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com




--
Derek Hao Hu

Software Engineer | Snapchat
Snap Inc.


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: DoFn setup/teardown sequence

2017-10-15 Thread Jean-Baptiste Onofré

Hi,

Correct, @Setup is used when bootstrapping the DoFn, @StartBundle is called for 
a set of data (bundle), @ProcessElement is for each element in the 
bundle/collection, @FinishBundle at the end of the dataset (bundle), @Teardown 
is called when the DoFn is "removed".


A DoFn is per pipeline.
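
Concretely, the lifecycle maps onto annotated methods of the DoFn. A minimal
sketch (the method bodies are illustrative):

import org.apache.beam.sdk.transforms.DoFn;

static class LifecycleFn extends DoFn<String, String> {

  @Setup
  public void setup() {
    // Once per DoFn instance: open clients, build caches.
  }

  @StartBundle
  public void startBundle() {
    // Before each bundle of elements.
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Once per element in the bundle.
    c.output(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // After each bundle: flush buffers, etc.
  }

  @Teardown
  public void teardown() {
    // Once, when the instance is discarded.
  }
}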

Regards
JB

On 10/16/2017 07:31 AM, Jacob Marble wrote:
(there might be documentation on this that I didn't find; if so a link is 
sufficient)


Good evening, this is just a check on my understanding. It looks like an 
instance of a given DoFn goes through this lifecycle. Am I correct?


- constructor
- @Setup (once)
   - @StartBundle (zero to many times)
     - @ProcessElement (zero to many times)
   - @FinishBundle
- @Teardown (once)

Can any of these steps be called concurrently? (I believe no)
Can one worker execute multiple instances of a DoFn? (I believe yes)

Thank you,

Jacob


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam meetup London 1: debrief

2017-10-11 Thread Jean-Baptiste Onofré

Hey,

London is a just a little hop from Paris ;) So, I can take a flight easily ;)

Please let me know when you have some dates.

Thanks,
Regards
JB

On 10/10/2017 10:19 PM, Matthias Baetens wrote:
Awesome JB, glad you are happy. I'm having a call with Gris from Google and 
Victor (he's co-organising these with me) for the next meetup, probably in 
November. Do you have any plans to be in London that month? We would be 
happy to adjust the timing to your needs to be able to have you talk!


Thanks :)
Best,
Matthias

On Tue, 10 Oct 2017 at 14:23, Jean-Baptiste Onofré <j...@nanthrax.net 
<mailto:j...@nanthrax.net>> wrote:


Thanks for the update !

That's a great stuff !

Looking forward next Beam meetup and hope to be able to participate !

Regards
JB

On 10/10/2017 03:11 PM, Matthias Baetens wrote:
 > Hi all!
 >
 > We had our first Apache Beam meetup last week, and we just released a small
 > blogpost
 > <http://blog.datatonic.com/2017/10/first-apache-beam-meetup-in-london.html>
 > about it!
 > Stay tuned for more of these or reach out if you have something
 > interesting to share :)
 >
 > Also: if you are thinking of setting up something similar in your city, I am
 > happy to join forces and share thoughts and ideas. Don't hesitate to reach
 > out to Griselda (in cc) as well - she has been a great help for us so far!
 >
     > Best,
 > Matthias
 > --

--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
http://blog.nanthrax.net
Talend - http://www.talend.com

--


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Slack invitation

2017-10-10 Thread Jean-Baptiste Onofré

Done,

welcome aboard !

Regards
JB

On 10/10/2017 07:04 PM, t osh wrote:

Hello

Could someone please add me to the Beam slack channel?  Thanks.

Taichi Oshiumi


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Spark and Beam

2017-09-29 Thread Jean-Baptiste Onofré
your runner (target environment). For testing the direct runner is 
enough but to run on spark you will need to import the spark one as dependency.
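
For the sc.parallelize snippet below, a rough Beam equivalent looks like this
(a hedged sketch: it assumes Product is a Serializable POJO with the same
constructor, and that "options" is any PipelineOptions, e.g. for the direct
or the Spark runner):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

List<Integer> ids = new ArrayList<>();
for (int i = 1; i <= 10; i++) {
  ids.add(i);
}

Pipeline p = Pipeline.create(options);
PCollection<Product> products =
    p.apply("parallelize", Create.of(ids))
     .apply("map to products", MapElements.via(
         new SimpleFunction<Integer, Product>() {
           @Override
           public Product apply(Integer i) {
             Random random = new Random();
             return new Product(i, "Description of product " + i,
                 random.nextInt(10), random.nextBoolean());
           }
         }));
p.run().waitUntilFinish();

Create.of plays the role of sc.parallelize here; switching between the direct
runner and the Spark runner is then only a matter of pipeline options and
dependencies.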



Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://blog-rmannibucau.rhcloud.com> | Old Blog
<http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau> | JavaEE Factory
<https://javaeefactory-rmannibucau.rhcloud.com>

2017-09-26 11:02 GMT+02:00 tal m <tal.m...@gmail.com
<mailto:tal.m...@gmail.com>>:

Hi,
I looked at the links you sent me, and I haven't found
any clue how to adapt it to my current code.
My code is very simple:

val sc = spark.sparkContext

val productsNum = 10
println(s"Saving $productsNum products RDD to the space")
val rdd = sc.parallelize(1 to productsNum).map { i =>
  Product(i, "Description of product " + i,
    Random.nextInt(10), Random.nextBoolean())
}

Is it that simple to use Beam instead of SparkContext?
I'm not familiar with Spark at all, so I have no idea what the Spark runner is
or how I can use it in my case; I just need to make it work :).

Thanks Tal


On Tue, Sep 26, 2017 at 11:57 AM, Aviem Zur
<aviem...@gmail.com <mailto:aviem...@gmail.com>> wrote:

Hi Tal,

Thanks for reaching out!

Please take a look at our documentation:

Quickstart guide (Java):
https://beam.apache.org/get-started/quickstart-java/

<https://beam.apache.org/get-started/quickstart-java/>
This guide will show you how to run our wordcount
example using each any of the runners (For example,
direct runner or Spark runner in your case).

More reading:
Programming guide:

https://beam.apache.org/documentation/programming-guide/

<https://beam.apache.org/documentation/programming-guide/>
Spark runner:
https://beam.apache.org/documentation/runners/spark/

<https://beam.apache.org/documentation/runners/spark/>

Please let us know if you have further questions,
and good luck with your first try of Beam!

Aviem.

On Tue, Sep 26, 2017 at 11:47 AM tal m
<tal.m...@gmail.com <mailto:tal.m...@gmail.com>> 
wrote:

Hi,
I'm new to Spark and also to Beam. Currently I have Java code
that uses Spark for reading some data from a DB.
My Spark code uses SparkSession.builder (.)
and also SparkContext.
How can I make Beam work similarly to my current
code? I just want to make it work for now.
Thanks Tal


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Slack invitation

2017-09-12 Thread Jean-Baptiste Onofré

Done,

Welcome !

Regards
JB

On 09/12/2017 11:12 AM, Pawel Bartoszek wrote:

Can I get an invitation to Slack?


Pawel Bartoszek


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

