Re: Resizing Beam IOITs

2019-10-08 Thread Chamikara Jayalath
On Tue, Oct 8, 2019 at 6:52 AM Michał Walenia 
wrote:

> Hi all,
> I'm working on resizing IO integration tests in Beam and I'd like to ask
> for the community's opinion.
>
> Right now each IO integration test has a set of four predetermined sizes
> (1000, 100k, 1M and 100M elements).
> For every size there is a precalculated hash for read correctness
> checking.
> As it is now, measuring throughput in an IOIT is very costly - accessing
> memory for each PCollection element increases the runtime of the test
> manyfold, which changes the runtime measurements.
>
> My proposed improvements change the test sizes, add dataset size reporting
> to metrics (throughput will be possible to calculate at dashboard level)
> and change the way test parameters are passed.
> The changes are in a PR here .
> Tests were resized to about 1GB each.
> Test configurations would be set by one string parameter in pipeline
> options (e.g. "testConfigName=XML_1GB" instead of
> "numberOfRecords=100").
>
> What in general do you think about this approach? Do you think that 1GB
> test datasets are reasonable?
> Thanks,
>

Thanks Michal. I think these tests currently fulfil two purposes:
(1) As end-to-end integration tests that confirm that connectors work with
a given runner.
(2) As large-scale performance tests for tracking performance and
triggering alerts.

It might be good to separate these two cases and run two integration
tests for each connector. For example:
(1) A version with a small input (say 1KB - 1MB) that we run often,
potentially with every run of the post-commit test suite.
(2) A version with a large input (say 10-100 GB, depending on the
connector) that is used for performance tracking and triggering alerts.
This version should be run less frequently (for example, once a day).
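
To make the split concrete, here is a minimal Python sketch of how the two
variants could be described. Everything in it is an assumption for
illustration: the IoitVariant type, the variant names, the concrete sizes
(picked from the ranges above) and the schedule labels are hypothetical and
do not correspond to existing Beam options.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IoitVariant:
    """One flavour of an IO integration test (hypothetical type)."""
    name: str
    dataset_bytes: int  # approximate input size
    schedule: str       # trigger on which the variant runs


# Sizes are placeholders picked from the ranges suggested above.
SMOKE = IoitVariant("smoke", 1 * 1024 ** 2, "post-commit")
PERFORMANCE = IoitVariant("performance", 10 * 1024 ** 3, "daily")


def variants_for(trigger: str) -> List[IoitVariant]:
    """Return the variants that should run for a given trigger."""
    return [v for v in (SMOKE, PERFORMANCE) if v.schedule == trigger]


if __name__ == "__main__":
    for trigger in ("post-commit", "daily"):
        print(trigger, "->", [v.name for v in variants_for(trigger)])
```

Keeping the two variants as separate entries means the small one can gate
every post-commit run while only the large one feeds performance alerts.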

WDYT ?

Thanks,
Cham


>
> Michal
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: Contributor permission

2019-10-08 Thread Kenneth Knowles
Excellent to hear you are ready to contribute! I've added you to the JIRA
role "Contributor" so you can be assigned tickets.

Kenn

On Tue, Oct 8, 2019 at 5:03 AM Alexey Romanenko 
wrote:

> As Pablo said, we have several labels for starter tasks.
> And we even have a short link to simplify this search:
> https://s.apache.org/beam-starter-tasks
>
> On 7 Oct 2019, at 21:35, Pablo Estrada  wrote:
>
> We have some labels: easyfix, newbie, beginner, starter. Check them, and
> let us know if they help you find an issue. If not, don't hesitate to ask -
> I can try to look for a couple easier beginner contributions.
>
> Best
> -P.
>
> On Mon, Oct 7, 2019 at 11:54 AM Leonardo Miguel <
> leonardo.mig...@arquivei.com.br> wrote:
>
>> I was going to start with katas.
>> I have interest in the sdk-java and sdk-py, so maybe working on
>> examples-java and examples-py would be a start.
>>
>> On Mon, Oct 7, 2019 at 15:30, Rui Wang wrote:
>>
>>> I am not aware of starter tags. Which component/project are you
>>> interested in?
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Oct 7, 2019 at 11:02 AM Leonardo Miguel <
>>> leonardo.mig...@arquivei.com.br> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been working with Beam for three years now and I would like to
>>>> start contributing.
>>>> Could someone please give me permission to assign issues to myself?
>>>> My jira username is leonardo.miguel
>>>>
>>>> Do you have any labels for beginners that I could start working on?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> []s
>>>>
>>>> Leonardo Alves Miguel
>>>> Data Engineer
>>>> (16) 3509-5515 | www.arquivei.com.br
>>>>
>>>
>>
>> --
>> []s
>>
>> Leonardo Alves Miguel
>> Data Engineer
>> (16) 3509-5515 | www.arquivei.com.br
>>
>
>


Re: Received status code 500 from server: Internal Server Error

2019-10-08 Thread jincheng sun
Thanks for your quick response, Łukasz. I think your approach works locally.
Thank you!
I sent this to the dev ML not because of my local env, but for all the
Beam contributors, because they would hit the PreCommit failure when opening
a PR. Anyway, thanks for the kind local workaround.
I have just triggered the PreCommit in the PR again, and it works now.

Best,
Jincheng

On Tue, Oct 8, 2019 at 11:01 PM, Kyle Weaver wrote:

> Is there a way we could add a backup server url to our configuration to
> use if the sonatype server is down?
>
> On Tue, Oct 8, 2019 at 4:56 AM Łukasz Gajowy 
> wrote:
>
>> of course I meant:
>>
>> maven { url "https://oss.sonatype.org/content/repositories/staging/" }
>> => maven { url "https://repo1.maven.org/maven2/" }
>>
>> :)
>>
>> On Tue, Oct 8, 2019 at 13:53, Łukasz Gajowy wrote:
>>
>>> It seems that oss.sonatype was down and this prevented us from
>>> downloading required resources from it. I can't use it either (I was
>>> getting the same error). If running gradle in offline mode does not help
>>> (--offline flag), another temporary solution is to replace the URL in
>>> Repository.groovy with e.g. a Maven Central one when working locally:
>>>
>>> maven { url "https://repo1.maven.org/maven2/" } => maven { url
>>> "https://oss.sonatype.org/content/repositories/staging/" }
>>>
>>> When sonatype is up again you should be fine without this hack.
>>>
>>> I hope this helps.
>>>
>>> Łukasz
>>>
>>> On Tue, Oct 8, 2019 at 11:55, jincheng sun wrote:
>>>
>>>> Hi all,
>>>> I got a 500 error when running the PreCommit. You can run the following
>>>> command to see the details:
>>>>
>>>> ./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner
>>>>
>>>> >>
>>>> Task :model:pipeline:compileJava FAILED
>>>> FAILURE: Build failed with an exception.
>>>> * What went wrong:
>>>> Execution failed for task ':model:pipeline:compileJava'.
>>>> > Could not resolve all files for configuration
>>>> ':model:pipeline:errorprone'.
>>>>    > Could not resolve
>>>> com.google.errorprone:error_prone_core:latest.release.
>>>>  Required by:
>>>>  project :model:pipeline
>>>>   > Failed to list versions for
>>>> com.google.errorprone:error_prone_core.
>>>>  > Unable to load Maven meta-data from
>>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
>>>> .
>>>> > Could not HEAD '
>>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
>>>> Received status code 500 from server: Internal Server Error
>>>>
>>>> I would appreciate it if anyone could help solve the server problem!
>>>>
>>>> Best,
>>>> Jincheng






Re: Support for LZO compression.

2019-10-08 Thread Luke Cwik
Sorry about that, gave the wrong information.

The GPL 1, 2, and 3 all fall under category X licenses [1].
"Apache projects may not distribute Category X licensed components, be it
in source or binary form; and be it in ASF source code or convenience
binaries. As with the previous question on platforms, the component can be
relied on if the component's license terms do not affect the Apache
product's licensing. For example, using a GPL'ed tool during the build is
OK, however including GPL'ed source code is not."

But if this is an optional component which does not significantly prevent
the majority of users from using the product, then it will be OK [2]. The
relevant bit is:
"Apache projects can rely on components under prohibited licenses if the
component is only needed for optional features. When doing so, a project
shall provide the user with instructions on how to obtain and install the
non-included work. Optional means that the component is not required for
standard use of the product or for the product to achieve a desirable level
of quality. The question to ask yourself in this situation is:"

So in this case I believe we can include LZO support as long as we mark it
as optional.

1: https://www.apache.org/legal/resolved.html#category-x
2: https://www.apache.org/legal/resolved.html#optional



On Tue, Oct 8, 2019 at 3:51 PM Luke Cwik  wrote:

> Which GPL version?
>
> The Apache License 2.0 is compatible with GPL 3[1]
>
> 1: https://www.apache.org/foundation/license-faq.html#GPL
>
> On Tue, Oct 8, 2019 at 2:10 PM Sameer Abhyankar 
> wrote:
>
>> Hi All,
>>
>> We were looking to add an IO that would read LZO compressed binaries from
>> a supported filesystem. However, based on our research, LZO is shipped
>> under the GPL license.
>>
>> Would the licensing issue make it unlikely for this to be accepted as a
>> contribution into the Beam SDK? Are there options for adding support for
>> LZO into the Beam SDK so we don't run into licensing issues?
>>
>> Thanks in advance for the help with this!!
>>
>> Sameer
>>
>


Re: Support for LZO compression.

2019-10-08 Thread Luke Cwik
Which GPL version?

The Apache License 2.0 is compatible with GPL 3[1]

1: https://www.apache.org/foundation/license-faq.html#GPL

On Tue, Oct 8, 2019 at 2:10 PM Sameer Abhyankar 
wrote:

> Hi All,
>
> We were looking to add an IO that would read LZO compressed binaries from
> a supported filesystem. However, based on our research, LZO is shipped
> under the GPL license.
>
> Would the licensing issue make it unlikely for this to be accepted as a
> contribution into the Beam SDK? Are there options for adding support for
> LZO into the Beam SDK so we don't run into licensing issues?
>
> Thanks in advance for the help with this!!
>
> Sameer
>


Re: Applicant from Outreachy

2019-10-08 Thread Kenneth Knowles
Hi Xianqiong!

I have added you to the "Contributors" role, so you can be assigned JIRA
tickets.

Kenn

On Sun, Oct 6, 2019 at 6:59 PM Xianqiong Wu  wrote:

>
> Dear Beam Team,
>
>
> Hope you’re all having a great day!
>
>
> My name is Xianqiong, and I’m an applicant from Outreachy.
>
> I’m super excited to join your team and hope to make a contribution to
> BeamSQL.
>
>
> And my JIRA ID is: esswxq.
>
>
>
> All the best,
>
>
> Xianqiong
>


Re: Received status code 500 from server: Internal Server Error

2019-10-08 Thread Kyle Weaver
Is there a way we could add a backup server url to our configuration to use
if the sonatype server is down?
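
One way to picture such a fallback is to probe the candidate repositories
in order and use the first one that answers. The Python sketch below only
illustrates that idea; it is not a Gradle feature or an existing Beam build
setting. The first_reachable helper and the ordering of the list are
assumptions, and only the two URLs come from this thread. The probe is
passed in as a function so the selection logic can be exercised without
network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# The two repository URLs mentioned in this thread; the order is an assumption.
REPOSITORIES = [
    "https://oss.sonatype.org/content/repositories/staging/",
    "https://repo1.maven.org/maven2/",
]


def http_head_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to `url` succeeds with a 2xx/3xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTPError (raised for e.g. status 500) is a subclass of URLError.
        return False


def first_reachable(urls: Iterable[str],
                    probe: Callable[[str], bool] = http_head_ok) -> Optional[str]:
    """Return the first URL for which `probe` reports success, else None."""
    for url in urls:
        if probe(url):
            return url
    return None
```

For example, first_reachable(REPOSITORIES, probe=lambda u: "sonatype" not in u)
simulates the outage described in this thread and returns the Maven Central
URL.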

On Tue, Oct 8, 2019 at 4:56 AM Łukasz Gajowy 
wrote:

> of course I meant:
>
> maven { url "https://oss.sonatype.org/content/repositories/staging/" }
> => maven { url "https://repo1.maven.org/maven2/" }
>
> :)
>
> On Tue, Oct 8, 2019 at 13:53, Łukasz Gajowy wrote:
>
>> It seems that oss.sonatype was down and this prevented us from
>> downloading required resources from it. I can't use it either (I was
>> getting the same error). If running gradle in offline mode does not help
>> (--offline flag), another temporary solution is to replace the URL in
>> Repository.groovy with e.g. a Maven Central one when working locally:
>>
>> maven { url "https://repo1.maven.org/maven2/" } => maven { url
>> "https://oss.sonatype.org/content/repositories/staging/" }
>>
>> When sonatype is up again you should be fine without this hack.
>>
>> I hope this helps.
>>
>> Łukasz
>>
>> On Tue, Oct 8, 2019 at 11:55, jincheng sun wrote:
>>
>>> Hi all,
>>> I got a 500 error when running the PreCommit. You can run the following
>>> command to see the details:
>>>
>>> ./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner
>>>
>>> >>
>>> Task :model:pipeline:compileJava FAILED
>>> FAILURE: Build failed with an exception.
>>> * What went wrong:
>>> Execution failed for task ':model:pipeline:compileJava'.
>>> > Could not resolve all files for configuration
>>> ':model:pipeline:errorprone'.
>>>    > Could not resolve
>>> com.google.errorprone:error_prone_core:latest.release.
>>>  Required by:
>>>  project :model:pipeline
>>>   > Failed to list versions for
>>> com.google.errorprone:error_prone_core.
>>>  > Unable to load Maven meta-data from
>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
>>> .
>>> > Could not HEAD '
>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
>>> Received status code 500 from server: Internal Server Error
>>>
>>> I would appreciate it if anyone could help solve the server problem!
>>>
>>> Best,
>>> Jincheng
>>>
>>>
>>>
>>>


Resizing Beam IOITs

2019-10-08 Thread Michał Walenia
Hi all,
I'm working on resizing IO integration tests in Beam and I'd like to ask
for the community's opinion.

Right now each IO integration test has a set of four predetermined sizes
(1000, 100k, 1M and 100M elements).
For every size there is a precalculated hash for read correctness checking.
As it is now, measuring throughput in an IOIT is very costly - accessing
memory for each PCollection element increases the runtime of the test
manyfold, which changes the runtime measurements.

My proposed improvements change the test sizes, add dataset size reporting
to metrics (throughput will be possible to calculate at dashboard level)
and change the way test parameters are passed.
The changes are in a PR here .
Tests were resized to about 1GB each.
Test configurations would be set by one string parameter in pipeline
options (e.g. "testConfigName=XML_1GB" instead of
"numberOfRecords=100").
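
As a rough illustration of the last two points, the Python sketch below
looks up a test configuration by name and derives throughput from the
reported dataset size and a measured runtime, which is the calculation a
dashboard could do. The TEST_CONFIGS table, its field names and the record
count are hypothetical placeholders; only the name XML_1GB and the
approximate 1GB size come from the proposal above.

```python
# Hypothetical table of named test configurations. Only the name "XML_1GB"
# and its approximate size come from the proposal; the record count is a
# made-up placeholder.
TEST_CONFIGS = {
    "XML_1GB": {"number_of_records": 12_000_000, "dataset_bytes": 10 ** 9},
}


def throughput_bytes_per_second(config_name: str, runtime_seconds: float) -> float:
    """Throughput computed from the reported dataset size and the runtime."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    config = TEST_CONFIGS[config_name]
    return config["dataset_bytes"] / runtime_seconds


if __name__ == "__main__":
    # 10**9 bytes processed in 250 s is 4,000,000 bytes per second.
    print(throughput_bytes_per_second("XML_1GB", 250.0))
```

Because the dataset size is a property of the named configuration, the test
itself does not need to touch each element to measure it.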

What in general do you think about this approach? Do you think that 1GB
test datasets are reasonable?
Thanks,

Michal

-- 

Michał Walenia
Polidea  | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.wale...@polidea.com

Unique Tech
Check out our projects! 


Re: Contributor permission

2019-10-08 Thread Alexey Romanenko
As Pablo said, we have several labels for starter tasks. 
And we even have a short link to simplify this search: 
https://s.apache.org/beam-starter-tasks

> On 7 Oct 2019, at 21:35, Pablo Estrada  wrote:
> 
> We have some labels: easyfix, newbie, beginner, starter. Check them, and let 
> us know if they help you find an issue. If not, don't hesitate to ask - I can 
> try to look for a couple easier beginner contributions.
> 
> Best
> -P.
> 
> On Mon, Oct 7, 2019 at 11:54 AM Leonardo Miguel
> <leonardo.mig...@arquivei.com.br> wrote:
> I was going to start with katas.
> I have interest in the sdk-java and sdk-py, so maybe working on examples-java 
> and examples-py would be a start.
> 
> On Mon, Oct 7, 2019 at 15:30, Rui Wang wrote:
> I am not aware of starter tags. Which component/project are you interested in?
> 
> 
> -Rui
> 
> On Mon, Oct 7, 2019 at 11:02 AM Leonardo Miguel
> <leonardo.mig...@arquivei.com.br> wrote:
> Hi,
> 
> I've been working with Beam for three years now and I would like to start 
> contributing.
> Could someone please give me permission to assign issues to myself?
> My jira username is leonardo.miguel
> 
> Do you have any labels for beginners that I could start working on?
> 
> Thanks!
> 
> -- 
> []s
> 
> Leonardo Alves Miguel
> Data Engineer
> (16) 3509-5515 | www.arquivei.com.br 
> 
> -- 
> []s
> 
> Leonardo Alves Miguel
> Data Engineer
> (16) 3509-5515 | www.arquivei.com.br 
> 


Re: Received status code 500 from server: Internal Server Error

2019-10-08 Thread Łukasz Gajowy
of course I meant:

maven { url "https://oss.sonatype.org/content/repositories/staging/" }  =>
maven { url "https://repo1.maven.org/maven2/" }

:)

On Tue, Oct 8, 2019 at 13:53, Łukasz Gajowy wrote:

> It seems that oss.sonatype was down and this prevented us from downloading
> required resources from it. I can't use it either (I was getting the same
> error). If running gradle in offline mode does not help (--offline flag)
> another temporary solution is to replace the URL in Repository.groovy
> with e.g. a Maven Central one when working locally:
>
> maven { url "https://repo1.maven.org/maven2/" } => maven { url
> "https://oss.sonatype.org/content/repositories/staging/" }
>
> When sonatype is up again you should be fine without this hack.
>
> I hope this helps.
>
> Łukasz
>
> On Tue, Oct 8, 2019 at 11:55, jincheng sun wrote:
>
>> Hi all,
>> I got a 500 error when running the PreCommit. You can run the following
>> command to see the details:
>>
>> ./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner
>>
>> >>
>> Task :model:pipeline:compileJava FAILED
>> FAILURE: Build failed with an exception.
>> * What went wrong:
>> Execution failed for task ':model:pipeline:compileJava'.
>> > Could not resolve all files for configuration
>> ':model:pipeline:errorprone'.
>>    > Could not resolve
>> com.google.errorprone:error_prone_core:latest.release.
>>  Required by:
>>  project :model:pipeline
>>   > Failed to list versions for
>> com.google.errorprone:error_prone_core.
>>  > Unable to load Maven meta-data from
>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
>> .
>> > Could not HEAD '
>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
>> Received status code 500 from server: Internal Server Error
>>
>> I would appreciate it if anyone could help solve the server problem!
>>
>> Best,
>> Jincheng
>>
>>
>>
>>


Re: Received status code 500 from server: Internal Server Error

2019-10-08 Thread Łukasz Gajowy
It seems that oss.sonatype was down and this prevented us from downloading
required resources from it. I can't use it either (I was getting the same
error). If running gradle in offline mode does not help (--offline flag)
another temporary solution is to replace the URL in Repository.groovy
with e.g. a Maven Central one when working locally:

maven { url "https://repo1.maven.org/maven2/" } => maven { url
"https://oss.sonatype.org/content/repositories/staging/" }

When sonatype is up again you should be fine without this hack.

I hope this helps.

Łukasz

On Tue, Oct 8, 2019 at 11:55, jincheng sun wrote:

> Hi all,
> I got a 500 error when running the PreCommit. You can run the following
> command to see the details:
>
> ./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner
>
> >>
> Task :model:pipeline:compileJava FAILED
> FAILURE: Build failed with an exception.
> * What went wrong:
> Execution failed for task ':model:pipeline:compileJava'.
> > Could not resolve all files for configuration
> ':model:pipeline:errorprone'.
>    > Could not resolve
> com.google.errorprone:error_prone_core:latest.release.
>  Required by:
>  project :model:pipeline
>   > Failed to list versions for com.google.errorprone:error_prone_core.
>  > Unable to load Maven meta-data from
> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
> .
> > Could not HEAD '
> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
> Received status code 500 from server: Internal Server Error
>
> I would appreciate it if anyone could help solve the server problem!
>
> Best,
> Jincheng
>
>
>
>


Re: Beam 2.15.0 SparkRunner issues

2019-10-08 Thread Tim Robertson
I'm sorry for not replying. We are super busy trying to prepare data to
release.

An update:
- We were using G1GC and, through Slack, were advised against it. Dropping
it fixed the OOM error we saw, and all our 2.15.0 jobs completed.

When we have time (after 3 weeks) I'll try and isolate a test case with the
reshuffle example and parallelism.

Thanks,
Tim


On Thu, Oct 3, 2019 at 1:21 PM Jan Lukavský  wrote:

> Hi Tim,
>
> can you please elaborate more about some parts?
>
> 1) What actually happens in your case? What are the specific settings you
> use?
>
> 3) Can you share stacktrace? Is it always the same, or does it change?
>
> The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle,
> which seems to make a little sense to me regarding the logic you
> described. Do you use Reshuffle transform or does it expand from some
> other transform?
>
> Jan
>
> On 10/3/19 9:24 AM, Tim Robertson wrote:
> > Hi all,
> >
> > We haven't dug enough into this to know where to log issues, but I'll
> > start by sharing here.
> >
> > After upgrading from Beam 2.10.0 to 2.15.0 we see issues on
> > SparkRunner - we suspect all of this is related.
> >
> > 1. spark.default.parallelism is not respected
> >
> > 2. File writing (Avro) with dynamic destinations (grouped into folders
> > by a field name) consistently fails with
> > org.apache.beam.sdk.util.UserCodeException:
> > java.nio.file.FileAlreadyExistsException: Unable to rename resource
> > hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
> > to
> > hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-0-of-1.avro
> > as destination already exists and couldn't be deleted.
> >
> > 3. GBK operations that run over 500M small records consistently fail
> > with OOM. We tried different configs with 48GB, 60GB, 80GB executor
> > memory
> >
> > Our pipelines are batch, simple transformations with either an
> > HBaseSnapshot to Avro files or a merge of records in Avro (the GBK
> > issue) pushed to ElasticSearch (it fails upstream of the
> > ElasticsearchIO in the GBK stage).
> >
> > We notice operations that were mapToPair in 2.10.0 become repartition
> > operations (mapToPair at GroupCombineFunctions.java:68 becomes
> > repartition at GroupCombineFunctions.java:202), which might be related
> > to this and looks surprising.
> >
> > I'll report more as we learn. If anyone has any immediate ideas based
> > on their commits or reviews, or if you wish any tests run on other Beam
> > versions, please say.
> >
> > Thanks,
> > Tim
> >
> >
> >
>


Received status code 500 from server: Internal Server Error

2019-10-08 Thread jincheng sun
Hi all,
I got a 500 error when running the PreCommit. You can run the following
command to see the details:

./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner

>>
Task :model:pipeline:compileJava FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':model:pipeline:compileJava'.
> Could not resolve all files for configuration ':model:pipeline:errorprone'.
   > Could not resolve com.google.errorprone:error_prone_core:latest.release.
     Required by:
         project :model:pipeline
      > Failed to list versions for com.google.errorprone:error_prone_core.
         > Unable to load Maven meta-data from https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml.
            > Could not HEAD 'https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'. Received status code 500 from server: Internal Server Error

I would appreciate it if anyone could help solve the server problem!

Best,
Jincheng