Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Chamikara Jayalath
On Wed, Mar 20, 2019 at 7:37 PM Valentyn Tymofieiev 
wrote:

> Pablo, according to Juta's analysis (1.c in the document) and also
> https://issuetracker.google.com/issues/129006689, I think BQ confuses
> BYTES and STRING when schema is not specified... This seems to me like a BQ
> bug, so for Beam this means that we either have to wait until BQ fixes it,
> or work around it. If we work around it, we can ask users to always supply
> schema if their table has BYTES data (temporary limitation), or try to pull
> schema from BQ before (every?) write operation.
>
> Cham, according to BQ documentation, BQ *can* auto-detect schema when
> populating new tables using a data source, for example a json file with
> records : https://cloud.google.com/bigquery/docs/schema-detect.
>

Ah, we don't support that AFAIK, so currently we require users to provide a
schema to create tables. But good point, in case we ever want to support
that feature.


>
> On Wed, Mar 20, 2019 at 7:15 PM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada  wrote:
>>
>>> That sounds reasonable to me, Valentyn.
>>>
>>> Regarding (3), when the table already exists, it's not necessary to get
>>> the schema. BQ is smart enough to load everything in appropriately. (as
>>> long as bytes fields are base64 encoded)
>>>
>>> The problem is when the table does not exist and the user does not
>>> provide a schema. In that case, there is no simple way of auto-inferring
>>> the schema, as you correctly point out. I think it's reasonable to simply
>>> expect users to provide schemas if their data will have tricky types to infer.
>>> Best
>>> -P.
>>>
>>
>> Is this even an option? I think when the table is not available, users have
>> to provide a schema to create a new table.
>>
>>
>>>
>>>
>>> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Thanks Juta for detailed analysis.

 I reached out to BigQuery team to improve documentation around
 treatment of Bytes and reported the issue that schema autodetection does
 not work  for BYTES
 in GCP issue tracker
 .

 Is this a correct summary of your proposal?

 1. Beam will base64-encode raw bytes before passing them to BQ over the
 REST API. This will be a change in behavior for Python 2 (for good
 reasons).
 2. When reading data from BQ, all fields of type BYTES will be
 base64-decoded.
 3. Beam will send an API call to BigQuery to get table schema, whenever
 schema is not supplied, to work around
 https://issuetracker.google.com/issues/129006689. Does anyone see any
 concerns with this? Is it always possible?

 Thanks,
 Valentyn

 On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax  wrote:

> The Java SDK relies on Jackson to do the encoding.
>
> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>>
>>
>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:
>>
>>> Hi all,
>>>
>>>
>>> I am working on porting beam to python 3 and discovered the
>>> following:
>>>
>>>
>>> Current handling of bytes in bigquery IO:
>>>
>>> When writing bytes to BQ , beam uses
>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>>> expects byte values to be base-64 encoded*.
>>>
>>> However when writing raw bytes they are currently never transformed
>>> to base-64 encoded strings. This results in the following errors:
>>>
>>>-
>>>
>>>When writing b’abc’ in python 2 this results in actually writing
>>>b'i\xb7' (which is the same as base64.b64decode('abc='))
>>>-
>>>
>>>When writing b’abc’ in python 3 this results in “TypeError:
>>>b'abc' is not JSON serializable”
>>>-
>>>
>>>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
>>>codec can't decode byte 0xab in position 0: invalid start byte. NAN, 
>>> INF
>>>and -INF values are not JSON compliant”
>>>-
>>>
>>>When reading bytes from BQ they are currently returned as
>>>base-64 encoded strings rather than the raw bytes.
>>>
>>>
>>> Example code:
>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>>
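For illustration, a minimal Python snippet reproducing the base64 round trip behind the first error above (standard library only; the values are the ones quoted in this thread):

import base64

# Sent as-is, raw b'abc' gets treated as base64 text by BQ, so what lands
# in the table is the decoded value b'i\xb7'.
assert base64.b64decode('abc=') == b'i\xb7'

# What the SDK needs to send instead: the base64-encoded form, which BQ
# then decodes back to the original raw bytes.
encoded = base64.b64encode(b'abc').decode('ascii')  # 'YWJj'
assert base64.b64decode(encoded) == b'abc'
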
>>> There is also another issue when writing base-64 encoded strings to
>>> BQ. When no schema is specified this results in “Invalid schema update.
>>> Field bytes has changed type from BYTES to STRING”.
>>>
>>> This error can be reproduced when uploading a file (directly in the
>>> BQ UI) to a table with bytes and using schema autodetect.
>>>
>>> Suggested solution:
>>>
>>> I suggest to change BigQuery IO to handle the base-64 encoding as
>>> follows to allow 

Re: Python36/37 not installed on Beam2 and Beam12?

2019-03-20 Thread Valentyn Tymofieiev
I asked them yesterday on Slack and commented on the existing issue
https://issues.apache.org/jira/browse/INFRA-17335, but didn't receive a
response. We can try opening another infra ticket. Mark, perhaps you can
quote/+1 my message from yesterday in their Slack channel :)?

On Wed, Mar 20, 2019 at 6:23 PM Yifan Zou  wrote:

> You could try to ping them in the slack channel
> https://the-asf.slack.com/messages/  if it is really urgent.
>
> On Wed, Mar 20, 2019 at 5:29 PM Mark Liu  wrote:
>
>> Hi,
>>
>> I saw occasional py36 tox test failure in beam_PreCommit_Python
>> and beam_Release_NightlySnapshot in cron job
>>  as well
>> as PR triggered job
>> . The
>> error is simple:
>>
>> ERROR: InterpreterNotFound: python3.6
>>
>> Turns out those failures only happened in Beam2 and Beam12. From console
>> log of inventory jobs (beam2
>>  and
>> beam12 ),
>> I found python3.6 and python3.7 interpreters are missing. This makes
>> beam_PreCommit_Python_Cron
>>  flaky
>> recently and may fail any python build that runs on those two nodes.
>>
>> Infra team helped install Python3 on our Jenkins before, but they were
>> slow to respond on JIRA. What's the best way to get the Infra team
>> involved in this problem?
>>
>> Thanks,
>> Mark
>>
>


Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Valentyn Tymofieiev
Pablo, according to Juta's analysis (1.c in the document) and also
https://issuetracker.google.com/issues/129006689, I think BQ confuses BYTES
and STRING when schema is not specified... This seems to me like a BQ bug,
so for Beam this means that we either have to wait until BQ fixes it, or
work around it. If we work around it, we can ask users to always supply
schema if their table has BYTES data (temporary limitation), or try to pull
schema from BQ before (every?) write operation.

Cham, according to BQ documentation, BQ *can* auto-detect schema when
populating new tables using a data source, for example a json file with
records : https://cloud.google.com/bigquery/docs/schema-detect.

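Until BQ fixes that, a user-side stopgap is to always pass an explicit schema on the write so the autodetect path is never hit. A minimal Beam Python sketch (the project, dataset, table, and field names are illustrative, and the bytes value is base64-encoded by hand, as described in this thread):

import base64
import apache_beam as beam

# Illustrative only: supply the schema explicitly so BQ never has to
# autodetect the BYTES column, and base64-encode the value by hand until
# BigQuery IO does it for us.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'id': 1,
                        'payload': base64.b64encode(b'\xab').decode('ascii')}])
        | beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',
            schema='id:INTEGER,payload:BYTES',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
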
On Wed, Mar 20, 2019 at 7:15 PM Chamikara Jayalath 
wrote:

>
>
> On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada  wrote:
>
>> That sounds reasonable to me, Valentyn.
>>
>> Regarding (3), when the table already exists, it's not necessary to get
>> the schema. BQ is smart enough to load everything in appropriately. (as
>> long as bytes fields are base64 encoded)
>>
>> The problem is when the table does not exist and the user does not
>> provide a schema. In that case, there is no simple way of auto-inferring
>> the schema, as you correctly point out. I think it's reasonable to simply
>> expect users to provide schemas if their data will have tricky types to infer.
>> Best
>> -P.
>>
>
> Is this even an option? I think when the table is not available, users have to
> provide a schema to create a new table.
>
>
>>
>>
>> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Thanks Juta for detailed analysis.
>>>
>>> I reached out to BigQuery team to improve documentation around treatment
>>> of Bytes and reported the issue that schema autodetection does not work
>>>  for BYTES in GCP
>>> issue tracker .
>>>
>>> Is this a correct summary of your proposal?
>>>
>>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>>> REST API. This will be a change in behavior for Python 2 (for good reasons).
>>> 2. When reading data from BQ, all fields of type BYTES will be
>>> base64-decoded.
>>> 3. Beam will send an API call to BigQuery to get table schema, whenever
>>> schema is not supplied, to work around
>>> https://issuetracker.google.com/issues/129006689. Does anyone see any
>>> concerns with this? Is it always possible?
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax  wrote:
>>>
 The Java SDK relies on Jackson to do the encoding.

 On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <
 chamik...@google.com> wrote:

>
>
> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:
>
>> Hi all,
>>
>>
>> I am working on porting beam to python 3 and discovered the following:
>>
>>
>> Current handling of bytes in bigquery IO:
>>
>> When writing bytes to BQ , beam uses
>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>> expects byte values to be base-64 encoded*.
>>
>> However when writing raw bytes they are currently never transformed
>> to base-64 encoded strings. This results in the following errors:
>>
>>-
>>
>>When writing b’abc’ in python 2 this results in actually writing
>>b'i\xb7' (which is the same as base64.b64decode('abc='))
>>-
>>
>>When writing b’abc’ in python 3 this results in “TypeError:
>>b'abc' is not JSON serializable”
>>-
>>
>>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
>>codec can't decode byte 0xab in position 0: invalid start byte. NAN, 
>> INF
>>and -INF values are not JSON compliant”
>>-
>>
>>When reading bytes from BQ they are currently returned as base-64
>>encoded strings rather than the raw bytes.
>>
>>
>> Example code:
>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>
>> There is also another issue when writing base-64 encoded strings to
>> BQ. When no schema is specified this results in “Invalid schema update.
>> Field bytes has changed type from BYTES to STRING”.
>>
>> This error can be reproduced when uploading a file (directly in the
>> BQ UI) to a table with bytes and using schema autodetect.
>>
>> Suggested solution:
>>
>> I suggest to change BigQuery IO to handle the base-64 encoding as
>> follows to allow the user to read and write raw bytes in BQ
>>
>> Writing data:
>>
>>-
>>
>>When a new table is created we use the provided schema to detect
>>bytes and handle the base-64 encoding accordingly
>>-
>>
>>When data is written to an existing table we use the API to get
>>the schema of the table and handle the base-64 

Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Chamikara Jayalath
On Wed, Mar 20, 2019 at 6:30 PM Pablo Estrada  wrote:

> That sounds reasonable to me, Valentyn.
>
> Regarding (3), when the table already exists, it's not necessary to get
> the schema. BQ is smart enough to load everything in appropriately. (as
> long as bytes fields are base64 encoded)
>
> The problem is when the table does not exist and the user does not provide
> a schema. In that case, there is no simple way of auto-inferring the
> schema, as you correctly point out. I think it's reasonable to simply
> expect users to provide schemas if their data will have tricky types to infer.
> Best
> -P.
>

Is this even an option? I think when the table is not available, users have to
provide a schema to create a new table.


>
>
> On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev 
> wrote:
>
>> Thanks Juta for detailed analysis.
>>
>> I reached out to BigQuery team to improve documentation around treatment
>> of Bytes and reported the issue that schema autodetection does not work
>>  for BYTES in GCP
>> issue tracker .
>>
>> Is this a correct summary of your proposal?
>>
>> 1. Beam will base64-encode raw bytes before passing them to BQ over the
>> REST API. This will be a change in behavior for Python 2 (for good reasons).
>> 2. When reading data from BQ, all fields of type BYTES will be
>> base64-decoded.
>> 3. Beam will send an API call to BigQuery to get table schema, whenever
>> schema is not supplied, to work around
>> https://issuetracker.google.com/issues/129006689. Does anyone see any
>> concerns with this? Is it always possible?
>>
>> Thanks,
>> Valentyn
>>
>> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax  wrote:
>>
>>> The Java SDK relies on Jackson to do the encoding.
>>>
>>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>


 On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:

> Hi all,
>
>
> I am working on porting beam to python 3 and discovered the following:
>
>
> Current handling of bytes in bigquery IO:
>
> When writing bytes to BQ , beam uses
> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
> expects byte values to be base-64 encoded*.
>
> However when writing raw bytes they are currently never transformed to
> base-64 encoded strings. This results in the following errors:
>
>-
>
>When writing b’abc’ in python 2 this results in actually writing
>b'i\xb7' (which is the same as base64.b64decode('abc='))
>-
>
>When writing b’abc’ in python 3 this results in “TypeError: b'abc'
>is not JSON serializable”
>-
>
>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
>codec can't decode byte 0xab in position 0: invalid start byte. NAN, 
> INF
>and -INF values are not JSON compliant”
>-
>
>When reading bytes from BQ they are currently returned as base-64
>encoded strings rather than the raw bytes.
>
>
> Example code:
> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>
> There is also another issue when writing base-64 encoded strings to BQ.
> When no schema is specified this results in “Invalid schema update. Field
> bytes has changed type from BYTES to STRING”.
>
> This error can be reproduced when uploading a file (directly in the BQ
> UI) to a table with bytes and using schema autodetect.
>
> Suggested solution:
>
> I suggest to change BigQuery IO to handle the base-64 encoding as
> follows to allow the user to read and write raw bytes in BQ
>
> Writing data:
>
>-
>
>When a new table is created we use the provided schema to detect
>bytes and handle the base-64 encoding accordingly
>-
>
>When data is written to an existing table we use the API to get
>the schema of the table and handle the base-64 encoding accordingly. We
>also pass the schema as argument to avoid the error from schema 
> autodetect.
>
> Reading data:
>
>-
>
>When reading data we also request the schema and handle the
>base-64 decoding accordingly to return raw bytes
>
>
> What are your thoughts on this?
>

 Thanks for the update. More context here:
 https://issues.apache.org/jira/browse/BEAM-6769

 Suggested solution sounds good to me. BTW do you know how Java SDK
 handles bytes type ? I believe we write JSON files and execute load jobs
 there as well (when method is FILE_LOADS).

 Thanks,
 Cham


>
> *I could not find this in the documentation of the API or in the
> documentation of BigQuery itself which also expects base-64 encoded 
> values.
> I discovered 

Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Pablo Estrada
That sounds reasonable to me, Valentyn.

Regarding (3), when the table already exists, it's not necessary to get the
schema. BQ is smart enough to load everything in appropriately. (as long as
bytes fields are base64 encoded)

The problem is when the table does not exist and the user does not provide
a schema. In that case, there is no simple way of auto-inferring the
schema, as you correctly point out. I think it's reasonable to simply
expect users to provide schemas if their data will have tricky types to infer.
Best
-P.


On Wed, Mar 20, 2019 at 3:44 PM Valentyn Tymofieiev 
wrote:

> Thanks Juta for detailed analysis.
>
> I reached out to BigQuery team to improve documentation around treatment
> of Bytes and reported the issue that schema autodetection does not work
>  for BYTES in GCP issue
> tracker .
>
> Is this a correct summary of your proposal?
>
> 1. Beam will base64-encode raw bytes before passing them to BQ over the
> REST API. This will be a change in behavior for Python 2 (for good reasons).
> 2. When reading data from BQ, all fields of type BYTES will be
> base64-decoded.
> 3. Beam will send an API call to BigQuery to get table schema, whenever
> schema is not supplied, to work around
> https://issuetracker.google.com/issues/129006689. Does anyone see any
> concerns with this? Is it always possible?
>
> Thanks,
> Valentyn
>
> On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax  wrote:
>
>> The Java SDK relies on Jackson to do the encoding.
>>
>> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:
>>>
 Hi all,


 I am working on porting beam to python 3 and discovered the following:


 Current handling of bytes in bigquery IO:

 When writing bytes to BQ , beam uses
 https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
 expects byte values to be base-64 encoded*.

 However when writing raw bytes they are currently never transformed to
 base-64 encoded strings. This results in the following errors:

-

When writing b’abc’ in python 2 this results in actually writing
 b'i\xb7' (which is the same as base64.b64decode('abc='))
-

When writing b’abc’ in python 3 this results in “TypeError: b'abc'
is not JSON serializable”
-

When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
codec can't decode byte 0xab in position 0: invalid start byte. NAN, INF
and -INF values are not JSON compliant”
-

When reading bytes from BQ they are currently returned as base-64
 encoded strings rather than the raw bytes.


 Example code:
 https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing

 There is also another issue when writing base-64 encoded strings to BQ.
 When no schema is specified this results in “Invalid schema update. Field
 bytes has changed type from BYTES to STRING”.

 This error can be reproduced when uploading a file (directly in the BQ
 UI) to a table with bytes and using schema autodetect.

 Suggested solution:

 I suggest to change BigQuery IO to handle the base-64 encoding as
 follows to allow the user to read and write raw bytes in BQ

 Writing data:

-

When a new table is created we use the provided schema to detect
bytes and handle the base-64 encoding accordingly
-

When data is written to an existing table we use the API to get the
schema of the table and handle the base-64 encoding accordingly. We also
pass the schema as argument to avoid the error from schema autodetect.

 Reading data:

-

When reading data we also request the schema and handle the base-64
decoding accordingly to return raw bytes


 What are your thoughts on this?

>>>
>>> Thanks for the update. More context here:
>>> https://issues.apache.org/jira/browse/BEAM-6769
>>>
>>> Suggested solution sounds good to me. BTW do you know how Java SDK
>>> handles bytes type ? I believe we write JSON files and execute load jobs
>>> there as well (when method is FILE_LOADS).
>>>
>>> Thanks,
>>> Cham
>>>
>>>

 *I could not find this in the documentation of the API or in the
 documentation of BigQuery itself which also expects base-64 encoded values.
 I discovered this when uploading a file to BQ UI and getting an error:
 "Could not decode base64 string to bytes."


 --

 [image: https://ml6.eu] 

 * Juta Staes*
 ML6 Gent
 

Re: Python36/37 not installed on Beam2 and Beam12?

2019-03-20 Thread Yifan Zou
You could try to ping them in the slack channel
https://the-asf.slack.com/messages/  if it is really urgent.

On Wed, Mar 20, 2019 at 5:29 PM Mark Liu  wrote:

> Hi,
>
> I saw occasional py36 tox test failure in beam_PreCommit_Python
> and beam_Release_NightlySnapshot in cron job
>  as well
> as PR triggered job
> . The
> error is simple:
>
> ERROR: InterpreterNotFound: python3.6
>
> Turns out those failures only happened in Beam2 and Beam12. From console
> log of inventory jobs (beam2
>  and
> beam12 ),
> I found python3.6 and python3.7 interpreters are missing. This makes
> beam_PreCommit_Python_Cron
>  flaky
> recently and may fail any python build that runs on those two nodes.
>
> Infra team helped install Python3 on our Jenkins before, but they were
> slow to respond on JIRA. What's the best way to get the Infra team
> involved in this problem?
>
> Thanks,
> Mark
>


Re: Python36/37 not installed on Beam2 and Beam12?

2019-03-20 Thread Ahmet Altay
I believe this is https://issues.apache.org/jira/browse/BEAM-6863

Asking questions on the infra channel on slack worked well for me before.

On Wed, Mar 20, 2019 at 5:29 PM Mark Liu  wrote:

> Hi,
>
> I saw occasional py36 tox test failure in beam_PreCommit_Python
> and beam_Release_NightlySnapshot in cron job
>  as well
> as PR triggered job
> . The
> error is simple:
>
> ERROR: InterpreterNotFound: python3.6
>
> Turns out those failures only happened in Beam2 and Beam12. From console
> log of inventory jobs (beam2
>  and
> beam12 ),
> I found python3.6 and python3.7 interpreters are missing. This makes
> beam_PreCommit_Python_Cron
>  flaky
> recently and may fail any python build that runs on those two nodes.
>
> Infra team helped install Python3 on our Jenkins before, but they were
> slow to respond on JIRA. What's the best way to get the Infra team
> involved in this problem?
>
> Thanks,
> Mark
>


Python36/37 not installed on Beam2 and Beam12?

2019-03-20 Thread Mark Liu
Hi,

I saw occasional py36 tox test failure in beam_PreCommit_Python
and beam_Release_NightlySnapshot in cron job
 as well as PR
triggered job
. The
error is simple:

ERROR: InterpreterNotFound: python3.6

Turns out those failures only happened in Beam2 and Beam12. From console
log of inventory jobs (beam2
 and beam12
), I found
python3.6 and python3.7 interpreters are missing. This makes
beam_PreCommit_Python_Cron
 flaky recently
and may fail any python build that runs on those two nodes.

Infra team helped install Python3 on our Jenkins before, but they were slow
to respond on JIRA. What's the best way to get the Infra team involved
in this problem?

Thanks,
Mark


Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Valentyn Tymofieiev
Thanks Juta for detailed analysis.

I reached out to BigQuery team to improve documentation around treatment of
Bytes and reported the issue that schema autodetection does not work
 for BYTES in GCP issue
tracker .

Is this a correct summary of your proposal?

1. Beam will base64-encode raw bytes before passing them to BQ over the
REST API. This will be a change in behavior for Python 2 (for good reasons).
2. When reading data from BQ, all fields of type BYTES will be
base64-decoded.
3. Beam will send an API call to BigQuery to get table schema, whenever
schema is not supplied, to work around
https://issuetracker.google.com/issues/129006689. Does anyone see any
concerns with this? Is it always possible?

Thanks,
Valentyn

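A rough Python sketch of what (1) and (2) could look like (the helper names and schema shape are illustrative, not the actual BigQuery IO code):

import base64

# Illustrative helpers: base64-encode BYTES fields before handing a row to
# the BQ API, and decode them again when reading rows back, driven by the
# table schema.
def encode_bytes_fields(row, schema_fields):
    out = dict(row)
    for field in schema_fields:
        name = field['name']
        if field['type'] == 'BYTES' and isinstance(out.get(name), bytes):
            out[name] = base64.b64encode(out[name]).decode('ascii')
    return out

def decode_bytes_fields(row, schema_fields):
    out = dict(row)
    for field in schema_fields:
        name = field['name']
        if field['type'] == 'BYTES' and out.get(name) is not None:
            out[name] = base64.b64decode(out[name])
    return out

schema = [{'name': 'payload', 'type': 'BYTES'}]
encoded = encode_bytes_fields({'payload': b'\xab'}, schema)  # {'payload': 'qw=='}
assert decode_bytes_fields(encoded, schema) == {'payload': b'\xab'}
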
On Wed, Mar 20, 2019 at 12:45 PM Reuven Lax  wrote:

> The Java SDK relies on Jackson to do the encoding.
>
> On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:
>>
>>> Hi all,
>>>
>>>
>>> I am working on porting beam to python 3 and discovered the following:
>>>
>>>
>>> Current handling of bytes in bigquery IO:
>>>
>>> When writing bytes to BQ , beam uses
>>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>>> expects byte values to be base-64 encoded*.
>>>
>>> However when writing raw bytes they are currently never transformed to
>>> base-64 encoded strings. This results in the following errors:
>>>
>>>-
>>>
>>>When writing b’abc’ in python 2 this results in actually writing
>>>b'i\xb7' (which is the same as base64.b64decode('abc='))
>>>-
>>>
>>>When writing b’abc’ in python 3 this results in “TypeError: b'abc'
>>>is not JSON serializable”
>>>-
>>>
>>>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
>>>codec can't decode byte 0xab in position 0: invalid start byte. NAN, INF
>>>and -INF values are not JSON compliant”
>>>-
>>>
>>>When reading bytes from BQ they are currently returned as base-64
>>>encoded strings rather than the raw bytes.
>>>
>>>
>>> Example code:
>>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>>
>>> There is also another issue when writing base-64 encoded strings to BQ.
>>> When no schema is specified this results in “Invalid schema update. Field
>>> bytes has changed type from BYTES to STRING”.
>>>
>>> This error can be reproduced when uploading a file (directly in the BQ
>>> UI) to a table with bytes and using schema autodetect.
>>>
>>> Suggested solution:
>>>
>>> I suggest to change BigQuery IO to handle the base-64 encoding as
>>> follows to allow the user to read and write raw bytes in BQ
>>>
>>> Writing data:
>>>
>>>-
>>>
>>>When a new table is created we use the provided schema to detect
>>>bytes and handle the base-64 encoding accordingly
>>>-
>>>
>>>When data is written to an existing table we use the API to get the
>>>schema of the table and handle the base-64 encoding accordingly. We also
>>>pass the schema as argument to avoid the error from schema autodetect.
>>>
>>> Reading data:
>>>
>>>-
>>>
>>>When reading data we also request the schema and handle the base-64
>>>decoding accordingly to return raw bytes
>>>
>>>
>>> What are your thoughts on this?
>>>
>>
>> Thanks for the update. More context here:
>> https://issues.apache.org/jira/browse/BEAM-6769
>>
>> Suggested solution sounds good to me. BTW do you know how Java SDK
>> handles bytes type ? I believe we write JSON files and execute load jobs
>> there as well (when method is FILE_LOADS).
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> *I could not find this in the documentation of the API or in the
>>> documentation of BigQuery itself which also expects base-64 encoded values.
>>> I discovered this when uploading a file to BQ UI and getting an error:
>>> "Could not decode base64 string to bytes."
>>>
>>>
>>> --
>>>
>>> [image: https://ml6.eu] 
>>>
>>> * Juta Staes*
>>> ML6 Gent
>>> 
>>>
>>>  DISCLAIMER 
>>> This email and any files transmitted with it are confidential and
>>> intended solely for the use of the individual or entity to whom they are
>>> addressed. If you have received this email in error please notify the
>>> system manager. This message contains confidential information and is
>>> intended only for the individual named. If you are not the named addressee
>>> you should not disseminate, distribute or copy this e-mail. Please notify
>>> the sender immediately by e-mail if you have received this e-mail by
>>> mistake and delete this e-mail from your system. If you are not the
>>> intended recipient you are notified that disclosing, copying, distributing
>>> or taking any 

Re: What quick command to catch common issues before pushing a python PR?

2019-03-20 Thread Pablo Estrada
Fancy : )

On Wed, Mar 20, 2019 at 1:25 AM Robert Bradshaw  wrote:

> I use tox as well. Actually, I use detox and retox (parallel versions
> of tox, easily installable with pip) which can speed things up quite a
> bit.
>
> On Wed, Mar 20, 2019 at 1:33 AM Pablo Estrada  wrote:
> >
> > Correction  - the command is now: tox -e py35-gcp,py35-lint
> >
> > And it ran on my machine in 5min 40s. Not blazing fast, but at least
> significantly faster than waiting for Jenkins : )
> > Best
> > -P.
> >
> > On Tue, Mar 19, 2019 at 5:22 PM Pablo Estrada 
> wrote:
> >>
> >> I use a selection of tox tasks. Here are the tox tasks that I use the
> most:
> >> - py27-gcp
> >> - py35-gcp
> >> - py27-cython
> >> - py35-cython
> >> - py35-lint
> >> - py27-lint
> >>
> >> Most recently, I'll run `tox -e py3-gcp,py3-lint`, which run fairly
> quickly. You can choose which subset works for you.
> >> My insight is: Lints are pretty fast, so it's fine to add a couple
> different lints. Unittest runs are pretty slow, so I usually go for the one
> with most coverage for my change (x-gcp, or x-cython).
> >> Best
> >> -P.
> >>
> >> On Mon, Feb 25, 2019 at 4:33 PM Ruoyun Huang  wrote:
> >>>
> >>> nvm.  Don't take my previous non-scientific comparison (only ran it
> once) too seriously. :-)
> >>>
> >>> I tried to repeat each for multiple times and now the difference
> diminishes.  likely there was a transient error in caching.
> >>>
> >>> On Mon, Feb 25, 2019 at 3:38 PM Kenneth Knowles 
> wrote:
> 
>  Ah, that is likely caused by us having ill-defined tasks that cannot
> be cached. Or is it that the configuration time is so significant?
> 
>  Kenn
> 
>  On Mon, Feb 25, 2019 at 11:05 AM Ruoyun Huang 
> wrote:
> >
> > Out of curiosity as a light gradle user, I did a side by side
> comparison, and the readings confirm what Kenn and Michael suggest.
> >
> > In the same repository, do a gradle clean followed by either of
> the two commands and measure their runtime. The latter one takes
> 1/3 of the running time.
> >
> > time ./gradlew spotlessApply && ./gradlew checkstyleMain &&
> ./gradlew checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain &&
> ./gradlew compileTestJava && ./gradlew compileJava
> > real 9m29.330s
> > user 0m11.330s
> > sys 0m1.239s
> >
> > time ./gradlew spotlessApply checkstyleMain checkstyleTest javadoc
> findbugsMain compileJava compileTestJava
> > real 3m35.573s
> > user 0m2.701s
> > sys 0m0.327s
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Feb 25, 2019 at 10:47 AM Alex Amato 
> wrote:
> >>
> >> @Michael, no particular reason. I think Ken's suggestion makes more
> sense.
> >>
> >> On Mon, Feb 25, 2019 at 10:36 AM Udi Meiri 
> wrote:
> >>>
> >>> Talking about Python:
> >>> I only know of "./gradlew lint", which include style and some py3
> compliance checking.
> >>> There is no auto-fix like spotlessApply AFAIK.
> >>>
> >>> As a side-note, I really dislike our python line continuation
> indent rule, since pycharm can't be configured to adhere to it and I find
> myself manually adjusting whitespace all the time.
> >>>
> >>>
> >>> On Mon, Feb 25, 2019 at 10:22 AM Kenneth Knowles 
> wrote:
> 
>  FWIW gradle is a depgraph-based build system. You can gain a few
> seconds by putting all but spotlessApply in one command.
> 
>  ./gradlew spotlessApply && ./gradlew checkstyleMain
> checkstyleTest javadoc findbugsMain compileTestJava compileJava
> 
>  It might be clever to define a meta-task. Gradle "base plugin"
> has the notable check (build and run tests), assemble (make artifacts), and
> build (assemble + check, badly named!)
> 
>  I think something like "everything except running tests and
> building artifacts" might be helpful.
> 
>  Kenn
> 
>  On Mon, Feb 25, 2019 at 10:13 AM Alex Amato 
> wrote:
> >
> > I made a thread about this a while back for java, but I don't
> think the same commands like spotless work for python.
> >
> > auto fixing lint issues
> > running any quick checks which would fail the PR (without
> running the whole precommit?)
> > Something like findbugs to detect common issues (i.e. py3
> compliance)
> >
> > FWIW, this is what I have been using for java. It will catch
> pretty much everything except presubmit test failures.
> >
> > ./gradlew spotlessApply && ./gradlew checkstyleMain && ./gradlew
> checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain && ./gradlew
> compileTestJava && ./gradlew compileJava
> >
> >
> >
> > --
> > 
> > Ruoyun  Huang
> >
> >>>
> >>>
> >>> --
> >>> 
> >>> Ruoyun  Huang
> >>>
>


Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Reuven Lax
The Java SDK relies on Jackson to do the encoding.

On Wed, Mar 20, 2019 at 11:33 AM Chamikara Jayalath 
wrote:

>
>
> On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:
>
>> Hi all,
>>
>>
>> I am working on porting beam to python 3 and discovered the following:
>>
>>
>> Current handling of bytes in bigquery IO:
>>
>> When writing bytes to BQ , beam uses
>> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
>> expects byte values to be base-64 encoded*.
>>
>> However when writing raw bytes they are currently never transformed to
>> base-64 encoded strings. This results in the following errors:
>>
>>-
>>
>>When writing b’abc’ in python 2 this results in actually writing
>>b'i\xb7' (which is the same as base64.b64decode('abc='))
>>-
>>
>>When writing b’abc’ in python 3 this results in “TypeError: b'abc' is
>>not JSON serializable”
>>-
>>
>>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8'
>>codec can't decode byte 0xab in position 0: invalid start byte. NAN, INF
>>and -INF values are not JSON compliant”
>>-
>>
>>When reading bytes from BQ they are currently returned as base-64
>>encoded strings rather than the raw bytes.
>>
>>
>> Example code:
>> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>>
>> There is also another issue when writing base-64 encoded strings to BQ.
>> When no schema is specified this results in “Invalid schema update. Field
>> bytes has changed type from BYTES to STRING”.
>>
>> This error can be reproduced when uploading a file (directly in the BQ
>> UI) to a table with bytes and using schema autodetect.
>>
>> Suggested solution:
>>
>> I suggest to change BigQuery IO to handle the base-64 encoding as follows
>> to allow the user to read and write raw bytes in BQ
>>
>> Writing data:
>>
>>-
>>
>>When a new table is created we use the provided schema to detect
>>bytes and handle the base-64 encoding accordingly
>>-
>>
>>When data is written to an existing table we use the API to get the
>>schema of the table and handle the base-64 encoding accordingly. We also
>>pass the schema as argument to avoid the error from schema autodetect.
>>
>> Reading data:
>>
>>-
>>
>>When reading data we also request the schema and handle the base-64
>>decoding accordingly to return raw bytes
>>
>>
>> What are your thoughts on this?
>>
>
> Thanks for the update. More context here:
> https://issues.apache.org/jira/browse/BEAM-6769
>
> Suggested solution sounds good to me. BTW do you know how Java SDK handles
> bytes type ? I believe we write JSON files and execute load jobs there as
> well (when method is FILE_LOADS).
>
> Thanks,
> Cham
>
>
>>
>> *I could not find this in the documentation of the API or in the
>> documentation of BigQuery itself which also expects base-64 encoded values.
>> I discovered this when uploading a file to BQ UI and getting an error:
>> "Could not decode base64 string to bytes."
>>
>>
>> --
>>
>> [image: https://ml6.eu] 
>>
>> * Juta Staes*
>> ML6 Gent
>> 
>>
>>  DISCLAIMER 
>> This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they are
>> addressed. If you have received this email in error please notify the
>> system manager. This message contains confidential information and is
>> intended only for the individual named. If you are not the named addressee
>> you should not disseminate, distribute or copy this e-mail. Please notify
>> the sender immediately by e-mail if you have received this e-mail by
>> mistake and delete this e-mail from your system. If you are not the
>> intended recipient you are notified that disclosing, copying, distributing
>> or taking any action in reliance on the contents of this information is
>> strictly prohibited.
>>
>


Re: Hazelcast Jet Runner

2019-03-20 Thread Ankur Goenka
Hi Can,

Like GreedyPipelineFuser, we have added many more components which make
building a Portable Runner easy. Here is a link [1] to slides which
explain at a very high level what is needed to add a new portable runner.
Adding a portable runner will still be more complex than adding a native
runner, but with these components it should be easier than
originally expected.

[1]
https://docs.google.com/presentation/d/1JRNUSpOC8qaA4uLDuyGsuuyf6Tk8Xi9LAukhgl-hT_w/edit?usp=sharing

Thanks,
Ankur

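As a rough illustration of what the fuser buys you (a toy Python sketch only, not the actual GreedyPipelineFuser logic): consecutive element-wise transforms collapse into one stage, and a new stage starts at each shuffle boundary such as a GroupByKey.

# Toy fusion sketch over a linear pipeline of transform names.
SHUFFLE_TRANSFORMS = {'GroupByKey', 'Reshuffle'}

def fuse(transforms):
    stages, current = [], []
    for name in transforms:
        if name in SHUFFLE_TRANSFORMS:
            if current:
                stages.append(current)
            stages.append([name])  # the shuffle itself is its own stage
            current = []
        else:
            current.append(name)  # element-wise work fuses into one stage
    if current:
        stages.append(current)
    return stages

print(fuse(['Read', 'Map', 'Filter', 'GroupByKey', 'Combine', 'Write']))
# [['Read', 'Map', 'Filter'], ['GroupByKey'], ['Combine', 'Write']]
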
On Wed, Mar 20, 2019 at 7:19 AM Maximilian Michels  wrote:

> Documentation on portability is still a bit sparse although there are
> many design documents:
> https://beam.apache.org/contribute/design-documents/#portability
>
> The structure of portable Runners is not fundamentally different, but
> some of the operations are deferred to the SDK which runs code for all
> supported languages. The Runner needs to provide an integration with it.
>
> Eventually, the old Runners will become obsolete though that won't
> happen very soon. Performance should be slightly better on the old Runners.
>
> I think writing an old-style Runner now will give you enough experience
> to port it to the new language-portable style later on.
>
> Cheers,
> Max
>
> On 20.03.19 14:52, Can Gencer wrote:
> > I had a look at "GreedyPipelineFuser" and indeed this was exactly what I
> > was talking about.
> >
> > Is https://beam.apache.org/roadmap/portability/ still the best
> > information about the portable runners or is there a more in-depth guide
> > available anywhere?
> >
> > On Wed, Mar 20, 2019 at 2:29 PM Can Gencer  > > wrote:
> >
> > Hi Max,
> >
> > Thanks. When you say "old-style runner", does this mean that this
> > style of runners will be obsolete and only the portable one will be
> > supported? The documentation for portable runners wasn't quite
> > complete, the barrier to entry for writing an old-style runner
> > seemed lower for us, and the old style runner should have better
> > performance?
> >
> > On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels  > > wrote:
> >
> > Hi Can,
> >
> > Thanks for the update. Interesting question. Flink has an
> > optimization
> > built in called chaining which works together nicely with Beam.
> > Essentially, operators which share the same partitioning get
> > executed
> > one after another inside a master operator. This saves resources.
> >
> > Interestingly, Beam's Fuser for portable Runners does something
> > similar.
> > AFAIK there is no built-in solution for the old-style Runners. I
> > think
> > it would be possible to build something like this on top of the
> > existing
> > translation.
> >
> > Cheers,
> > Max
> >
> > On 20.03.19 13:07, Can Gencer wrote:
> >  > Hi again,
> >  >
> >  > We've made some progress on the runner since writing this
> > more than a
> >  > month ago, the repo is available here publicly:
> >  > https://github.com/hazelcast/hazelcast-jet-beam-runner
> >  >
> >  > Still very much a work in progress though. One of the issues
> > I wanted to
> >  > raise is that currently we're translating each PTransform to
> > a Jet
>  > Vertex (could be considered analogous to a Flink operator or a
> > vertex in
> >  > Tez). This is sub-optimal, since Beam creates lots of
> > transforms for
> >  > computations that could be performed inside the same Vertex,
> > such as
> >  > subsequent mapping transforms and many others. Ideally you
> > only need
> >  > distinct vertices where the data is re-partitioned and/or
> > shuffled. I'm
> >  > curious if Beam offers some way of translating the PTransform
> > graph to a
> >  > more minimal set of transforms, i.e. some kind of planner or
> > would this
> >  > have to be custom code? We've done a similar integration with
> > Cascading
> >  > in the past and it offered a planner which given a set of
> > rules would
> >  > partition the Cascading DAG into a minimal set of vertices
> > for the same
> >  > DAG. Curious if Beam has any similar functionality?
> >  >
> >  >
> >  >
 >  > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles  wrote:
> >  >
> >  > Elaborating on what Robert alluded to: when I wrote that
> > runner
> >  > author guide, portability was in its infancy. Now Beam
> > Python can be
> >  > run on Flink. So that guide is primarily focused on the
> > 

Re: Writing bytes to BigQuery with beam

2019-03-20 Thread Chamikara Jayalath
On Wed, Mar 20, 2019 at 5:46 AM Juta Staes  wrote:

> Hi all,
>
>
> I am working on porting beam to python 3 and discovered the following:
>
>
> Current handling of bytes in bigquery IO:
>
> When writing bytes to BQ , beam uses
> https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API
> expects byte values to be base-64 encoded*.
>
> However when writing raw bytes they are currently never transformed to
> base-64 encoded strings. This results in the following errors:
>
>-
>
>When writing b’abc’ in python 2 this results in actually writing
>b'i\xb7' (which is the same as base64.b64decode('abc='))
>-
>
>When writing b’abc’ in python 3 this results in “TypeError: b'abc' is
>not JSON serializable”
>-
>
>When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>values are not JSON compliant”
>-
>
>When reading bytes from BQ they are currently returned as base-64
>encoded strings rather than the raw bytes.
>
>
> Example code:
> https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
>
> There is also another issue when writing base-64 encoded strings to BQ.
> When no schema is specified this results in “Invalid schema update. Field
> bytes has changed type from BYTES to STRING”.
>
> This error can be reproduced when uploading a file (directly in the BQ UI)
> to a table with bytes and using schema autodetect.
>
> Suggested solution:
>
> I suggest to change BigQuery IO to handle the base-64 encoding as follows
> to allow the user to read and write raw bytes in BQ
>
> Writing data:
>
>-
>
>When a new table is created we use the provided schema to detect bytes
>and handle the base-64 encoding accordingly
>-
>
>When data is written to an existing table we use the API to get the
>schema of the table and handle the base-64 encoding accordingly. We also
>pass the schema as argument to avoid the error from schema autodetect.
>
> Reading data:
>
>-
>
>When reading data we also request the schema and handle the base-64
>decoding accordingly to return raw bytes
>
>
> What are your thoughts on this?
>

Thanks for the update. More context here:
https://issues.apache.org/jira/browse/BEAM-6769

Suggested solution sounds good to me. BTW do you know how Java SDK handles
bytes type ? I believe we write JSON files and execute load jobs there as
well (when method is FILE_LOADS).

Thanks,
Cham


>
> *I could not find this in the documentation of the API or in the
> documentation of BigQuery itself which also expects base-64 encoded values.
> I discovered this when uploading a file to BQ UI and getting an error:
> "Could not decode base64 string to bytes."
>
>
> --
>
> [image: https://ml6.eu] 
>
> * Juta Staes*
> ML6 Gent
> 
>
>  DISCLAIMER 
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual or entity to whom they are addressed.
> If you have received this email in error please notify the system manager.
> This message contains confidential information and is intended only for the
> individual named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. If you are not the intended recipient
> you are notified that disclosing, copying, distributing or taking any
> action in reliance on the contents of this information is strictly
> prohibited.
>


Re: [Announcement] New Website for Beam Summits

2019-03-20 Thread David Morávek
This is great! Thanks for all of the hard work you're putting into this.

D.

On Wed, Mar 20, 2019 at 1:38 PM Maximilian Michels  wrote:

> Not a bug, it's a feature ;)
>
> On 20.03.19 07:23, Kenneth Knowles wrote:
> > Very nice. I appreciate the emphasis on coffee [1] [2] [3] though I
> > suspect there may be a rendering bug.
> >
> > Kenn
> >
> > [1] https://beamsummit.org/schedule/2019-06-19?sessionId=1
> > [2] https://beamsummit.org/schedule/2019-06-19?sessionId=3
> > [3] https://beamsummit.org/schedule/2019-06-19?sessionId=4
> >
> > On Tue, Mar 19, 2019 at 4:43 AM Łukasz Gajowy  > > wrote:
> >
> > Looks great! Thanks for doing this! :)
> >
> > Łukasz
> >
 > > On Tue, 19 Mar 2019 at 12:30 Maximilian Michels  wrote:
> >
> > Great stuff! Looking forward to seeing many Beam folks in Berlin.
> >
> > In case you want to speak at Beam Summit Europe, the Call for
> > Papers is
> > open until April 1:
> https://sessionize.com/beam-summit-europe-2019
> >
> > -Max
> >
> > On 19.03.19 09:49, Matthias Baetens wrote:
> >  > Awesome Aizhamal! Great work and thanks for your continued
> > efforts on
> >  > this :) Looking forward to the summit.
> >  >
 >  > On Mon, 18 Mar 2019 at 23:17, Aizhamal Nurmamat kyzy  wrote:
> >  >
> >  > Hello everybody!
> >  >
> >  >
 >  > We are thrilled to announce the launch of beamsummit.org, dedicated to
 >  > Beam Summits!
> >  >
> >  >
> >  > The current version of the website provides information
> > about the
> >  > upcoming Beam Summit in Europe on June 19-20th, 2019. We
> > will update
> >  > it for the upcoming summits in Asia and North America
> > accordingly.
> >  > You can access all necessary information about the
> > conference theme,
> >  > speakers and sessions, the abstract submission timeline
> > and the
> >  > registration process, the conference venues and much more
> > that you
> >  > will find useful until and during the Beam Summits 2019.
> >  >
> >  >
> >  > We are working to make the website easy to use, so that
> > anyone who
> >  > is organizing a Beam event can rely on it. You can find
> > the code for
> >  > it in Github
> > .
> >  >
> >  > The pages will be updated on a regular basis, but we also
> > love
> >  > hearing thoughts from our community! Let us know if you
> > have any
> >  > questions, comments or suggestions, and help us improve.
> > Also, if
> >  > you are thinking of organizing a Beam event, please feel
> > free to
 >  > reach out for support, and to use the
> >  > code in GitHub as well.
> >  >
> >  >
> >  > We sincerely hope that you like the new Beam Summit
> > website and will
> >  > find it useful for accessing information. Enjoy browsing
> > around!
> >  >
> >  >
> >  > Thanks,
> >  >
> >  > Aizhamal
> >  >
> >
>


Re: User state cleanup

2019-03-20 Thread Kenneth Knowles
On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels  wrote:

> Hi,
>
> I just realized that user state acquired via StateInternals in the Flink
> Runner is not released automatically even when it falls out of the
> Window scope. There are ways to work around this, i.e. setting a cleanup
> timer that fires when the Window expires.
>
> Do we expect Runners to perform the cleanup? I would think so since
> users do not have control over state once the window expires.
>

Just to be super clear for anyone not digging in the referenced code: yes,
we do. And the code Reuven referenced is utility code that a runner can use
to facilitate this, or the runner can do its own thing.

Kenn


>
> Thanks,
> Max
>
>


Re: User state cleanup

2019-03-20 Thread Thomas Weise
Good to know that the basic capability is in place, otherwise stateful
processing could only be used with timers that perform cleanup in user land.

I don't think the cleanup timer is used in the portable Flink runner
though. DoFnOperator.createWrappingDoFnRunner isn't executed in this case.

Would be nice to have test coverage for the cleanup path, since missing
cleanup would eventually lead to out-of-memory errors (with the heap memory
state backend) or even harder-to-diagnose disk space issues.

Thomas

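For context, a rough Beam Python sketch of the "cleanup in user land" pattern mentioned above (illustrative only; it assumes a keyed input PCollection and a runner that supports user state and timers):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain

class CountWithUserlandCleanup(beam.DoFn):
    # The DoFn sets its own timer at the end of the window and clears its
    # state when that timer fires, instead of relying on the runner to
    # garbage-collect expired-window state.
    COUNT = userstate.BagStateSpec('count', VarIntCoder())
    CLEANUP = userstate.TimerSpec('cleanup', TimeDomain.WATERMARK)

    def process(self, element,
                window=beam.DoFn.WindowParam,
                count=beam.DoFn.StateParam(COUNT),
                cleanup=beam.DoFn.TimerParam(CLEANUP)):
        count.add(1)
        cleanup.set(window.end)  # fire once the window has expired
        yield element

    @userstate.on_timer(CLEANUP)
    def on_cleanup(self, count=beam.DoFn.StateParam(COUNT)):
        count.clear()  # release the state explicitly
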
On Wed, Mar 20, 2019 at 7:03 AM Maximilian Michels  wrote:

> Thanks for the pointer Reuven. I didn't see that on window expiration
> this would iterate over all user state and call the `clear` method.
>
> -Max
>
> On 20.03.19 14:59, Reuven Lax wrote:
> > Is this not already handled by cleanupTimer in StatefulDoFnRunner?
> >
> > On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels  > > wrote:
> >
> > Hi,
> >
> > I just realized that user state acquired via StateInternals in the
> > Flink
> > Runner is not released automatically even when it falls out of the
> > Window scope. There are ways to work around this, i.e. setting a
> > cleanup
> > timer that fires when the Window expires.
> >
> > Do we expect Runners to perform the cleanup? I would think so since
> > users do not have control over state once the window expires.
> >
> > Thanks,
> > Max
> >
>


Re: Hazelcast Jet Runner

2019-03-20 Thread Maximilian Michels
Documentation on portability is still a bit sparse although there are 
many design documents: 
https://beam.apache.org/contribute/design-documents/#portability


The structure of portable Runners is not fundamentally different, but 
some of the operations are deferred to the SDK which runs code for all 
supported languages. The Runner needs to provide an integration with it.


Eventually, the old Runners will become obsolete though that won't 
happen very soon. Performance should be slightly better on the old Runners.


I think writing an old-style Runner now will give you enough experience 
to port it to the new language-portable style later on.


Cheers,
Max

On 20.03.19 14:52, Can Gencer wrote:
I had a look at "GreedyPipelineFuser" and indeed this was exactly what I
was talking about.


Is https://beam.apache.org/roadmap/portability/ still the best 
information about the portable runners or is there a more in-depth guide 
available anywhere?


On Wed, Mar 20, 2019 at 2:29 PM Can Gencer > wrote:


Hi Max,

Thanks. When you say "old-style runner", does this mean that this
style of runners will be obsolete and only the portable one will be
supported? The documentation for portable runners wasn't quite
complete, the barrier to entry for writing an old-style runner
seemed lower for us, and the old style runner should have better
performance?

On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels  wrote:

Hi Can,

Thanks for the update. Interesting question. Flink has an
optimization
built in called chaining which works together nicely with Beam.
Essentially, operators which share the same partitioning get
executed
one after another inside a master operator. This saves resources.

Interestingly, Beam's Fuser for portable Runners does something
similar.
AFAIK there is no built-in solution for the old-style Runners. I
think
it would be possible to build something like this on top of the
existing
translation.

Cheers,
Max

On 20.03.19 13:07, Can Gencer wrote:
 > Hi again,
 >
 > We've made some progress on the runner since writing this
more than a
 > month ago, the repo is available here publicly:
 > https://github.com/hazelcast/hazelcast-jet-beam-runner
 >
 > Still very much a work in progress though. One of the issues
I wanted to
 > raise is that currently we're translating each PTransform to
a Jet
 > Vertex (could be considered analogous to a Flink operator or a
vertex in
 > Tez). This is sub-optimal, since Beam creates lots of
transforms for
 > computations that could be performed inside the same Vertex,
such as
 > subsequent mapping transforms and many others. Ideally you
only need
 > distinct vertices where the data is re-partitioned and/or
shuffled. I'm
 > curious if Beam offers some way of translating the PTransform
graph to a
 > more minimal set of transforms, i.e. some kind of planner or
would this
 > have to be custom code? We've done a similar integration with
Cascading
 > in the past and it offered a planner which given a set of
rules would
 > partition the Cascading DAG into a minimal set of vertices
for the same
 > DAG. Curious if Beam has any similar functionality?
 >
 >
 >
 >     > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles  wrote:
 >
 >     Elaborating on what Robert alluded to: when I wrote that
runner
 >     author guide, portability was in its infancy. Now Beam
Python can be
 >     run on Flink. So that guide is primarily focused on the
"deserialize
 >     a Java DoFn and call its methods" approach. A decent
amount of it is
 >     still really important to know, but is now the
responsibility of the
 >     "SDK harness", aka language-specific coprocessor. For
Python & Go &
 >      you really want to use the
 >     portability protos and the portable Flink runner is the
best model.
 >
 >     Kenn
 >
 >
 >     On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw  wrote:
 >
 >         On Fri, Feb 15, 2019 at 7:36 AM Can Gencer  wrote:
 >          >
 >          > We at Hazelcast are 

Re: User state cleanup

2019-03-20 Thread Maximilian Michels
Thanks for the pointer Reuven. I didn't see that on window expiration 
this would iterate over all user state and call the `clear` method.


-Max

On 20.03.19 14:59, Reuven Lax wrote:

Is this not already handled by cleanupTimer in StatefulDoFnRunner?

On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels > wrote:


Hi,

I just realized that user state acquired via StateInternals in the
Flink
Runner is not released automatically even when it falls out of the
Window scope. There are ways to work around this, i.e. setting a
cleanup
timer that fires when the Window expires.

Do we expect Runners to perform the cleanup? I would think so since
users do not have control over state once the window expires.

Thanks,
Max



Re: User state cleanup

2019-03-20 Thread Reuven Lax
Is this not already handled by cleanupTimer in StatefulDoFnRunner?

On Wed, Mar 20, 2019 at 6:23 AM Maximilian Michels  wrote:

> Hi,
>
> I just realized that user state acquired via StateInternals in the Flink
> Runner is not released automatically even when it falls out of the
> Window scope. There are ways to work around this, i.e. setting a cleanup
> timer that fires when the Window expires.
>
> Do we expect Runners to perform the cleanup? I would think so since
> users do not have control over state once the window expires.
>
> Thanks,
> Max
>
>


Re: Hazelcast Jet Runner

2019-03-20 Thread Can Gencer
I had a look at "GreedyPipelineFuser" and indeed this was exactly what I
was talking about.

Is https://beam.apache.org/roadmap/portability/ still the best information
about the portable runners or is there a more in-depth guide available
anywhere?

On Wed, Mar 20, 2019 at 2:29 PM Can Gencer  wrote:

> Hi Max,
>
> Thanks. When you say "old-style runner", does this mean that this style of
> runners will be obsolete and only the portable one will be supported? The
> documentation for portable runners wasn't quite complete, the barrier to
> entry for writing an old-style runner seemed lower for us, and the old
> style runner should have better performance?
>
> On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels  wrote:
>
>> Hi Can,
>>
>> Thanks for the update. Interesting question. Flink has an optimization
>> built in called chaining which works together nicely with Beam.
>> Essentially, operators which share the same partitioning get executed
>> one after another inside a master operator. This saves resources.
>>
>> Interestingly, Beam's Fuser for portable Runners does something similar.
>> AFAIK there is no built-in solution for the old-style Runners. I think
>> it would be possible to build something like this on top of the existing
>> translation.
>>
>> Cheers,
>> Max
>>
>> On 20.03.19 13:07, Can Gencer wrote:
>> > Hi again,
>> >
>> > We've made some progress on the runner since writing this more than a
>> > month ago, the repo is available here publicly:
>> > https://github.com/hazelcast/hazelcast-jet-beam-runner
>> >
>> > Still very much a work in progress though. One of the issues I wanted
>> to
>> > raise is that currently we're translating each PTransform to a Jet
>> > Vertex (could be considered analogous to a Flink operator or a vertex in
>> > Tez). This is sub-optimal, since Beam creates lots of transforms for
>> > computations that could be performed inside the same Vertex, such as
>> > subsequent mapping transforms and many others. Ideally you only need
>> > distinct vertices where the data is re-partitioned and/or shuffled. I'm
>> > curious if Beam offers some way of translating the PTransform graph to
>> a
>> > more minimal set of transforms, i.e. some kind of planner or would this
>> > have to be custom code? We've done a similar integration with Cascading
>> > in the past and it offered a planner which given a set of rules would
>> > partition the Cascading DAG into a minimal set of vertices for the same
>> > DAG. Curious if Beam has any similar functionality?
>> >
>> >
>> >
>> > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles > > > wrote:
>> >
>> > Elaborating on what Robert alluded to: when I wrote that runner
>> > author guide, portability was in its infancy. Now Beam Python can be
>> > run on Flink. So that guide is primarily focused on the "deserialize
>> > a Java DoFn and call its methods" approach. A decent amount of it is
>> > still really important to know, but is now the responsibility of the
>> > "SDK harness", aka language-specific coprocessor. For Python & Go &
>> >  you really want to use the
>> > portability protos and the portable Flink runner is the best model.
>> >
>> > Kenn
>> >
>> >
>> > On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw <
>> rober...@google.com
>> > > wrote:
>> >
>> > On Fri, Feb 15, 2019 at 7:36 AM Can Gencer > > > wrote:
>> >  >
>> >  > We at Hazelcast are looking into writing a Beam runner for
>> > Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
>> > wanted to introduce myself as we'll likely have questions as we
>> > start development.
>> >
>> > Welcome!
>> >
>> > Hazelcast looks interesting, a Beam runner for it would be very
>> > cool.
>> >
>> >  > Some of the things I'm wondering about currently:
>> >  >
>> >  > * Currently there seems to be a guide available at
>> > https://beam.apache.org/contribute/runner-guide/ , is this up
>> to
>> > date? Is there anything in specific to be aware of when starting
>> > with a new runner that's not covered here?
>> >
>> > That looks like a pretty good starting point. At a quick
>> glance, I
>> > don't see anything that looks out of date. Another resource that
>> > might
>> > be helpful is a talk from last year on writing an SDK (but as it
>> > mostly covers the runner-sdk interaction, it's also quite
>> useful for
>> > understanding the runner side:
>> >
>> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
>> > And please feel free to ask any questions on this list as well;
>> we'd
>> > be happy to help.
>> >
>> >  > * Should we be targeting the latest master which is at
>> > 2.12-SNAPSHOT or a stable version?
>> >
>> > 

Re: Hazelcast Jet Runner

2019-03-20 Thread Can Gencer
Hi Max,

Thanks. When you say "old-style runner", do you mean that this style of
runner will become obsolete and only the portable one will be supported? The
documentation for portable runners wasn't quite complete, the barrier to
entry for writing an old-style runner seemed lower for us, and the old-style
runner should have better performance?

On Wed, Mar 20, 2019 at 1:36 PM Maximilian Michels  wrote:

> Hi Can,
>
> Thanks for the update. Interesting question. Flink has an optimization
> built in called chaining which works together nicely with Beam.
> Essentially, operators which share the same partitioning get executed
> one after another inside a master operator. This saves resources.
>
> Interestingly, Beam's Fuser for portable Runners does something similar.
> AFAIK there is no built-in solution for the old-style Runners. I think
> it would be possible to build something like this on top of the existing
> translation.
>
> Cheers,
> Max
>
> On 20.03.19 13:07, Can Gencer wrote:
> > Hi again,
> >
> > We've made some progress on the runner since writing this more than a
> > month ago, the repo is available here publicly:
> > https://github.com/hazelcast/hazelcast-jet-beam-runner
> >
> > Still very much a work in progress though. One of the issues I wanted to
> > raise is that currently we're translating each PTransform to a Jet
> > Vertex (could be considered analogous to a Flink operator or a vertex in
> > Tez). This is sub-optimal, since Beam creates lots of transforms for
> > computations that could be performed inside the same Vertex, such as
> > subsequent mapping transforms and many others. Ideally you only need
> > distinct vertices where the data is re-partitioned and/or shuffled. I'm
> > curious if Beam offers some way of translating the PTransform graph to a
> > more minimal set of transforms, i.e. some kind of planner or would this
> > have to be custom code? We've done a similar integration with Cascading
> > in the past and it offered a planner which given a set of rules would
> > partition the Cascading DAG into a minimal set of vertices for the same
> > DAG. Curious if Beam has any similar functionality?
> >
> >
> >
> > On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles  > > wrote:
> >
> > Elaborating on what Robert alluded to: when I wrote that runner
> > author guide, portability was in its infancy. Now Beam Python can be
> > run on Flink. So that guide is primarily focused on the "deserialize
> > a Java DoFn and call its methods" approach. A decent amount of it is
> > still really important to know, but is now the responsibility of the
> > "SDK harness", aka language-specific coprocessor. For Python & Go &
> >  you really want to use the
> > portability protos and the portable Flink runner is the best model.
> >
> > Kenn
> >
> >
> > On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw  > > wrote:
> >
> > On Fri, Feb 15, 2019 at 7:36 AM Can Gencer  > > wrote:
> >  >
> >  > We at Hazelcast are looking into writing a Beam runner for
> > Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
> > wanted to introduce myself as we'll likely have questions as we
> > start development.
> >
> > Welcome!
> >
> > Hazelcast looks interesting, a Beam runner for it would be very
> > cool.
> >
> >  > Some of the things I'm wondering about currently:
> >  >
> >  > * Currently there seems to be a guide available at
> > https://beam.apache.org/contribute/runner-guide/ , is this up to
> > date? Is there anything in specific to be aware of when starting
> > with a new runner that's not covered here?
> >
> > That looks like a pretty good starting point. At a quick glance,
> I
> > don't see anything that looks out of date. Another resource that
> > might
> > be helpful is a talk from last year on writing an SDK (but as it
> > mostly covers the runner-sdk interaction, it's also quite useful
> for
> > understanding the runner side:
> >
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> > And please feel free to ask any questions on this list as well;
> we'd
> > be happy to help.
> >
> >  > * Should we be targeting the latest master which is at
> > 2.12-SNAPSHOT or a stable version?
> >
> > I would target the latest master.
> >
> >  > * After a runner is developed, how is the maintenance
> > typically handled, as the runners seems to be part of Beam
> codebase?
> >
> > Either is possible. Several runner adapters are part of the Beam
> > codebase, but for example the IMB Streams Beam runner is not.
> There
> > are certainly pros and cons (certainly early on when the APIs
> > 

User state cleanup

2019-03-20 Thread Maximilian Michels

Hi,

I just realized that user state acquired via StateInternals in the Flink 
Runner is not released automatically even when it falls out of the 
Window scope. There are ways to work around this, e.g. setting a cleanup 
timer that fires when the Window expires (see the sketch below).
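
For illustration, a minimal sketch of that workaround using the Python SDK's
user state and timer API (the state spec, coder and names are only examples;
with a non-zero allowed lateness the timer would need to be set at the
window's max timestamp plus that lateness):

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class BufferWithCleanupDoFn(beam.DoFn):
  BUFFER = BagStateSpec('buffer', VarIntCoder())        # example user state
  CLEANUP = TimerSpec('cleanup', TimeDomain.WATERMARK)  # event-time timer

  def process(self,
              element,  # stateful DoFns take (key, value) pairs
              window=beam.DoFn.WindowParam,
              buffer=beam.DoFn.StateParam(BUFFER),
              cleanup=beam.DoFn.TimerParam(CLEANUP)):
    buffer.add(element[1])
    # Ask for a callback once the watermark passes the end of the window,
    # so the state does not linger after the window has expired.
    cleanup.set(window.max_timestamp())

  @on_timer(CLEANUP)
  def on_cleanup(self, buffer=beam.DoFn.StateParam(BUFFER)):
    buffer.clear()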


Do we expect Runners to perform the cleanup? I would think so since 
users do not have control over state once the window expires.


Thanks,
Max



Writing bytes to BigQuery with beam

2019-03-20 Thread Juta Staes
Hi all,


I am working on porting beam to python 3 and discovered the following:


Current handling of bytes in bigquery IO:

When writing bytes to BQ, beam uses
https://cloud.google.com/bigquery/docs/reference/rest/v2/. This API expects
byte values to be base-64 encoded*.

However, when writing raw bytes, they are currently never transformed to
base-64 encoded strings. This results in the following errors:

   - When writing b'abc' in python 2 this results in actually writing
     b'i\xb7', which is the same as base64.b64decode('abc=')
   - When writing b'abc' in python 3 this results in “TypeError: b'abc' is
     not JSON serializable”
   - When writing b'\xab' in py2/py3 this gives a “ValueError: 'utf8' codec
     can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
     values are not JSON compliant”
   - When reading bytes from BQ they are currently returned as base-64
     encoded strings rather than the raw bytes.


Example code:
https://docs.google.com/document/d/19zvDycWzF82MmtCmxrhqqyXKaRq8slRIjdxE6E8MObA/edit?usp=sharing
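
(A tiny standalone snippet of the round trip described above; the field name
is made up:)

import base64
import json

raw = b'\xab'                                    # raw bytes for a BYTES column
encoded = base64.b64encode(raw).decode('ascii')  # 'qw==', JSON-serializable
print(json.dumps({'data': encoded}))             # the form the REST API expects
assert base64.b64decode(encoded) == raw          # decode again when reading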

There is also another issue when writing base-64 encoded strings to BQ. When
no schema is specified, this results in “Invalid schema update. Field bytes
has changed type from BYTES to STRING”.

This error can be reproduced when uploading a file (directly in the BQ UI)
to a table with bytes and using schema autodetect.

Suggested solution:

I suggest changing BigQuery IO to handle the base-64 encoding as follows,
so that the user can read and write raw bytes in BQ:

Writing data:

   - When a new table is created, we use the provided schema to detect BYTES
     fields and handle the base-64 encoding accordingly
   - When data is written to an existing table, we use the API to get the
     schema of the table and handle the base-64 encoding accordingly (see the
     sketch after these lists). We also pass the schema as an argument to
     avoid the error from schema autodetect.

Reading data:

   - When reading data, we also request the schema and handle the base-64
     decoding accordingly, to return the raw bytes
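
For concreteness, a minimal sketch of the schema lookup and per-field handling
for the writing case (this assumes the google-cloud-bigquery client rather than
Beam's own BigQuery wrapper; the helper name, table id and field handling are
purely illustrative):

import base64

from google.cloud import bigquery


def encode_bytes_fields(row, table_id, client=None):
  """Base64-encode values of BYTES columns in `row` (a dict) before writing."""
  client = client or bigquery.Client()
  schema = client.get_table(table_id).schema  # list of SchemaField
  byte_fields = {f.name for f in schema if f.field_type == 'BYTES'}
  return {
      name: (base64.b64encode(value).decode('ascii')
             if name in byte_fields and value is not None else value)
      for name, value in row.items()
  }

# Hypothetical usage:
# encode_bytes_fields({'name': u'abc', 'data': b'\xab'},
#                     'my-project.my_dataset.my_table')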


What are your thoughts on this?

*I could not find this in the documentation of the API, or in the
documentation of BigQuery itself, which also expects base-64 encoded values.
I discovered this when uploading a file in the BQ UI and getting the error:
"Could not decode base64 string to bytes."


-- 


Juta Staes
ML6 Gent




Re: [Announcement] New Website for Beam Summits

2019-03-20 Thread Maximilian Michels

Not a bug, it's a feature ;)

On 20.03.19 07:23, Kenneth Knowles wrote:
Very nice. I appreciate the emphasis on coffee [1] [2] [3] though I 
suspect there may be a rendering bug.


Kenn

[1] https://beamsummit.org/schedule/2019-06-19?sessionId=1
[2] https://beamsummit.org/schedule/2019-06-19?sessionId=3
[3] https://beamsummit.org/schedule/2019-06-19?sessionId=4

On Tue, Mar 19, 2019 at 4:43 AM Łukasz Gajowy > wrote:


Looks great! Thanks for doing this! :)

Łukasz

On Tue, Mar 19, 2019 at 12:30 PM Maximilian Michels mailto:m...@apache.org>> wrote:

Great stuff! Looking forward to seeing many Beam folks in Berlin.

In case you want to speak at Beam Summit Europe, the Call for
Papers is
open until April 1: https://sessionize.com/beam-summit-europe-2019

-Max

On 19.03.19 09:49, Matthias Baetens wrote:
 > Awesome Aizhamal! Great work and thanks for your continued
efforts on
 > this :) Looking forward to the summit.
 >
 > On Mon, 18 Mar 2019 at 23:17, Aizhamal Nurmamat kyzy
 > mailto:aizha...@google.com>
>> wrote:
 >
 >     Hello everybody!
 >
 >
 >     We are thrilled to announce the launch of beamsummit.org

 >      dedicated to Beam Summits!
 >
 >
 >     The current version of the website provides information
about the
 >     upcoming Beam Summit in Europe on June 19-20th, 2019. We
will update
 >     it for the upcoming summits in Asia and North America
accordingly.
 >     You can access all necessary information about the
conference theme,
 >     speakers and sessions, the abstract submission timeline
and the
 >     registration process, the conference venues and much more
that you
 >     will find useful until and during the Beam Summits 2019.
 >
 >
 >     We are working to make the website easy to use, so that
anyone who
 >     is organizing a Beam event can rely on it. You can find
the code for
 >     it in Github
.
 >
 >     The pages will be updated on a regular basis, but we also
love
 >     hearing thoughts from our community! Let us know if you
have any
 >     questions, comments or suggestions, and help us improve.
Also, if
 >     you are thinking of organizing a Beam event, please feel
free to
 >     reach out >for support, and to use the
 >     code in GitHub as well.
 >
 >
 >     We sincerely hope that you like the new Beam Summit
website and will
 >     find it useful for accessing information. Enjoy browsing
around!
 >
 >
 >     Thanks,
 >
 >     Aizhamal
 >



Re: Hazelcast Jet Runner

2019-03-20 Thread Maximilian Michels

Hi Can,

Thanks for the update. Interesting question. Flink has an optimization 
built in called chaining which works together nicely with Beam. 
Essentially, operators which share the same partitioning get executed 
one after another inside a master operator. This saves resources.


Interestingly, Beam's Fuser for portable Runners does something similar. 
AFAIK there is no built-in solution for the old-style Runners. I think 
it would be possible to build something like this on top of the existing 
translation.


Cheers,
Max

On 20.03.19 13:07, Can Gencer wrote:

Hi again,

We've made some progress on the runner since writing this more than a 
month ago, the repo is available here publicly: 
https://github.com/hazelcast/hazelcast-jet-beam-runner


Still very much a work in progress though. One of the issues I wanted to 
raise is that currently we're translating each PTransform to a Jet 
Vertex (could be considered analogous to a Flink operator or a vertex in 
Tez). This is sub-optimal, since Beam creates lots of transforms for 
computations that could be performed inside the same Vertex, such as 
subsequent mapping transforms and many others. Ideally you only need 
distinct vertices where the data is re-partitioned and/or shuffled. I'm 
curious if Beam offers some way of translating the PTransform graph to a 
more minimal set of transforms, i.e. some kind of planner or would this 
have to be custom code? We've done a similar integration with Cascading 
in the past and it offered a planner which given a set of rules would 
partition the Cascading DAG into a minimal set of vertices for the same 
DAG. Curious if Beam has any similar functionality?




On Sat, Feb 16, 2019 at 4:50 AM Kenneth Knowles > wrote:


Elaborating on what Robert alluded to: when I wrote that runner
author guide, portability was in its infancy. Now Beam Python can be
run on Flink. So that guide is primarily focused on the "deserialize
a Java DoFn and call its methods" approach. A decent amount of it is
still really important to know, but is now the responsibility of the
"SDK harness", aka language-specific coprocessor. For Python & Go &
 you really want to use the
portability protos and the portable Flink runner is the best model.

Kenn


On Fri, Feb 15, 2019 at 2:08 AM Robert Bradshaw mailto:rober...@google.com>> wrote:

On Fri, Feb 15, 2019 at 7:36 AM Can Gencer mailto:c...@hazelcast.com>> wrote:
 >
 > We at Hazelcast are looking into writing a Beam runner for
Hazelcast Jet (https://github.com/hazelcast/hazelcast-jet). I
wanted to introduce myself as we'll likely have questions as we
start development.

Welcome!

Hazelcast looks interesting, a Beam runner for it would be very
cool.

 > Some of the things I'm wondering about currently:
 >
 > * Currently there seems to be a guide available at
https://beam.apache.org/contribute/runner-guide/ , is this up to
date? Is there anything in specific to be aware of when starting
with a new runner that's not covered here?

That looks like a pretty good starting point. At a quick glance, I
don't see anything that looks out of date. Another resource that
might
be helpful is a talk from last year on writing an SDK (but as it
mostly covers the runner-sdk interaction, it's also quite useful for
understanding the runner side:

https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
And please feel free to ask any questions on this list as well; we'd
be happy to help.

 > * Should we be targeting the latest master which is at
2.12-SNAPSHOT or a stable version?

I would target the latest master.

 > * After a runner is developed, how is the maintenance
typically handled, as the runners seems to be part of Beam codebase?

Either is possible. Several runner adapters are part of the Beam
codebase, but for example the IMB Streams Beam runner is not. There
are certainly pros and cons (certainly early on when the APIs
themselves were under heavy development it was easier to keep things
in sync in the same codebase, but things have mostly stabilized
now).
A runner only becomes part of the Beam codebase if there are members
of the community committed to maintaining it (which could include
you). Both approaches are fine.

- Robert



Re: What quick command to catch common issues before pushing a python PR?

2019-03-20 Thread Robert Bradshaw
I use tox as well. Actually, I use detox and retox (parallel versions
of tox, easily installable with pip) which can speed things up quite a
bit.
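
(For anyone who wants to try this, a possible invocation; this assumes detox
keeps tox's -e syntax and that these environment names exist in the repo's
tox.ini:

pip install detox
detox -e py27-gcp,py35-gcp,py35-lint)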

On Wed, Mar 20, 2019 at 1:33 AM Pablo Estrada  wrote:
>
> Correction - the command is now: tox -e py35-gcp,py35-lint
>
> And it ran on my machine in 5min 40s. Not blazing fast, but at least 
> significantly faster than waiting for Jenkins : )
> Best
> -P.
>
> On Tue, Mar 19, 2019 at 5:22 PM Pablo Estrada  wrote:
>>
>> I use a selection of tox tasks. Here are the tox tasks that I use the most:
>> - py27-gcp
>> - py35-gcp
>> - py27-cython
>> - py35-cython
>> - py35-lint
>> - py27-lint
>>
>> Most recently, I'll run `tox -e py3-gcp,py3-lint`, which run fairly quickly. 
>> You can choose which subset works for you.
>> My insight is: Lints are pretty fast, so it's fine to add a couple different 
>> lints. Unittest runs are pretty slow, so I usually go for the one with most 
>> coverage for my change (x-gcp, or x-cython).
>> Best
>> -P.
>>
>> On Mon, Feb 25, 2019 at 4:33 PM Ruoyun Huang  wrote:
>>>
>>> nvm.  Don't take my previous non-scientific comparison (only ran it once) 
>>> too seriously. :-)
>>>
>>> I tried to repeat each for multiple times and now the difference 
>>> diminishes.  likely there was a transient error in caching.
>>>
>>> On Mon, Feb 25, 2019 at 3:38 PM Kenneth Knowles  wrote:

 Ah, that is likely caused by us having ill-defined tasks that cannot be 
 cached. Or is it that the configuration time is so significant?

 Kenn

 On Mon, Feb 25, 2019 at 11:05 AM Ruoyun Huang  wrote:
>
> Out of curiosity, as a light gradle user, I did a side-by-side comparison, 
> and the readings confirm what Ken and Michael suggest.
>
> In the same repository, do a gradle clean, then run either of the two 
> commands, and measure their runtime respectively. The latter one takes 
> 1/3 of the running time.
>
> time ./gradlew spotlessApply && ./gradlew checkstyleMain && ./gradlew 
> checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain && 
> ./gradlew compileTestJava && ./gradlew compileJava
> real 9m29.330s user 0m11.330s sys 0m1.239s
>
> time ./gradlew spotlessApply checkstyleMain checkstyleTest javadoc 
> findbugsMain compileJava compileTestJava
> real 3m35.573s
> user 0m2.701s
> sys 0m0.327s
>
>
>
>
>
>
>
> On Mon, Feb 25, 2019 at 10:47 AM Alex Amato  wrote:
>>
>> @Michael, no particular reason. I think Ken's suggestion makes more 
>> sense.
>>
>> On Mon, Feb 25, 2019 at 10:36 AM Udi Meiri  wrote:
>>>
>>> Talking about Python:
>>> I only know of "./gradlew lint", which includes style and some py3 
>>> compliance checking.
>>> There is no auto-fix like spotlessApply AFAIK.
>>>
>>> As a side-note, I really dislike our python line continuation indent 
>>> rule, since pycharm can't be configured to adhere to it and I find 
>>> myself manually adjusting whitespace all the time.
>>>
>>>
>>> On Mon, Feb 25, 2019 at 10:22 AM Kenneth Knowles  
>>> wrote:

 FWIW gradle is a depgraph-based build system. You can gain a few 
 seconds by putting all but spotlessApply in one command.

 ./gradlew spotlessApply && ./gradlew checkstyleMain checkstyleTest 
 javadoc findbugsMain compileTestJava compileJava

 It might be clever to define a meta-task. Gradle "base plugin" has the 
 notable tasks check (build and run tests), assemble (make artifacts), and 
 build (assemble + check, badly named!)

 I think something like "everything except running tests and building 
 artifacts" might be helpful.

 Kenn

 On Mon, Feb 25, 2019 at 10:13 AM Alex Amato  wrote:
>
> I made a thread about this a while back for java, but I don't think 
> the same commands, like spotless, work for python.
>
> auto-fixing lint issues
> running quick checks which would fail the PR (without running the 
> whole precommit?)
> Something like findbugs to detect common issues (e.g. py3 compliance)
>
> FWIW, this is what I have been using for java. It will catch pretty 
> much everything except presubmit test failures.
>
> ./gradlew spotlessApply && ./gradlew checkstyleMain && ./gradlew 
> checkstyleTest && ./gradlew javadoc && ./gradlew findbugsMain && 
> ./gradlew compileTestJava && ./gradlew compileJava
>
>
>
> --
> 
> Ruoyun  Huang
>
>>>
>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>


Re: [Announcement] New Website for Beam Summits

2019-03-20 Thread Kenneth Knowles
Very nice. I appreciate the emphasis on coffee [1] [2] [3] though I suspect
there may be a rendering bug.

Kenn

[1] https://beamsummit.org/schedule/2019-06-19?sessionId=1
[2] https://beamsummit.org/schedule/2019-06-19?sessionId=3
[3] https://beamsummit.org/schedule/2019-06-19?sessionId=4

On Tue, Mar 19, 2019 at 4:43 AM Łukasz Gajowy  wrote:

> Looks great! Thanks for doing this! :)
>
> Łukasz
>
> On Tue, Mar 19, 2019 at 12:30 PM Maximilian Michels  wrote:
>
>> Great stuff! Looking forward to seeing many Beam folks in Berlin.
>>
>> In case you want to speak at Beam Summit Europe, the Call for Papers is
>> open until April 1: https://sessionize.com/beam-summit-europe-2019
>>
>> -Max
>>
>> On 19.03.19 09:49, Matthias Baetens wrote:
>> > Awesome Aizhamal! Great work and thanks for your continued efforts on
>> > this :) Looking forward to the summit.
>> >
>> > On Mon, 18 Mar 2019 at 23:17, Aizhamal Nurmamat kyzy
>> > mailto:aizha...@google.com>> wrote:
>> >
>> > Hello everybody!
>> >
>> >
>> > We are thrilled to announce the launch of beamsummit.org
>> >  dedicated to Beam Summits!
>> >
>> >
>> > The current version of the website provides information about the
>> > upcoming Beam Summit in Europe on June 19-20th, 2019. We will update
>> > it for the upcoming summits in Asia and North America accordingly.
>> > You can access all necessary information about the conference theme,
>> > speakers and sessions, the abstract submission timeline and the
>> > registration process, the conference venues and much more that you
>> > will find useful until and during the Beam Summits 2019.
>> >
>> >
>> > We are working to make the website easy to use, so that anyone who
>> > is organizing a Beam event can rely on it. You can find the code for
>> > it in Github .
>> >
>> > The pages will be updated on a regular basis, but we also love
>> > hearing thoughts from our community! Let us know if you have any
>> > questions, comments or suggestions, and help us improve. Also, if
>> > you are thinking of organizing a Beam event, please feel free to
>> > reach out for support, and to use
>> the
>> > code in GitHub as well.
>> >
>> >
>> > We sincerely hope that you like the new Beam Summit website and will
>> > find it useful for accessing information. Enjoy browsing around!
>> >
>> >
>> > Thanks,
>> >
>> > Aizhamal
>> >
>>
>