[jira] [Work logged] (BEAM-5820) Vendor Calcite

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5820?focusedWorklogId=290254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290254
 ]

ASF GitHub Bot logged work on BEAM-5820:


Author: ASF GitHub Bot
Created on: 07/Aug/19 05:54
Start Date: 07/Aug/19 05:54
Worklog Time Spent: 10m 
  Work Description: vectorijk commented on pull request #9189: [BEAM-5820] 
vendor calcite
URL: https://github.com/apache/beam/pull/9189#discussion_r311378668
 
 

 ##
 File path: vendor/calcite-1_19_0/build.gradle
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins { id 'org.apache.beam.vendor-java' }
+
+description = "Apache Beam :: Vendored Dependencies :: Calcite 1.19.0"
+
+group = "org.apache.beam"
+version = "0.1"
+
+def calcite_version = "1.19.0"
+def avatica_version = "1.15.0"
+def prefix = "org.apache.beam.vendor.calcite.v1_19_0"
+
+vendorJava(
+  dependencies: [
+"org.apache.calcite:calcite-core:$calcite_version",
+"org.apache.calcite:calcite-linq4j:$calcite_version",
+"org.apache.calcite.avatica:avatica-core:$avatica_version",
+  ],
+  relocations: [
+"org.apache.calcite": "${prefix}.org.apache.calcite",
+
+// Calcite has Guava on its API surface
+"com.google.common": 
"org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty",
 
 Review comment:
   address this.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290254)
Time Spent: 1h 40m  (was: 1.5h)

> Vendor Calcite
> --
>
> Key: BEAM-5820
> URL: https://issues.apache.org/jira/browse/BEAM-5820
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kai Jiang
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7916) Change ElasticsearchIO query parameter to be a ValueProvider

2019-08-06 Thread Oliver Henlich (JIRA)
Oliver Henlich created BEAM-7916:


 Summary: Change ElasticsearchIO query parameter to be a 
ValueProvider
 Key: BEAM-7916
 URL: https://issues.apache.org/jira/browse/BEAM-7916
 Project: Beam
  Issue Type: Improvement
  Components: io-java-elasticsearch
Affects Versions: 2.14.0
Reporter: Oliver Henlich


We need to be able to perform Elasticsearch queries that are dynamic. The 
problem is that {{ElasticsearchIO.read().withQuery()}} only accepts a string, 
which means the query must be known when the pipeline/Google Dataflow Template 
is built.

It would be great if we could change the parameter on the {{withQuery()}} 
method from {{String}} to {{ValueProvider}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7389) Colab examples for element-wise transforms (Python)

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7389?focusedWorklogId=290250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290250
 ]

ASF GitHub Bot logged work on BEAM-7389:


Author: ASF GitHub Bot
Created on: 07/Aug/19 05:49
Start Date: 07/Aug/19 05:49
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #9260: [BEAM-7389] Add 
code examples for FlatMap page
URL: https://github.com/apache/beam/pull/9260#issuecomment-518950173
 
 
   > Thanks @tvalentyn, I just pushed #9276 adding samples for both MapTuple 
and FlatMapTuple. I added them alongside Map and FlatMap respectively since 
it's just a small variant.
   > 
   > Do you think it would also make sense to have a FilterTuple, 
PartitionTuple, and ParDoTuple as well, since they all take functions?
   
   I wouldn't do this for `ParDo` since it is a general transform. I am not 
sure how often users of `Filter`/`Partition` would encounter the tuple-unpacking 
problem. My initial intuition is: probably not, but I am adding @robertwb, who may 
have a different opinion on this.
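
   For context, a minimal sketch of the tuple-unpacking difference under 
   discussion (illustrative only, not code from the PR; it assumes a Python SDK 
   version that already ships `MapTuple`):

import apache_beam as beam

with beam.Pipeline() as p:
  pairs = p | beam.Create([('a', 1), ('b', 2)])

  # Map passes each tuple to the callable as a single argument.
  with_map = pairs | 'Map' >> beam.Map(lambda kv: '%s-%d' % (kv[0], kv[1]))

  # MapTuple unpacks each tuple into separate positional arguments.
  with_map_tuple = pairs | 'MapTuple' >> beam.MapTuple(
      lambda k, v: '%s-%d' % (k, v))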
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290250)
Time Spent: 33h 10m  (was: 33h)

> Colab examples for element-wise transforms (Python)
> ---
>
> Key: BEAM-7389
> URL: https://issues.apache.org/jira/browse/BEAM-7389
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Rose Nguyen
>Assignee: David Cavazos
>Priority: Minor
>  Time Spent: 33h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5820) Vendor Calcite

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5820?focusedWorklogId=290214&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290214
 ]

ASF GitHub Bot logged work on BEAM-5820:


Author: ASF GitHub Bot
Created on: 07/Aug/19 05:19
Start Date: 07/Aug/19 05:19
Worklog Time Spent: 10m 
  Work Description: vectorijk commented on pull request #9189: [BEAM-5820] 
vendor calcite
URL: https://github.com/apache/beam/pull/9189#discussion_r311371881
 
 

 ##
 File path: vendor/calcite-1_19_0/build.gradle
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins { id 'org.apache.beam.vendor-java' }
+
+description = "Apache Beam :: Vendored Dependencies :: Calcite 1.19.0"
+
+group = "org.apache.beam"
+version = "0.1"
+
+def calcite_version = "1.19.0"
+def avatica_version = "1.15.0"
+def prefix = "org.apache.beam.vendor.calcite.v1_19_0"
+
+vendorJava(
+  dependencies: [
+"org.apache.calcite:calcite-core:$calcite_version",
+"org.apache.calcite:calcite-linq4j:$calcite_version",
+"org.apache.calcite.avatica:avatica-core:$avatica_version",
+  ],
+  relocations: [
+"org.apache.calcite": "${prefix}.org.apache.calcite",
+
+// Calcite has Guava on its API surface
+"com.google.common": 
"org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty",
 
 Review comment:
   addressed
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290214)
Time Spent: 1h 20m  (was: 1h 10m)

> Vendor Calcite
> --
>
> Key: BEAM-5820
> URL: https://issues.apache.org/jira/browse/BEAM-5820
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kai Jiang
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5820) Vendor Calcite

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5820?focusedWorklogId=290217&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290217
 ]

ASF GitHub Bot logged work on BEAM-5820:


Author: ASF GitHub Bot
Created on: 07/Aug/19 05:19
Start Date: 07/Aug/19 05:19
Worklog Time Spent: 10m 
  Work Description: vectorijk commented on pull request #9189: [BEAM-5820] 
vendor calcite
URL: https://github.com/apache/beam/pull/9189#discussion_r311371979
 
 

 ##
 File path: vendor/calcite-1_19_0/build.gradle
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+plugins { id 'org.apache.beam.vendor-java' }
+
+description = "Apache Beam :: Vendored Dependencies :: Calcite 1.19.0"
+
+group = "org.apache.beam"
+version = "0.1"
+
+def calcite_version = "1.19.0"
+def avatica_version = "1.15.0"
+def prefix = "org.apache.beam.vendor.calcite.v1_19_0"
+
+vendorJava(
+  dependencies: [
+"org.apache.calcite:calcite-core:$calcite_version",
+"org.apache.calcite:calcite-linq4j:$calcite_version",
+"org.apache.calcite.avatica:avatica-core:$avatica_version",
+  ],
+  relocations: [
+"org.apache.calcite": "${prefix}.org.apache.calcite",
+
+// Calcite has Guava on its API surface
+"com.google.common": 
"org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty",
+"com.google.thirdparty": 
"org.apache.beam.vendor.guava.v26_0_jre.com.google.thirdparty",
+
+// Making this self-contained, includes Calcite's dependencies
+"org.apache.commons": "${prefix}.org.apache.beam",
 
 Review comment:
   addressed
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290217)
Time Spent: 1.5h  (was: 1h 20m)

> Vendor Calcite
> --
>
> Key: BEAM-5820
> URL: https://issues.apache.org/jira/browse/BEAM-5820
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kai Jiang
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=290210&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290210
 ]

ASF GitHub Bot logged work on BEAM-7738:


Author: ASF GitHub Bot
Created on: 07/Aug/19 04:29
Start Date: 07/Aug/19 04:29
Worklog Time Spent: 10m 
  Work Description: chadrik commented on pull request #9268: [BEAM-7738] 
Add external transform support to PubSubIO
URL: https://github.com/apache/beam/pull/9268#discussion_r311364462
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java
 ##
 @@ -674,6 +680,85 @@ public String toString() {
   abstract Builder setClock(@Nullable Clock clock);
 
   abstract Read build();
+
+  @Override
+  public PTransform> 
buildExternal(External.Configuration config) {
+if (config.topic != null) {
+  StaticValueProvider topic = 
StaticValueProvider.of(utf8String(config.topic));
+  setTopicProvider(NestedValueProvider.of(topic, new 
TopicTranslator()));
+}
+if (config.subscription != null) {
+  StaticValueProvider subscription =
+  StaticValueProvider.of(utf8String(config.subscription));
+  setSubscriptionProvider(
+  NestedValueProvider.of(subscription, new 
SubscriptionTranslator()));
+}
+if (config.idAttribute != null) {
+  String idAttribute = utf8String(config.idAttribute);
+  setIdAttribute(idAttribute);
+}
+if (config.timestampAttribute != null) {
+  String timestampAttribute = utf8String(config.timestampAttribute);
+  setTimestampAttribute(timestampAttribute);
+}
+setNeedsAttributes(config.needsAttributes);
+setPubsubClientFactory(FACTORY);
+if (config.needsAttributes) {
+  SimpleFunction parseFn =
+  (SimpleFunction) new IdentityMessageFn();
+  setParseFn(parseFn);
+  // FIXME: call setCoder(). need to use PubsubMessage proto to be 
compatible with python
 
 Review comment:
   I serialized the `PubsubMessage` using protobufs.  Since there's no 
cross-language coder for `PubsubMessage`, and I assumed it would be overreach 
to add one, I used the bytes coder and then handled converting to and from 
protobufs in code that lives close to the transforms. 
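
   In practice that pattern looks roughly like the sketch below: the external 
   transform moves opaque bytes, and a plain `Map` next to it converts those 
   bytes to and from the Pub/Sub protobuf. The import path and helper name here 
   are assumptions for illustration, not the PR's actual code.

# Assumed location of the generated google.pubsub.v1 protobuf module; the
# real conversion helpers live next to the external transforms in the PR.
from google.cloud.pubsub_v1.proto import pubsub_pb2


def bytes_to_pubsub_message(serialized):
  # Decode the bytes emitted by the Java external read back into a proto message.
  message = pubsub_pb2.PubsubMessage()
  message.ParseFromString(serialized)
  return message


# Usage sketch:
#   messages = (p
#               | ReadFromPubSub(topic='projects/my-project/topics/my-topic')
#               | beam.Map(bytes_to_pubsub_message))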
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290210)
Time Spent: 1h  (was: 50m)

> Support PubSubIO to be configured externally for use with other SDKs
> 
>
> Key: BEAM-7738
> URL: https://issues.apache.org/jira/browse/BEAM-7738
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp, runner-flink, sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Labels: portability
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Now that KafkaIO is supported via the external transform API (BEAM-7029) we 
> should add support for PubSub.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7738?focusedWorklogId=290208&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290208
 ]

ASF GitHub Bot logged work on BEAM-7738:


Author: ASF GitHub Bot
Created on: 07/Aug/19 04:23
Start Date: 07/Aug/19 04:23
Worklog Time Spent: 10m 
  Work Description: chadrik commented on pull request #9268: [BEAM-7738] 
Add external transform support to PubSubIO
URL: https://github.com/apache/beam/pull/9268#discussion_r311363606
 
 

 ##
 File path: sdks/python/apache_beam/io/external/gcp/pubsub.py
 ##
 @@ -0,0 +1,131 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+
+from apache_beam import ExternalTransform
+from apache_beam import pvalue
+from apache_beam.coders import BytesCoder
+from apache_beam.coders import FastPrimitivesCoder
+from apache_beam.coders.coders import LengthPrefixCoder
+from apache_beam.portability.api.external_transforms_pb2 import ConfigValue
+from apache_beam.portability.api.external_transforms_pb2 import 
ExternalConfigurationPayload
+from apache_beam.transforms import ptransform
+
+
+class ReadFromPubSub(ptransform.PTransform):
+  """An external ``PTransform`` for reading from Cloud Pub/Sub."""
+
+  _urn = 'beam:external:java:pubsub:read:v1'
+
+  def __init__(self, topic=None, subscription=None, id_label=None,
+   with_attributes=False, timestamp_attribute=None,
+   expansion_service='localhost:8097'):
+super(ReadFromPubSub, self).__init__()
+self.topic = topic
+self.subscription = subscription
+self.id_label = id_label
+self.with_attributes = with_attributes
+self.timestamp_attribute = timestamp_attribute
+self.expansion_service = expansion_service
+
+  def expand(self, pbegin):
+if not isinstance(pbegin, pvalue.PBegin):
+  raise Exception("ReadFromPubSub must be a root transform")
+
+args = {}
+
+if self.topic is not None:
+  args['topic'] = _encode_str(self.topic)
+
+if self.subscription is not None:
+  args['subscription'] = _encode_str(self.subscription)
+
+if self.id_label is not None:
+  args['id_label'] = _encode_str(self.id_label)
+
+# FIXME: how do we encode a bool so that Java can decode it?
+# args['with_attributes'] = _encode_bool(self.with_attributes)
 
 Review comment:
   I encoded as int and handled the cast in java
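
   A hedged sketch of what that can look like on the Python side, using the 
   SDK's `VarIntCoder` (the helper name mirrors the commented-out `_encode_bool` 
   call above; the exact wiring in the PR may differ):

from apache_beam.coders import VarIntCoder


def _encode_bool(value):
  # Encode True/False as a varint (1 or 0); the Java expansion service can then
  # decode the number and cast it back to a boolean.
  return VarIntCoder().encode(int(value))


# e.g. args['with_attributes'] = _encode_bool(self.with_attributes)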
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290208)
Time Spent: 50m  (was: 40m)

> Support PubSubIO to be configured externally for use with other SDKs
> 
>
> Key: BEAM-7738
> URL: https://issues.apache.org/jira/browse/BEAM-7738
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp, runner-flink, sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Labels: portability
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Now that KafkaIO is supported via the external transform API (BEAM-7029) we 
> should add support for PubSub.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7880) Upgrade Jackson databind to version 2.9.9.3

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7880?focusedWorklogId=290182&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290182
 ]

ASF GitHub Bot logged work on BEAM-7880:


Author: ASF GitHub Bot
Created on: 07/Aug/19 03:00
Start Date: 07/Aug/19 03:00
Worklog Time Spent: 10m 
  Work Description: wenbin9 commented on issue #9229: [BEAM-7880] Upgrade 
Jackson databind to version 2.9.9.3
URL: https://github.com/apache/beam/pull/9229#issuecomment-518919971
 
 
   Any plans to release 2.9.9.3?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290182)
Time Spent: 3h 10m  (was: 3h)

> Upgrade Jackson databind to version 2.9.9.3
> ---
>
> Key: BEAM-7880
> URL: https://issues.apache.org/jira/browse/BEAM-7880
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system, sdk-java-core
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Jackson databind 2.9.9 and earlier versions have multiple CVEs:
> https://www.cvedetails.com/cve/CVE-2019-12814
> https://www.cvedetails.com/cve/CVE-2019-12384
> https://www.cvedetails.com/cve/CVE-2019-14379
> https://www.cvedetails.com/cve/CVE-2019-14439



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290171
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:10
Start Date: 07/Aug/19 02:10
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311338409
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
 
 Review comment:
   Agreed; since neither is used yet, I removed them from the constructor, as 
suggested.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290171)
Time Spent: 35h  (was: 34h 50m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 35h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290166&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290166
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:08
Start Date: 07/Aug/19 02:08
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339684
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
 
 Review comment:
   Good point! Fixed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290166)
Time Spent: 34.5h  (was: 34h 20m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 34.5h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290167&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290167
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:08
Start Date: 07/Aug/19 02:08
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311340039
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
+   label='Bigtable Filter')
+}
+
+
+class ReadFromBigTable(beam.PTransform):
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+""" The PTransform to access the Bigtable Read connector
+
+Args:
+  project_id: [str] GCP Project of to read the Rows
+  instance_id): [str] GCP Instance to read the Rows
+  table_id): [str] GCP Table to read the Rows
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._beam_options = {'project_id': project_id,
+ 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290168&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290168
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:08
Start Date: 07/Aug/19 02:08
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311341218
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,187 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a 
Bigtable Table.
+  """
+  def __init__(self):
+super(self.__class__, self).__init__()
+self.beam_options = {'project_id': PROJECT_ID,
+ 'instance_id': INSTANCE_ID,
+ 'table_id': TABLE_ID}
+
+  def _generate(self):
+for i in range(ROW_COUNT):
+  key = "key_%s" % ('{0:012}'.format(i))
+  test_row = row.DirectRow(row_key=key)
+  value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in 
range(CELL_SIZE))
+  for j in range(COLUMN_COUNT):
+test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+  column=('field%s' % j).encode('utf-8'),
+  value=value,
+  timestamp=datetime.datetime.now())
+  yield test_row
+
+  def expand(self, pvalue):
+return (pvalue
+| beam.Create(self._generate())
+| 
bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+ 
instance_id=self.beam_options['instance_id'],
+ 
table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then 
reading them and comparing the counters
+  """
+  def setUp(self):
+self.result = None
+self.table = Client(project=PROJECT_ID, admin=True)\
+.instance(instance_id=INSTANCE_ID)\
+.table(TABLE_ID)
+
+if not self.table.exists():
+  column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+  self.table.create(column_families=column_families)
+  logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  def test_bigtable_io(self):
 
 Review comment:
   Again, thanks for the tip. Done and noted for the future.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290168)
Time Spent: 34h 50m  (was: 34h 40m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290165&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290165
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:08
Start Date: 07/Aug/19 02:08
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339147
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
 
 Review comment:
   Done.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290165)
Time Spent: 34h 20m  (was: 34h 10m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 34h 20m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290163
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:07
Start Date: 07/Aug/19 02:07
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311336617
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
+   label='Bigtable Filter')
+}
+
+
+class ReadFromBigTable(beam.PTransform):
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+""" The PTransform to access the Bigtable Read connector
+
+Args:
+  project_id: [str] GCP Project of to read the Rows
+  instance_id): [str] GCP Instance to read the Rows
+  table_id): [str] GCP Table to read the Rows
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._beam_options = {'project_id': project_id,
+ 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290164&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290164
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:07
Start Date: 07/Aug/19 02:07
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339073
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
 
 Review comment:
   Totally, "inherited" from attempts to implement a splittable `DoFn`. Removed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290164)
Time Spent: 34h 10m  (was: 34h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 34h 10m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290162&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290162
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:07
Start Date: 07/Aug/19 02:07
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311336063
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
 
 Review comment:
   Done.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290162)
Time Spent: 33h 50m  (was: 33h 40m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290161&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290161
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:07
Start Date: 07/Aug/19 02:07
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311335283
 
 

 ##
 File path: .gitignore
 ##
 @@ -85,5 +85,3 @@ sdks/python/postcommit_requirements.txt
 # This is typically in files named 'src.xml' throughout this repository.
 
 # JetBrains Education files
-!**/study_project.xml
-**/.coursecreator/**/*
 
 Review comment:
   Must've been erased accidentally while resolving conflicts. My bad. Restored.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290161)
Time Spent: 33h 40m  (was: 33.5h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 33h 40m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290159&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290159
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:06
Start Date: 07/Aug/19 02:06
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on issue #8457: [BEAM-3342] Create a 
Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#issuecomment-518910019
 
 
   @pabloem Thanks for the detailed review! Also removed a couple of imports 
that no longer seem to be used. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290159)
Time Spent: 33h 20m  (was: 33h 10m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 33h 20m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290160
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:06
Start Date: 07/Aug/19 02:06
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on issue #8457: [BEAM-3342] Create a 
Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#issuecomment-518910019
 
 
   @pabloem Thanks for the detailed review! Also removed a couple of imports 
that no longer seem to be used. Please have a look.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290160)
Time Spent: 33.5h  (was: 33h 20m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 33.5h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290157=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290157
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 02:02
Start Date: 07/Aug/19 02:02
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311342404
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,187 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a 
Bigtable Table.
+  """
+  def __init__(self):
+super(self.__class__, self).__init__()
+self.beam_options = {'project_id': PROJECT_ID,
+ 'instance_id': INSTANCE_ID,
+ 'table_id': TABLE_ID}
+
+  def _generate(self):
+for i in range(ROW_COUNT):
+  key = "key_%s" % ('{0:012}'.format(i))
+  test_row = row.DirectRow(row_key=key)
+  value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in range(CELL_SIZE))
+  for j in range(COLUMN_COUNT):
+test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+  column=('field%s' % j).encode('utf-8'),
+  value=value,
+  timestamp=datetime.datetime.now())
+  yield test_row
+
+  def expand(self, pvalue):
+return (pvalue
+| beam.Create(self._generate())
+| bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+     instance_id=self.beam_options['instance_id'],
+     table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then 
reading them and comparing the counters
+  """
+  def setUp(self):
+self.result = None
+self.table = Client(project=PROJECT_ID, admin=True)\
+.instance(instance_id=INSTANCE_ID)\
+.table(TABLE_ID)
+
+if not self.table.exists():
+  column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+  self.table.create(column_families=column_families)
+  logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  def test_bigtable_io(self):
+print 'Project ID: ', PROJECT_ID
+print 'Instance ID:', INSTANCE_ID
+print 'Table ID:   ', TABLE_ID
+
+pipeline_options = PipelineOptions(pipeline_parameters(job_name=make_job_name()))
+p = beam.Pipeline(options=pipeline_options)
+_ = (p | 'Write Test Rows' >> GenerateTestRows())
+
+self.result = p.run()
+self.result.wait_until_finish()
+
+assert self.result.state == PipelineState.DONE
+
+if not hasattr(self.result, 'has_job') or self.result.has_job:
+  query_result = self.result.metrics().query(MetricsFilter().with_name('Written Row'))
+  if query_result['counters']:
+read_counter = query_result['counters'][0]
+logging.info('Number of Rows written: %d', read_counter.committed)
+

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290155=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290155
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:55
Start Date: 07/Aug/19 01:55
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311341218
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,187 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a 
Bigtable Table.
+  """
+  def __init__(self):
+super(self.__class__, self).__init__()
+self.beam_options = {'project_id': PROJECT_ID,
+ 'instance_id': INSTANCE_ID,
+ 'table_id': TABLE_ID}
+
+  def _generate(self):
+for i in range(ROW_COUNT):
+  key = "key_%s" % ('{0:012}'.format(i))
+  test_row = row.DirectRow(row_key=key)
+  value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in range(CELL_SIZE))
+  for j in range(COLUMN_COUNT):
+test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+  column=('field%s' % j).encode('utf-8'),
+  value=value,
+  timestamp=datetime.datetime.now())
+  yield test_row
+
+  def expand(self, pvalue):
+return (pvalue
+| beam.Create(self._generate())
+| bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+     instance_id=self.beam_options['instance_id'],
+     table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then 
reading them and comparing the counters
+  """
+  def setUp(self):
+self.result = None
+self.table = Client(project=PROJECT_ID, admin=True)\
+.instance(instance_id=INSTANCE_ID)\
+.table(TABLE_ID)
+
+if not self.table.exists():
+  column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+  self.table.create(column_families=column_families)
+  logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  def test_bigtable_io(self):
 
 Review comment:
   Again, thanks for the tip. Done [pending next commit] and noted for the 
future.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290155)
Time Spent: 33h  (was: 32h 50m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290152
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:48
Start Date: 07/Aug/19 01:48
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311340039
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
+   label='Bigtable Filter')
+}
+
+
+class ReadFromBigTable(beam.PTransform):
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+""" The PTransform to access the Bigtable Read connector
+
+Args:
+  project_id: [str] GCP Project of to read the Rows
+  instance_id): [str] GCP Instance to read the Rows
+  table_id): [str] GCP Table to read the Rows
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._beam_options = {'project_id': project_id,
+ 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290151=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290151
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:45
Start Date: 07/Aug/19 01:45
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339684
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
 
 Review comment:
   Good point! Fixed. [pending next commit]
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290151)
Time Spent: 32h 40m  (was: 32.5h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 32h 40m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290146=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290146
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:42
Start Date: 07/Aug/19 01:42
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339147
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
 
 Review comment:
   Done. [pending next commit]
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290146)
Time Spent: 32.5h  (was: 32h 20m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 32.5h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290145=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290145
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:41
Start Date: 07/Aug/19 01:41
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311339073
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
 
 Review comment:
   Totally, "inherited" from attempts to implement a splittable `DoFn`. 
Removed. [pending next commit]
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290145)
Time Spent: 32h 20m  (was: 32h 10m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 32h 20m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290144=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290144
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:40
Start Date: 07/Aug/19 01:40
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311338836
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
 
 Review comment:
   Yep, another accidental mixup, sorry about that. Corrected, this should read
   ```
 self.row_count.inc()
   ```
   where the `self.row_count` is defined in the `_initialize` method:
   ```
   self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
   ```
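For context, here is a minimal, self-contained sketch of the pattern being described: a Beam metric counter created once on the DoFn and incremented in `process`. The class name `ExampleCountingFn` and the counter label are placeholders, not the connector's actual code.

```
import apache_beam as beam
from apache_beam.metrics.metric import Metrics


class ExampleCountingFn(beam.DoFn):
  """Placeholder DoFn that counts the elements it processes."""

  def __init__(self):
    super(ExampleCountingFn, self).__init__()
    # Same pattern as the 'Rows read' counter discussed above.
    self.row_count = Metrics.counter(self.__class__.__name__, 'Elements processed')

  def process(self, element):
    self.row_count.inc()
    yield element


with beam.Pipeline() as p:
  _ = p | beam.Create([1, 2, 3]) | beam.ParDo(ExampleCountingFn())
```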
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290144)
Time Spent: 32h 10m  (was: 32h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 32h 10m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290143=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290143
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:37
Start Date: 07/Aug/19 01:37
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311338447
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
 
 Review comment:
   It will be set to {} by the constructor.
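For readers following the thread, merging the tracker's `_id` range into a user-supplied filter could look roughly like the sketch below. It only illustrates the idea under discussion; the helper name and the use of `$and` are assumptions, not the PR's exact implementation.

```
def merge_id_filter(user_filter, start_id, end_id):
  """Combine a user filter with a half-open _id range for one split."""
  id_range = {'_id': {'$gte': start_id, '$lt': end_id}}
  if not user_filter:
    return id_range
  # $and keeps any _id condition the user supplied from being overwritten.
  return {'$and': [user_filter.copy(), id_range]}
```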
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290143)
Time Spent: 6h 40m  (was: 6.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
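To make the fix direction concrete: instead of index offsets, each shard can issue a stable half-open `_id` range query, roughly as in the sketch below (pymongo used directly; database, collection, and bounds are placeholders).

```
from pymongo import ASCENDING, MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['my_db']['my_coll']


def read_shard(start_id, end_id):
  """Yield documents whose _id falls in [start_id, end_id)."""
  # The half-open range stays correct even if documents are inserted
  # concurrently outside this shard's bounds.
  query = {'_id': {'$gte': start_id, '$lt': end_id}}
  for doc in coll.find(query).sort('_id', ASCENDING):
    yield doc
```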



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290142=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290142
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:37
Start Date: 07/Aug/19 01:37
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311338409
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, end_key=None, filter_=b''):
 
 Review comment:
   Agreed, since neither of them is used yet. Removed them from the constructor. 
On the other hand, since they aren't defined anywhere in this version of the 
code, maybe we can remove them from the process method too. What do you think?
   
   ```
  def process(self, element, **kwargs):
    for row in self.table.read_rows(start_key=self._beam_options['start_key'],
                                    end_key=self._beam_options['end_key'],
                                    filter_=self._beam_options['filter_']):
      self.written.inc()
      yield row
   ```
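For reference, a standalone sketch of the underlying client call the DoFn wraps; project, instance, and table names are placeholders, and it assumes a google-cloud-bigtable version whose `read_rows` result is directly iterable.

```
from google.cloud.bigtable import Client

table = (Client(project='my-project')
         .instance('my-instance')
         .table('my-table'))

# With no start_key/end_key the call scans the whole table, which is what the
# simplified process() above would do.
for bt_row in table.read_rows():
  print(bt_row.row_key)
```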
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290142)
Time Spent: 32h  (was: 31h 50m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 32h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290141=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290141
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:35
Start Date: 07/Aug/19 01:35
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311338085
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
  def split(self, desired_bundle_size, start_position=None, stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 
 Review comment:
   Is there a situation where one of the positions is None and the other isn't?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290141)
Time Spent: 6.5h  (was: 6h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290140=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290140
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:35
Start Date: 07/Aug/19 01:35
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311338009
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
  def split(self, desired_bundle_size, start_position=None, stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
 
 Review comment:
   yes
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290140)
Time Spent: 6h 20m  (was: 6h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290138=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290138
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:33
Start Date: 07/Aug/19 01:33
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311337745
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
 
 Review comment:
   normally means the collection is invalid or empty.
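For anyone who wants to reproduce that check outside the pipeline, here is a quick sketch using pymongo directly (connection string, database, and collection names are placeholders):

```
from pymongo import MongoClient

with MongoClient('mongodb://localhost:27017') as client:
  # Same 'collstats' command the source's estimate_size() issues.
  stats = client['my_db'].command('collstats', 'my_coll')
  size = stats.get('size')
  if size is None or size <= 0:
    print('Collection my_coll not found or empty')
  else:
    print('Approximate collection size: %d bytes' % size)
```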
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290138)
Time Spent: 6h 10m  (was: 6h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290136=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290136
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:28
Start Date: 07/Aug/19 01:28
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311336933
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +34,136 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdHelper
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  """Fake mongodb collection cursor."""
+
+  def __init__(self, docs):
+self.docs = docs
+
+  def _filter(self, filter):
+match = []
+if not filter:
+  return self
+start = filter['_id'].get('$gte')
+end = filter['_id'].get('$lt')
+assert start is not None
+assert end is not None
+for doc in self.docs:
+  if start and doc['_id'] < start:
+continue
+  if end and doc['_id'] >= end:
+continue
+  match.append(doc)
+return match
+
+  def find(self, filter=None, **kwargs):
+return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+key, order = sort_items[0]
+self.docs = sorted(self.docs,
+   key=lambda x: x[key],
+   reverse=(order != ASCENDING))
+return self
+
+  def limit(self, num):
+return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+return len(self._filter(filter))
+
+  def __getitem__(self, index):
+return self.docs[index]
+
+
+class _MockMongoDb(object):
+  """Fake Mongo Db."""
+
+  def __init__(self, docs):
+self.docs = docs
+
+  def __getitem__(self, coll_name):
+return _MockMongoColl(self.docs)
+
+  def command(self, command, *args, **kwargs):
+if command == 'collstats':
+  return {'size': 5, 'avgSize': 1}
+elif command == 'splitVector':
+  return self.get_split_key(command, *args, **kwargs)
+
+  def get_split_key(self, command, ns, min, max, maxChunkSize, **kwargs):
+# simulate mongo db splitVector command, return split keys base on chunk
+# size, assuming every doc is of size 1mb
+start_id = min['_id']
 
 Review comment:
   These are the argument keys required by the mongo client.
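For anyone unfamiliar with the command being mocked, the real `splitVector` call looks roughly like the sketch below, mirroring the keyword arguments in the diff; the namespace and the `_id` bounds are placeholders.

```
from bson.objectid import ObjectId
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
result = client['my_db'].command(
    'splitVector',
    'my_db.my_coll',                  # namespace is '<db>.<collection>'
    keyPattern={'_id': 1},
    min={'_id': ObjectId('0' * 24)},  # placeholder lower bound
    max={'_id': ObjectId('f' * 24)},  # placeholder upper bound
    maxChunkSize=1)                   # split into roughly 1 MB chunks
split_keys = result['splitKeys']      # boundary documents, e.g. {'_id': ObjectId(...)}
```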
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290136)
Time Spent: 6h  (was: 5h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290135=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290135
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:26
Start Date: 07/Aug/19 01:26
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311336617
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
+   label='Bigtable Filter')
+}
+
+
+class ReadFromBigTable(beam.PTransform):
+  def __init__(self, project_id, instance_id, table_id, filter_=b''):
+""" The PTransform to access the Bigtable Read connector
+
+Args:
+  project_id: [str] GCP Project of to read the Rows
+  instance_id): [str] GCP Instance to read the Rows
+  table_id): [str] GCP Table to read the Rows
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._beam_options = {'project_id': project_id,
+ 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290133=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290133
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:25
Start Date: 07/Aug/19 01:25
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311336422
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
 
 Review comment:
   thought I fixed this earlier, weird.
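
For reference, the guard being discussed in the hunk above reads `if bundle_start is not None or bundle_start >= stop_position:`. The intended condition is presumably the following (an illustrative drop-in for that line, not the committed code):

  # inside the `for split_key_id in split_keys:` loop shown in the diff:
  # stop emitting bundles once the range start is missing or already past
  # the end of the requested range
  if bundle_start is None or bundle_start >= stop_position:
    break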
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290133)
Time Spent: 5h 40m  (was: 5.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290134=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290134
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:25
Start Date: 07/Aug/19 01:25
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311336434
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
-return OffsetRangeTracker(start_position, stop_position)
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
+return _ObjectIdRangeTracker(start_position, stop_position)
 
 Review comment:
   yes
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290134)
Time Spent: 5h 50m  (was: 5h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290132=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290132
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:23
Start Date: 07/Aug/19 01:23
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311336063
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio.py
 ##
 @@ -122,22 +129,148 @@ class WriteToBigTable(beam.PTransform):
   A PTransform that write a list of `DirectRow` into the Bigtable Table
 
   """
-  def __init__(self, project_id=None, instance_id=None,
-   table_id=None):
+  def __init__(self, project_id=None, instance_id=None, table_id=None):
 """ The PTransform to access the Bigtable Write connector
 Args:
   project_id(str): GCP Project of to write the Rows
   instance_id(str): GCP Instance to write the Rows
   table_id(str): GCP Table to write the `DirectRows`
 """
 super(WriteToBigTable, self).__init__()
-self.beam_options = {'project_id': project_id,
+self._beam_options = {'project_id': project_id,
  'instance_id': instance_id,
  'table_id': table_id}
 
   def expand(self, pvalue):
-beam_options = self.beam_options
+beam_options = self._beam_options
 return (pvalue
 | beam.ParDo(_BigTableWriteFn(beam_options['project_id'],
   beam_options['instance_id'],
   beam_options['table_id'])))
+
+
+class _BigtableReadFn(beam.DoFn):
+  """ Creates the connector that can read rows for Beam pipeline
+
+  Args:
+project_id(str): GCP Project ID
+instance_id(str): GCP Instance ID
+table_id(str): GCP Table ID
+
+  """
+
+  def __init__(self, project_id, instance_id, table_id, start_key=None, 
end_key=None, filter_=b''):
+""" Constructor of the Read connector of Bigtable
+
+Args:
+  project_id: [str] GCP Project of to write the Rows
+  instance_id: [str] GCP Instance to write the Rows
+  table_id: [str] GCP Table to write the `DirectRows`
+  filter_: [RowFilter] Filter to apply to columns in a row.
+"""
+super(self.__class__, self).__init__()
+self._initialize({'project_id': project_id,
+  'instance_id': instance_id,
+  'table_id': table_id,
+  'start_key': start_key,
+  'end_key': end_key,
+  'filter_': filter_})
+
+  def __getstate__(self):
+return self._beam_options
+
+  def __setstate__(self, options):
+self._initialize(options)
+
+  def _initialize(self, options):
+self._beam_options = options
+self.table = None
+self.sample_row_keys = None
+self.row_count = Metrics.counter(self.__class__.__name__, 'Rows read')
+
+  def start_bundle(self):
+if self.table is None:
+  self.table = Client(project=self._beam_options['project_id'])\
+.instance(self._beam_options['instance_id'])\
+.table(self._beam_options['table_id'])
+
+  def process(self, element, **kwargs):
+for row in self.table.read_rows(start_key=self._beam_options['start_key'],
+end_key=self._beam_options['end_key'],
+filter_=self._beam_options['filter_']):
+  self.written.inc()
+  yield row
+
+  def get_initial_restriction(self, element):
+pass
+
+  def finish_bundle(self):
+  pass
+
+  def display_data(self):
+return {'projectId': DisplayDataItem(self._beam_options['project_id'],
+ label='Bigtable Project Id'),
+'instanceId': DisplayDataItem(self._beam_options['instance_id'],
+  label='Bigtable Instance Id'),
+'tableId': DisplayDataItem(self._beam_options['table_id'],
+   label='Bigtable Table Id'),
+'filter_': DisplayDataItem(self._beam_options['filter_'],
 
 Review comment:
   Done. [pending next commit]
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290132)
Time Spent: 31h 40m  (was: 31.5h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290131=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290131
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:18
Start Date: 07/Aug/19 01:18
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311335283
 
 

 ##
 File path: .gitignore
 ##
 @@ -85,5 +85,3 @@ sdks/python/postcommit_requirements.txt
 # This is typically in files named 'src.xml' throughout this repository.
 
 # JetBrains Education files
-!**/study_project.xml
-**/.coursecreator/**/*
 
 Review comment:
   Must've been erased accidentally while resolving conflicts. My bad. 
Restored. [pending next commit]
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290131)
Time Spent: 31.5h  (was: 31h 20m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 31.5h
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=290130=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290130
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 07/Aug/19 01:18
Start Date: 07/Aug/19 01:18
Worklog Time Spent: 10m 
  Work Description: mf2199 commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r311335283
 
 

 ##
 File path: .gitignore
 ##
 @@ -85,5 +85,3 @@ sdks/python/postcommit_requirements.txt
 # This is typically in files named 'src.xml' throughout this repository.
 
 # JetBrains Education files
-!**/study_project.xml
-**/.coursecreator/**/*
 
 Review comment:
   Must've been erased accidentally while resolving conflicts. My bad. 
Restored, pending next commit.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290130)
Time Spent: 31h 20m  (was: 31h 10m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 31h 20m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290110=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290110
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311324039
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
 
 Review comment:
   Or it just didn't contain key '_id' for some reason ?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290110)
Time Spent: 3h 50m  (was: 3h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290107=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290107
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311311504
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
 
 Review comment:
   Is this an invalid state, or is it just that the size cannot be determined ? 
If the latter, you can just return None: 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L134
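
A minimal sketch of that suggestion as a standalone helper (names are illustrative and assumed, not the PR's code): return None when collstats yields no usable size instead of raising, which BoundedSource.estimate_size() permits.

  from pymongo import MongoClient

  def estimate_collection_size(uri, db_name, coll_name, **client_params):
    """Return the collection data size in bytes, or None if unknown."""
    with MongoClient(uri, **client_params) as client:
      size = client[db_name].command('collstats', coll_name).get('size')
      # estimate_size() may return None when the size cannot be determined,
      # so avoid raising here.
      return size if size and size > 0 else None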
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290107)
Time Spent: 3h 20m  (was: 3h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290121=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290121
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311325179
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdHelper(object):
+  """A Utility class to bson object ids."""
+
+  @classmethod
+  def id_to_int(cls, id):
+# converts object id binary to integer
+# id object is bytes type with size of 12
+ints = struct.unpack('>III', id.binary)
+return (ints[0] << 64) + (ints[1] << 32) + ints[2]
+
+  @classmethod
+  def int_to_id(cls, number):
+# converts integer value to object id. Int value should be less than
+# (2 ^ 96) so it can be convert to 12 bytes required by object id.
+if number < 0 or number >= (1 << 96):
+  raise ValueError('number value must be within [0, %s)' % (1 << 96))
+ints = [(number & 0xffffffff0000000000000000) >> 64,
+(number & 0x00000000ffffffff00000000) >> 32,
+number & 0x0000000000000000ffffffff]
+
+bytes = struct.pack('>III', *ints)
+return objectid.ObjectId(bytes)
+
+  @classmethod
+  def increment_id(cls, object_id, inc):
+# increment object_id binary value by inc value and return new object id.
+id_number = _ObjectIdHelper.id_to_int(object_id)
+new_number = id_number + inc
+if new_number < 0 or new_number >= (1 << 96):
+  raise ValueError('invalid incremental, inc value must be within ['
 
 Review comment:
   Seems like we already do this validation inside 'int_to_id' function invoked 
below ?
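
One way to avoid the duplicated range check, sketched as standalone functions (illustrative, assuming pymongo's bson package; not the PR code): let the int-to-ObjectId conversion hold the only validation, and have the increment delegate to it.

  import struct

  from bson import objectid


  def int_to_object_id(number):
    """Convert a non-negative integer below 2**96 to a 12-byte ObjectId."""
    # single range check lives here
    if number < 0 or number >= (1 << 96):
      raise ValueError('number value must be within [0, %s)' % (1 << 96))
    ints = [(number >> 64) & 0xffffffff,
            (number >> 32) & 0xffffffff,
            number & 0xffffffff]
    return objectid.ObjectId(struct.pack('>III', *ints))


  def increment_object_id(oid, inc):
    """Shift an ObjectId by inc; the range check happens only in int_to_object_id."""
    ints = struct.unpack('>III', oid.binary)
    return int_to_object_id((ints[0] << 64) + (ints[1] << 32) + ints[2] + inc)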
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290121)
Time Spent: 5.5h  (was: 5h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290111=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290111
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311323214
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
 
 Review comment:
   What does key '_id' contain ? Please add a comment. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290111)
Time Spent: 4h  (was: 3h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
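
To restate the failure mode described above as a runnable toy (plain Python, no Beam involved): the second "shard" re-runs the query after a concurrent insert, so its offset range no longer lines up with what the first shard read.

  docs = [10, 20, 30, 40, 50]   # stand-in for the collection, sorted by _id

  shard_a = sorted(docs)[0:3]   # first shard reads offsets 0..2 -> [10, 20, 30]
  docs.append(25)               # concurrent insert while the job is running
  shard_b = sorted(docs)[3:6]   # second shard re-queries offsets 3..5 -> [30, 40, 50]

  print(shard_a, shard_b)       # 30 is read twice, 25 is never read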



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290115=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290115
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311320582
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 
 Review comment:
   We had the same logic above. Can we move start and end position computation 
logic to a util function ?
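
A sketch of such a helper, assuming the _get_head_document_id and _ObjectIdHelper members from this diff and pymongo's ASCENDING/DESCENDING constants (the name is illustrative):

  def _replace_none_positions(self, start_position, stop_position):
    # Fill in missing endpoints of the half-open [start, stop) _id range.
    if start_position is None:
      start_position = self._get_head_document_id(ASCENDING)
    if stop_position is None:
      last_doc_id = self._get_head_document_id(DESCENDING)
      # +1 so the last document is not excluded from the half-open range
      stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
    return start_position, stop_position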
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290115)
Time Spent: 4.5h  (was: 4h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290114=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290114
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311321842
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
 
 Review comment:
   Can you add a comment on what this function does (prob. we need more 
comments in general clarifying non-obvious parts of the code).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290114)
Time Spent: 4h 20m  (was: 4h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290109=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290109
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311324493
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdHelper(object):
+  """A Utility class to bson object ids."""
+
+  @classmethod
+  def id_to_int(cls, id):
+# converts object id binary to integer
 
 Review comment:
   Please add a proper doc comment including variables if possible. (even 
though containing class is private, better to properly describe these functions 
that contain non-trivial byte manipulations).
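
For instance, a fuller doc comment for id_to_int might read as follows (wording illustrative, body unchanged from the diff):

  @classmethod
  def id_to_int(cls, id):
    """Convert a bson ObjectId into a non-negative 96-bit integer.

    Args:
      id: an ObjectId whose .binary attribute is 12 big-endian bytes.

    Returns:
      The integer obtained by interpreting those 12 bytes as three
      unsigned 32-bit big-endian words.
    """
    ints = struct.unpack('>III', id.binary)
    return (ints[0] << 64) + (ints[1] << 32) + ints[2]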
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290109)
Time Spent: 3h 40m  (was: 3.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290117=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290117
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311329296
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+source=self,
+start_position=bundle_start,
+stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
-return OffsetRangeTracker(start_position, stop_position)
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
+return _ObjectIdRangeTracker(start_position, stop_position)
 
 Review comment:
   Are object IDs directly comparable (as required by 
OrderedPositionRangeTracker)?
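
bson's ObjectId implements the rich comparison operators over its 12-byte value, so ids can be ordered directly; a quick check (illustrative):

  from bson.objectid import ObjectId

  a = ObjectId('000000000000000000000001')
  b = ObjectId('000000000000000000000002')
  assert a < b and b > a and a != b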
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290117)
Time Spent: 4h 50m  (was: 4h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290108=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290108
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311323133
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
 
 Review comment:
   Looks like self.filter can be None (default) ?
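
A small standalone sketch of the merge with a guard for filter=None (names are illustrative, not the PR code); like the PR's .copy(), this takes a shallow copy of the user filter.

  def merge_id_filter(base_filter, start, stop):
    """Merge an _id range [start, stop) into a possibly-None user filter."""
    all_filters = dict(base_filter or {})
    id_filter = all_filters.setdefault('_id', {})
    # Tighten existing bounds rather than overwriting them.
    id_filter['$gte'] = max(id_filter['$gte'], start) if '$gte' in id_filter else start
    id_filter['$lt'] = min(id_filter['$lt'], stop) if '$lt' in id_filter else stop
    return all_filters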
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290108)
Time Spent: 3.5h  (was: 3h 20m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
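
For readers skimming the thread: the direction of the fix under review is to key shards
off `_id` ranges rather than result indexes, so each shard's query is self-contained and
concurrent inserts cannot shift or duplicate its slice. A hedged, standalone sketch of
that idea using pymongo (the helper and names are hypothetical, not Beam's implementation):

    # Hedged sketch: read one shard as an explicit _id range instead of an
    # index offset into a re-executed query. Assumes pymongo is installed.
    from pymongo import MongoClient

    def read_id_range(uri, db, coll, start_id, stop_id, query_filter=None):
        """Return documents whose _id falls in [start_id, stop_id)."""
        merged = dict(query_filter or {})
        merged['_id'] = {'$gte': start_id, '$lt': stop_id}
        with MongoClient(uri) as client:
            # Each shard issues its own bounded query; documents inserted
            # outside the range cannot shift or duplicate this shard's slice.
            return list(client[db][coll].find(merged))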



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290118=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290118
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311324139
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdHelper(object):
+  """A Utility class to bson object ids."""
 
 Review comment:
   "...to manipulate bson object ids."
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290118)
Time Spent: 5h  (was: 4h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290116=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290116
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311323636
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
 
 Review comment:
   What does key '$gte' contain ? Please add a comment.  (same for other 
special keys used here).
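
For context on the operators being asked about (my reading, not part of the patch):
'$gte' and '$lt' are MongoDB's greater-or-equal / less-than comparison operators, so the
merge clamps any user-supplied _id bounds to the range tracker's bounds. A small hedged
illustration with made-up values:

    # Hedged illustration of clamping a user _id filter to a tracker range.
    user_filter = {'_id': {'$gte': 10}}      # user asks for ids >= 10
    tracker_start, tracker_stop = 20, 50     # this shard covers [20, 50)

    merged = dict(user_filter)
    id_filter = dict(merged.get('_id', {}))
    id_filter['$gte'] = max(id_filter.get('$gte', tracker_start), tracker_start)
    id_filter['$lt'] = min(id_filter.get('$lt', tracker_stop), tracker_stop)
    merged['_id'] = id_filter

    assert merged == {'_id': {'$gte': 20, '$lt': 50}}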
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290116)
Time Spent: 4h 40m  (was: 4.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290112=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290112
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311320245
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
+break
+  bundle_end = min(stop_position, split_key_id)
+  yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
 source=self,
 start_position=bundle_start,
 stop_position=bundle_end)
   bundle_start = bundle_end
+# add range of last split_key to stop_position
+if bundle_start < stop_position:
 
 Review comment:
   Does this work if bundle_start == None ?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290112)
Time Spent: 4h  (was: 3h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290113=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290113
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311319659
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,64 @@ def __init__(self,
 self.filter = filter
 self.projection = projection
 self.spec = extra_client_params
-self.doc_count = self._get_document_count()
-self.avg_doc_size = self._get_avg_document_size()
-self.client = None
 
   def estimate_size(self):
-return self.avg_doc_size * self.doc_count
+with MongoClient(self.uri, **self.spec) as client:
+  size = client[self.db].command('collstats', self.coll).get('size')
+  if size is None or size <= 0:
+raise ValueError('Collection %s not found or total doc size is '
+ 'incorrect' % self.coll)
+  return size
 
   def split(self, desired_bundle_size, start_position=None, 
stop_position=None):
 # use document cursor index as the start and stop positions
 if start_position is None:
-  start_position = 0
+  start_position = self._get_head_document_id(ASCENDING)
 if stop_position is None:
-  stop_position = self.doc_count
+  last_doc_id = self._get_head_document_id(DESCENDING)
+  # increment last doc id binary value by 1 to make sure the last document
+  # is not excluded
+  stop_position = _ObjectIdHelper.increment_id(last_doc_id, 1)
 
-# get an estimate on how many documents should be included in a split batch
-desired_bundle_count = desired_bundle_size // self.avg_doc_size
+desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+split_keys = self._get_split_keys(desired_bundle_size_in_mb, 
start_position,
+  stop_position)
 
 bundle_start = start_position
-while bundle_start < stop_position:
-  bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-  yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+for split_key_id in split_keys:
+  if bundle_start is not None or bundle_start >= stop_position:
 
 Review comment:
   Did you mean "if bundle_start is None or" ? (seems like this loop will just 
end after first iteration).
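
A self-contained, hedged sketch of the loop shape the comment seems to be asking for
(plain tuples stand in for iobase.SourceBundle; this is not the patch's final code):

    # Hedged sketch: emit [start, split1), [split1, split2), ..., [lastSplit, stop).
    def make_bundles(start_position, stop_position, split_keys):
        bundle_start = start_position
        for split_key in split_keys:
            if bundle_start is None or bundle_start >= stop_position:
                break
            bundle_end = min(stop_position, split_key)
            yield (bundle_start, bundle_end)   # stand-in for a SourceBundle
            bundle_start = bundle_end
        # tail range from the last split key up to stop_position
        if bundle_start is not None and bundle_start < stop_position:
            yield (bundle_start, stop_position)

    assert list(make_bundles(0, 10, [3, 6])) == [(0, 3), (3, 6), (6, 10)]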
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290113)
Time Spent: 4h 10m  (was: 4h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290120=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290120
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311330436
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +34,136 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdHelper
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  """Fake mongodb collection cursor."""
+
+  def __init__(self, docs):
+self.docs = docs
+
+  def _filter(self, filter):
+match = []
+if not filter:
+  return self
+start = filter['_id'].get('$gte')
+end = filter['_id'].get('$lt')
+assert start is not None
+assert end is not None
+for doc in self.docs:
+  if start and doc['_id'] < start:
+continue
+  if end and doc['_id'] >= end:
+continue
+  match.append(doc)
+return match
+
+  def find(self, filter=None, **kwargs):
+return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+key, order = sort_items[0]
+self.docs = sorted(self.docs,
+   key=lambda x: x[key],
+   reverse=(order != ASCENDING))
+return self
+
+  def limit(self, num):
+return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+return len(self._filter(filter))
+
+  def __getitem__(self, index):
+return self.docs[index]
+
+
+class _MockMongoDb(object):
+  """Fake Mongo Db."""
+
+  def __init__(self, docs):
+self.docs = docs
+
+  def __getitem__(self, coll_name):
+return _MockMongoColl(self.docs)
+
+  def command(self, command, *args, **kwargs):
+if command == 'collstats':
+  return {'size': 5, 'avgSize': 1}
+elif command == 'splitVector':
+  return self.get_split_key(command, *args, **kwargs)
+
+  def get_split_key(self, command, ns, min, max, maxChunkSize, **kwargs):
+# simulate mongo db splitVector command, return split keys base on chunk
+# size, assuming every doc is of size 1mb
+start_id = min['_id']
 
 Review comment:
   Prob. use different variable names since these override system functions.
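
One hedged way to keep the splitVector-style keyword names at the call boundary without
shadowing Python's built-in min/max inside the body (a hypothetical mock, mirroring the
test's 1 MB-per-document assumption):

    # Hedged sketch: bind the 'min'/'max' keywords to non-shadowing local names.
    def fake_split_vector(ns, keyPattern, maxChunkSize, **bounds):
        start_id = bounds['min']['_id']   # lower bound id
        end_id = bounds['max']['_id']     # upper bound id
        # pretend every document is 1 MB: one split key every maxChunkSize ids
        return {'splitKeys': [{'_id': i}
                              for i in range(start_id + maxChunkSize,
                                             end_id, maxChunkSize)]}

    result = fake_split_vector('db.coll', {'_id': 1}, 2, min={'_id': 0}, max={'_id': 7})
    assert result == {'splitKeys': [{'_id': 2}, {'_id': 4}, {'_id': 6}]}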
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290120)
Time Spent: 5h 20m  (was: 5h 10m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=290119=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290119
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:53
Start Date: 07/Aug/19 00:53
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9233:  
[BEAM-7866] Fix python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r311324914
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +212,110 @@ def display_data(self):
 res['mongo_client_spec'] = self.spec
 return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size_in_mb, start_pos, end_pos):
+# if desired chunk size smaller than 1mb, use mongodb default split size of
+# 1mb
+if desired_chunk_size_in_mb < 1:
+  desired_chunk_size_in_mb = 1
+if start_pos >= end_pos:
+  # single document not splittable
+  return []
 with MongoClient(self.uri, **self.spec) as client:
-  size = client[self.db].command('collstats', self.coll).get('avgObjSize')
-  if size is None or size <= 0:
-raise ValueError(
-'Collection %s not found or average doc size is '
-'incorrect', self.coll)
-  return size
-
-  def _get_document_count(self):
+  name_space = '%s.%s' % (self.db, self.coll)
+  return (client[self.db].command(
+  'splitVector',
+  name_space,
+  keyPattern={'_id': 1},
+  min={'_id': start_pos},
+  max={'_id': end_pos},
+  maxChunkSize=desired_chunk_size_in_mb)['splitKeys'])
+
+  def _merge_id_filter(self, range_tracker):
+all_filters = self.filter.copy()
+if '_id' in all_filters:
+  id_filter = all_filters['_id']
+  id_filter['$gte'] = (
+  max(id_filter['$gte'], range_tracker.start_position())
+  if '$gte' in id_filter else range_tracker.start_position())
+
+  id_filter['$lt'] = (min(id_filter['$lt'], range_tracker.stop_position())
+  if '$lt' in id_filter else
+  range_tracker.stop_position())
+else:
+  all_filters.update({
+  '_id': {
+  '$gte': range_tracker.start_position(),
+  '$lt': range_tracker.stop_position()
+  }
+  })
+return all_filters
+
+  def _get_head_document_id(self, sort_order):
 with MongoClient(self.uri, **self.spec) as client:
-  return max(client[self.db][self.coll].count_documents(self.filter), 0)
+  cursor = client[self.db][self.coll].find(filter={}, projection=[]).sort([
+  ('_id', sort_order)
+  ]).limit(1)
+  try:
+return cursor[0]['_id']
+  except IndexError:
+raise ValueError('Empty Mongodb collection')
+
+
+class _ObjectIdHelper(object):
+  """A Utility class to bson object ids."""
+
+  @classmethod
+  def id_to_int(cls, id):
+# converts object id binary to integer
+# id object is bytes type with size of 12
+ints = struct.unpack('>III', id.binary)
+return (ints[0] << 64) + (ints[1] << 32) + ints[2]
+
+  @classmethod
+  def int_to_id(cls, number):
+# converts integer value to object id. Int value should be less than
+# (2 ^ 96) so it can be convert to 12 bytes required by object id.
+if number < 0 or number >= (1 << 96):
+  raise ValueError('number value must be within [0, %s)' % (1 << 96))
+ints = [(number & 0xffffffff0000000000000000) >> 64,
 
 Review comment:
   Please make sure that these functions are extensively tested using unit 
tests.
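
In that spirit, the round-trip property is easy to pin down in isolation. The sketch
below re-implements the 12-byte-id <-> integer conversion locally purely to show the
kind of unit test meant (it is not the Beam helper itself):

    # Hedged sketch of a round-trip unit test for a 12-byte id <-> int mapping.
    import struct
    import unittest

    def id_bytes_to_int(raw):
        a, b, c = struct.unpack('>III', raw)   # 12 bytes as 3 big-endian uint32
        return (a << 64) + (b << 32) + c

    def int_to_id_bytes(number):
        if number < 0 or number >= (1 << 96):
            raise ValueError('number must be within [0, 2**96)')
        return struct.pack('>III',
                           (number >> 64) & 0xffffffff,
                           (number >> 32) & 0xffffffff,
                           number & 0xffffffff)

    class RoundTripTest(unittest.TestCase):
        def test_round_trip(self):
            for n in (0, 1, (1 << 96) - 1, 0x0123456789abcdef01234567):
                self.assertEqual(id_bytes_to_int(int_to_id_bytes(n)), n)

    if __name__ == '__main__':
        unittest.main()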
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290119)
Time Spent: 5h 10m  (was: 5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in 

[jira] [Commented] (BEAM-7906) Perf regression in SQL Query3 in Dataflow

2019-08-06 Thread Pablo Estrada (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901585#comment-16901585
 ] 

Pablo Estrada commented on BEAM-7906:
-

It's hard to find suspicious changes. One candidate:

https://github.com/apache/beam/pull/9186 - the changes to the worker look 
suspicious, but I'm not sure.

> Perf regression in SQL Query3 in Dataflow
> -
>
> Key: BEAM-7906
> URL: https://issues.apache.org/jira/browse/BEAM-7906
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql, runner-dataflow
>Reporter: Anton Kedin
>Priority: Blocker
> Fix For: 2.15.0
>
> Attachments: dataflow.png, direct.png
>
>
> Nexmark shows perf regression in SQL Query3 starting on July 30 2019: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
> There doesn't seem to be a lot of changes to SQL around that date and the one 
> that was there doesn't seem relevant to the query: 
> https://github.com/apache/beam/commits/master/sdks/java/extensions/sql
> Direct runner shows a slight perf decrease as well: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424 
> while Spark runner doesn't: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> The query in question is a join with a simple filter condition: 
> https://github.com/apache/beam/blob/b8aa8486f336df6fc9cf581f29040194edad3b87/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/sql/SqlQuery3.java#L69
> Other queries don't seem to be affected



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (BEAM-7906) Perf regression in SQL Query3 in Dataflow

2019-08-06 Thread Pablo Estrada (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901565#comment-16901565
 ] 

Pablo Estrada edited comment on BEAM-7906 at 8/7/19 12:38 AM:
--

List of PRs merged on July 29-30:

{code:java}
https://github.com/apache/beam/pulls?utf8=✓=is%3Apr+is%3Aclosed+merged%3A2019-07-29..2019-07-30
{code}



was (Author: pabloem):
List of PRs merged on July 29-30:
https://github.com/apache/beam/pulls?utf8=✓=is%3Apr+is%3Aclosed+merged%3A2019-07-29..2019-07-30

> Perf regression in SQL Query3 in Dataflow
> -
>
> Key: BEAM-7906
> URL: https://issues.apache.org/jira/browse/BEAM-7906
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql, runner-dataflow
>Reporter: Anton Kedin
>Priority: Blocker
> Fix For: 2.15.0
>
> Attachments: dataflow.png, direct.png
>
>
> Nexmark shows perf regression in SQL Query3 starting on July 30 2019: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
> There doesn't seem to be a lot of changes to SQL around that date and the one 
> that was there doesn't seem relevant to the query: 
> https://github.com/apache/beam/commits/master/sdks/java/extensions/sql
> Direct runner shows a slight perf decrease as well: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424 
> while Spark runner doesn't: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> The query in question is a join with a simple filter condition: 
> https://github.com/apache/beam/blob/b8aa8486f336df6fc9cf581f29040194edad3b87/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/sql/SqlQuery3.java#L69
> Other queries don't seem to be affected



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=290105=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290105
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 07/Aug/19 00:35
Start Date: 07/Aug/19 00:35
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-518893274
 
 
   Run Dataflow ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290105)
Time Spent: 0.5h  (was: 20m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.
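
To make the idea concrete, here is a hedged sketch in Python (the JIRA targets the Java
Dataflow runner; this is only an illustration, not the actual override): once a
GroupByKey guarantees that all values for a key arrive together in one bundle, batching
becomes a plain DoFn with a local buffer, with no state or timers needed.

    # Hedged sketch: per-key batching after GroupByKey, without state/timers.
    import apache_beam as beam

    class _BatchAfterGroupByKey(beam.DoFn):
        def __init__(self, batch_size):
            self._batch_size = batch_size

        def process(self, element):
            key, values = element
            batch = []
            for value in values:
                batch.append(value)
                if len(batch) >= self._batch_size:
                    yield (key, batch)
                    batch = []
            if batch:
                yield (key, batch)

    def group_into_batches(pcoll, batch_size):
        # GroupByKey brings every value for a key into a single iterable, so a
        # stateless DoFn can cut it into fixed-size batches.
        return (pcoll
                | beam.GroupByKey()
                | beam.ParDo(_BatchAfterGroupByKey(batch_size)))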



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290077=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290077
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311320777
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve of iBeam users, 
+provide a single place for importing and add sugar syntax for all iBeam
+components. It gives users capability to manipulate existing environment for
+interactive beam, TODO(ningk) run interactive pipeline on selected runner as
+normal pipeline, create pipeline with interactive runner and visualize
+PCollections as bounded dataset.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactively environment such as Jupyter lab or Jupyter Notebook. You can also
+run pipeline created by iBeam as normal Beam pipeline by run_pipeline() with
+desired runners.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
 
 Review comment:
   Pydocs have quite a bit of documentation. I suggest reviewing with @rosetn 
for consistency.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290077)
Time Spent: 40m  (was: 0.5h)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user defined variables in a pipeline when 
> running pipeline with interactive runner in jupyter notebooks.
> [Interactive 
> Beam|[https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in jupyter notebooks.
> Interactive execution is currently supported so that when new transforms are 
> appended to an existing pipeline for a new run, the already-executed part of 
> the pipeline doesn't need to be re-executed. 
> A PCollection is "leaf" when it is never used as input in any PTransform in 
> the pipeline.
> The problem with building caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph that generates and 
> consumes that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching PCollections bound to user-defined variables, and replacing 
> transforms with cache sources and sinks, would let the pipeline to execute be 
> resolved properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and used for data 
> visualization if the user wants it.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>   options=pipeline_options)
> messages = p | "Read" >> 

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290078=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290078
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311320489
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve of iBeam users, 
+provide a single place for importing and add sugar syntax for all iBeam
+components. It gives users capability to manipulate existing environment for
+interactive beam, TODO(ningk) run interactive pipeline on selected runner as
+normal pipeline, create pipeline with interactive runner and visualize
+PCollections as bounded dataset.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactively environment such as Jupyter lab or Jupyter Notebook. You can also
+run pipeline created by iBeam as normal Beam pipeline by run_pipeline() with
+desired runners.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  iBeam. However, if your Beam pipeline is defined in some module other than
+  __main__, e.g., inside a class function or a unit test, you can watch() the
+  scope to instruct iBeam to apply magic to your pipeline when running pipeline
+  interactively.
+
+For example:
+
+class Foo(object)
+  def build_pipeline(self):
+p = create_pipeline()
+init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+watch(locals())
+return p
+Foo().build_pipeline().run()
+
+iBeam will cache init_pcoll for the first run. You can use:
+
+visualize(init_pcoll)
+
+To visualize data from init_pcoll once the pipeline is executed. And if you
+make change to the original pipeline by adding:
+
+squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+When you re-run the pipeline from the line you just added, squares will
+use the init_pcoll data cached so you can have an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), a str name of a module, a module object or an instance of a class.
+  The variable can come from any scope even local variables in a method of a
+  class defined in a module.
+
+Below are all valid:
+
+watch(__main__)  # if import __main__ is already invoked
+watch('__main__')  # does not require invoking import __main__ beforehand
+watch(self)  # inside a class
+watch(SomeInstance())  # an instance of a class
+watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
+  """Creates a pipeline with interactive runner by default.
+
+  You can use run_pipeline() provided within this module to execute the iBeam
+  pipeline with other runners.
+
+  Args:
+runner (~apache_beam.runners.runner.PipelineRunner): An object of
+  type :class:`~apache_beam.runners.runner.PipelineRunner` that will be
+  used to execute the pipeline. For registered runners, the runner name
+  can be specified, otherwise a runner object must be supplied.
+options (~apache_beam.options.pipeline_options.PipelineOptions):
+  A configured
+  

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290075=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290075
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311320224
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve of iBeam users, 
+provide a single place for importing and add sugar syntax for all iBeam
+components. It gives users capability to manipulate existing environment for
+interactive beam, TODO(ningk) run interactive pipeline on selected runner as
+normal pipeline, create pipeline with interactive runner and visualize
+PCollections as bounded dataset.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactively environment such as Jupyter lab or Jupyter Notebook. You can also
+run pipeline created by iBeam as normal Beam pipeline by run_pipeline() with
+desired runners.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  iBeam. However, if your Beam pipeline is defined in some module other than
+  __main__, e.g., inside a class function or a unit test, you can watch() the
+  scope to instruct iBeam to apply magic to your pipeline when running pipeline
+  interactively.
+
+For example:
+
+class Foo(object)
+  def build_pipeline(self):
+p = create_pipeline()
+init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+watch(locals())
+return p
+Foo().build_pipeline().run()
+
+iBeam will cache init_pcoll for the first run. You can use:
+
+visualize(init_pcoll)
+
+To visualize data from init_pcoll once the pipeline is executed. And if you
+make change to the original pipeline by adding:
+
+squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+When you re-run the pipeline from the line you just added, squares will
+use the init_pcoll data cached so you can have an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), a str name of a module, a module object or an instance of a class.
+  The variable can come from any scope even local variables in a method of a
+  class defined in a module.
+
+Below are all valid:
+
+watch(__main__)  # if import __main__ is already invoked
+watch('__main__')  # does not require invoking import __main__ beforehand
+watch(self)  # inside a class
+watch(SomeInstance())  # an instance of a class
+watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
+  """Creates a pipeline with interactive runner by default.
+
+  You can use run_pipeline() provided within this module to execute the iBeam
+  pipeline with other runners.
+
+  Args:
+runner (~apache_beam.runners.runner.PipelineRunner): An object of
+  type :class:`~apache_beam.runners.runner.PipelineRunner` that will be
+  used to execute the pipeline. For registered runners, the runner name
+  can be specified, otherwise a runner object must be supplied.
+options (~apache_beam.options.pipeline_options.PipelineOptions):
+  A configured
+  

[jira] [Commented] (BEAM-7906) Perf regression in SQL Query3 in Dataflow

2019-08-06 Thread Pablo Estrada (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901565#comment-16901565
 ] 

Pablo Estrada commented on BEAM-7906:
-

List of PRs merged on July 29-30:
https://github.com/apache/beam/pulls?utf8=✓=is%3Apr+is%3Aclosed+merged%3A2019-07-29..2019-07-30

> Perf regression in SQL Query3 in Dataflow
> -
>
> Key: BEAM-7906
> URL: https://issues.apache.org/jira/browse/BEAM-7906
> Project: Beam
>  Issue Type: Bug
>  Components: dsl-sql, runner-dataflow
>Reporter: Anton Kedin
>Priority: Blocker
> Fix For: 2.15.0
>
> Attachments: dataflow.png, direct.png
>
>
> Nexmark shows perf regression in SQL Query3 starting on July 30 2019: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
> There doesn't seem to be a lot of changes to SQL around that date and the one 
> that was there doesn't seem relevant to the query: 
> https://github.com/apache/beam/commits/master/sdks/java/extensions/sql
> Direct runner shows a slight perf decrease as well: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424 
> while Spark runner doesn't: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> The query in question is a join with a simple filter condition: 
> https://github.com/apache/beam/blob/b8aa8486f336df6fc9cf581f29040194edad3b87/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/sql/SqlQuery3.java#L69
> Other queries don't seem to be affected



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290076=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290076
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311320074
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve of iBeam users, 
+provide a single place for importing and add sugar syntax for all iBeam
+components. It gives users capability to manipulate existing environment for
+interactive beam, TODO(ningk) run interactive pipeline on selected runner as
+normal pipeline, create pipeline with interactive runner and visualize
+PCollections as bounded dataset.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactively environment such as Jupyter lab or Jupyter Notebook. You can also
+run pipeline created by iBeam as normal Beam pipeline by run_pipeline() with
+desired runners.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write Beam pipeline in a notebook or __main__ module directly, since
+  __main__ module is always watched by default, you don't have to instruct
+  iBeam. However, if your Beam pipeline is defined in some module other than
+  __main__, e.g., inside a class function or a unit test, you can watch() the
+  scope to instruct iBeam to apply magic to your pipeline when running pipeline
+  interactively.
+
+For example:
+
+class Foo(object)
+  def build_pipeline(self):
+p = create_pipeline()
+init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+watch(locals())
+return p
+Foo().build_pipeline().run()
+
+iBeam will cache init_pcoll for the first run. You can use:
+
+visualize(init_pcoll)
+
+To visualize data from init_pcoll once the pipeline is executed. And if you
+make change to the original pipeline by adding:
+
+squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+When you re-run the pipeline from the line you just added, squares will
+use the init_pcoll data cached so you can have an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), a str name of a module, a module object or an instance of a class.
+  The variable can come from any scope even local variables in a method of a
+  class defined in a module.
+
+Below are all valid:
+
+watch(__main__)  # if import __main__ is already invoked
+watch('__main__')  # does not require invoking import __main__ beforehand
+watch(self)  # inside a class
+watch(SomeInstance())  # an instance of a class
+watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
+  """Creates a pipeline with interactive runner by default.
+
+  You can use run_pipeline() provided within this module to execute the iBeam
+  pipeline with other runners.
+
+  Args:
+runner (~apache_beam.runners.runner.PipelineRunner): An object of
+  type :class:`~apache_beam.runners.runner.PipelineRunner` that will be
+  used to execute the pipeline. For registered runners, the runner name
+  can be specified, otherwise a runner object must be supplied.
+options (~apache_beam.options.pipeline_options.PipelineOptions):
+  A configured
+  

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290080=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290080
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311320333
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve of iBeam users, 
+provide a single place for importing and add sugar syntax for all iBeam
+components. It gives users capability to manipulate existing environment for
+interactive beam, TODO(ningk) run interactive pipeline on selected runner as
+normal pipeline, create pipeline with interactive runner and visualize
+PCollections as bounded dataset.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactively environment such as Jupyter lab or Jupyter Notebook. You can also
+run pipeline created by iBeam as normal Beam pipeline by run_pipeline() with
+desired runners.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write your Beam pipeline directly in a notebook or in the __main__
+  module, you don't have to instruct iBeam, since the __main__ module is always
+  watched by default. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class method or a unit test, you can
+  watch() that scope to instruct iBeam to apply its magic to your pipeline when
+  running it interactively.
+
+For example:
+
+class Foo(object):
+  def build_pipeline(self):
+p = create_pipeline()
+init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+watch(locals())
+return p
+Foo().build_pipeline().run()
+
+iBeam will cache init_pcoll for the first run. You can use:
+
+visualize(init_pcoll)
+
+to visualize data from init_pcoll once the pipeline is executed. If you then
+change the original pipeline by adding:
+
+squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+and re-run the pipeline from the line you just added, squares will reuse the
+cached init_pcoll data, giving you an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), a str name of a module, a module object or an instance of a class.
+  The variable can come from any scope, even local variables in a method of a
+  class defined in a module.
+
+Below are all valid:
+
+watch(__main__)  # if import __main__ is already invoked
+watch('__main__')  # does not require invoking import __main__ beforehand
+watch(self)  # inside a class
+watch(SomeInstance())  # an instance of a class
+watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
+  """Creates a pipeline with interactive runner by default.
+
+  You can use run_pipeline() provided within this module to execute the iBeam
+  pipeline with other runners.
+
+  Args:
+runner (~apache_beam.runners.runner.PipelineRunner): An object of
+  type :class:`~apache_beam.runners.runner.PipelineRunner` that will be
+  used to execute the pipeline. For registered runners, the runner name
+  can be specified, otherwise a runner object must be supplied.
+options (~apache_beam.options.pipeline_options.PipelineOptions):
+  A configured
+  

[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290079=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290079
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:57
Start Date: 06/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: aaltay commented on pull request #9278: [BEAM-7760] 
Added iBeam module
URL: https://github.com/apache/beam/pull/9278#discussion_r311319990
 
 

 ##
 File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
 ##
 @@ -0,0 +1,199 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module of the current iBeam (interactive Beam) environment.
+
+The purpose of the module is to reduce the learning curve for iBeam users,
+provide a single place for imports and add syntactic sugar for all iBeam
+components. It gives users the capability to manipulate the existing
+environment for interactive Beam, TODO(ningk) run an interactive pipeline on a
+selected runner as a normal pipeline, create a pipeline with the interactive
+runner and visualize PCollections as bounded datasets.
+
+Note: iBeam works the same as normal Beam with DirectRunner when not in an
+interactive environment such as JupyterLab or Jupyter Notebook. You can also
+run a pipeline created by iBeam as a normal Beam pipeline by calling
+run_pipeline() with the desired runner.
+"""
+
+import importlib
+
+import apache_beam as beam
+from apache_beam.runners.interactive import interactive_runner
+
+_ibeam_env = None
+
+
+def watch(watchable):
+  """Watches a watchable so that iBeam can understand your pipeline.
+
+  If you write your Beam pipeline directly in a notebook or in the __main__
+  module, you don't have to instruct iBeam, since the __main__ module is always
+  watched by default. However, if your Beam pipeline is defined in some module
+  other than __main__, e.g., inside a class method or a unit test, you can
+  watch() that scope to instruct iBeam to apply its magic to your pipeline when
+  running it interactively.
+
+For example:
+
+class Foo(object):
+  def build_pipeline(self):
+p = create_pipeline()
+init_pcoll = p |  'Init Create' >> beam.Create(range(10))
+watch(locals())
+return p
+Foo().build_pipeline().run()
+
+iBeam will cache init_pcoll for the first run. You can use:
+
+visualize(init_pcoll)
+
+to visualize data from init_pcoll once the pipeline is executed. If you then
+change the original pipeline by adding:
+
+squares = init_pcoll | 'Square' >> beam.Map(lambda x: x*x)
+
+and re-run the pipeline from the line you just added, squares will reuse the
+cached init_pcoll data, giving you an interactive experience.
+
+  Currently the implementation mainly watches for PCollection variables defined
+  in user code. A watchable can be a dictionary of variable metadata such as
+  locals(), a str name of a module, a module object or an instance of a class.
+  The variable can come from any scope, even local variables in a method of a
+  class defined in a module.
+
+Below are all valid:
+
+watch(__main__)  # if import __main__ is already invoked
+watch('__main__')  # does not require invoking import __main__ beforehand
+watch(self)  # inside a class
+watch(SomeInstance())  # an instance of a class
+watch(locals())  # inside a function, watching local variables within
+  """
+  current_env().watch(watchable)
+
+
+def create_pipeline(runner=None, options=None, argv=None):
 
 Review comment:
   This is not really very different from beam.Pipeline(...). It might be 
better to avoid adding it. Using existing constructs will have the advantage of 
existing documentation and user knowledge, and users can copy contents out of 
notebooks without much change.
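
   For context, the existing construct the reviewer refers to already covers this use case. A minimal sketch, assuming default PipelineOptions (the options shown are illustrative, not taken from the PR):

{code:python}
# Minimal sketch: construct the pipeline with the existing beam.Pipeline(...)
# constructor and pass the interactive runner explicitly, instead of adding a
# new create_pipeline() wrapper.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.interactive import interactive_runner

options = PipelineOptions()
p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options)
init_pcoll = p | 'Init Create' >> beam.Create(range(10))
result = p.run()
{code}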
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking

[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=290066=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290066
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:37
Start Date: 06/Aug/19 23:37
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518882354
 
 
   Run Python 3.7 PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290066)
Time Spent: 1h 20m  (was: 1h 10m)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor since some projects defined separate build 
> scripts under the source directory. Thus, it will be better to standardize how 
> we use Gradle. This will help to manage Python tests/builds/tasks across 
> different versions and runners, and also make it easier for people to learn/use/develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=290064=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290064
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:32
Start Date: 06/Aug/19 23:32
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518881390
 
 
   Run Python 2 PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290064)
Time Spent: 1h 10m  (was: 1h)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor since some projects defined separate build 
> scripts under the source directory. Thus, it will be better to standardize how 
> we use Gradle. This will help to manage Python tests/builds/tasks across 
> different versions and runners, and also make it easier for people to learn/use/develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=290062=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290062
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 23:31
Start Date: 06/Aug/19 23:31
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518881084
 
 
   beam_PreCommit_Python_Commit 
[#7937](https://builds.apache.org/job/beam_PreCommit_Python_Commit/7937/) 
passed. Tox tests are fixed.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290062)
Time Spent: 1h  (was: 50m)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor since some projects defined separate build 
> scripts under the source directory. Thus, it will be better to standardize how 
> we use Gradle. This will help to manage Python tests/builds/tasks across 
> different versions and runners, and also make it easier for people to learn/use/develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7760) Interactive Beam Caching PCollections bound to user defined vars in notebook

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7760?focusedWorklogId=290040=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290040
 ]

ASF GitHub Bot logged work on BEAM-7760:


Author: ASF GitHub Bot
Created on: 06/Aug/19 22:49
Start Date: 06/Aug/19 22:49
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9278: [BEAM-7760] Added 
iBeam module
URL: https://github.com/apache/beam/pull/9278#issuecomment-518872095
 
 
   R: @aaltay 
   retest this please
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290040)
Time Spent: 20m  (was: 10m)

> Interactive Beam Caching PCollections bound to user defined vars in notebook
> 
>
> Key: BEAM-7760
> URL: https://issues.apache.org/jira/browse/BEAM-7760
> Project: Beam
>  Issue Type: New Feature
>  Components: examples-python
>Reporter: Ning Kang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Cache only PCollections bound to user-defined variables in a pipeline when 
> running the pipeline with the interactive runner in Jupyter notebooks.
> [Interactive 
> Beam|https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/interactive]
>  has been caching and using caches of "leaf" PCollections for interactive 
> execution in Jupyter notebooks.
> Interactive execution is currently supported so that when new transforms are 
> appended to an existing pipeline for a new run, the already executed part of 
> the pipeline doesn't need to be re-executed. 
> A PCollection is a "leaf" when it is never used as input to any PTransform in 
> the pipeline.
> The problem with building the caches and the pipeline to execute around "leaf" 
> PCollections is that when a PCollection is consumed by a sink with no output, 
> the pipeline built for execution will miss the subgraph generating and 
> consuming that PCollection.
> For example, "ReadFromPubSub --> WriteToPubSub" will result in an empty 
> pipeline.
> Caching PCollections bound to user-defined variables and replacing transforms 
> with sources and sinks backed by those caches could resolve the pipeline to 
> execute properly under the interactive execution scenario. Also, a cached 
> PCollection can now be traced back to user code and can be used for user data 
> visualization if the user wants to do so.
> E.g.,
> {code:java}
> // ...
> p = beam.Pipeline(interactive_runner.InteractiveRunner(),
>   options=pipeline_options)
> messages = p | "Read" >> beam.io.ReadFromPubSub(subscription='...')
> messages | "Write" >> beam.io.WriteToPubSub(topic_path)
> result = p.run()
> // ...
> visualize(messages){code}
>  The interactive runner automatically figures out that PCollection
> {code:java}
> messages{code}
> created by
> {code:java}
> p | "Read" >> beam.io.ReadFromPubSub(subscription='...'){code}
> should be cached and reused if the notebook user appends more transforms.
>  Once the pipeline gets executed, the user could use any 
> visualize(PCollection) module to visualize the data statically (batch) or 
> dynamically (stream).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7899) beam_PostCommit_Py_VR_Dataflow failure: build/apache-beam.tar.gz cannot be found

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7899?focusedWorklogId=290033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290033
 ]

ASF GitHub Bot logged work on BEAM-7899:


Author: ASF GitHub Bot
Created on: 06/Aug/19 22:35
Start Date: 06/Aug/19 22:35
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on pull request #9269: 
[BEAM-7899] Fix Python Dataflow VR tests by specify sdk_location
URL: https://github.com/apache/beam/pull/9269
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290033)
Time Spent: 1h 10m  (was: 1h)

> beam_PostCommit_Py_VR_Dataflow failure: build/apache-beam.tar.gz cannot be 
> found
> 
>
> Key: BEAM-7899
> URL: https://issues.apache.org/jira/browse/BEAM-7899
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Udi Meiri
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Lots of tests are consistently failing with errors like:
> {code}
> 09:56:15 
> ==
> 09:56:15 ERROR: test_as_list_twice 
> (apache_beam.transforms.sideinputs_test.SideInputsTest)
> 09:56:15 
> --
> 09:56:15 Traceback (most recent call last):
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/transforms/sideinputs_test.py",
>  line 274, in test_as_list_twice
> 09:56:15 pipeline.run()
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/testing/test_pipeline.py",
>  line 107, in run
> 09:56:15 else test_runner_api))
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/pipeline.py",
>  line 406, in run
> 09:56:15 self._options).run(False)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/pipeline.py",
>  line 419, in run
> 09:56:15 return self.runner.run_pipeline(self, self._options)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py",
>  line 53, in run_pipeline
> 09:56:15 pipeline, options)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py",
>  line 484, in run_pipeline
> 09:56:15 self.dataflow_client.create_job(self.job), self)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/utils/retry.py",
>  line 206, in wrapper
> 09:56:15 return fun(*args, **kwargs)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py",
>  line 530, in create_job
> 09:56:15 self.create_job_description(job)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py",
>  line 560, in create_job_description
> 09:56:15 resources = self._stage_resources(job.options)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py",
>  line 490, in _stage_resources
> 09:56:15 staging_location=google_cloud_options.staging_location)
> 09:56:15   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_VR_Dataflow/src/sdks/python/apache_beam/runners/portability/stager.py",
>  line 273, in stage_job_resources
> 09:56:15 'the --sdk_location command-line option.' % sdk_path)
> 09:56:15 RuntimeError: The file "build/apache-beam.tar.gz" cannot be found. 
> Its location was specified by the --sdk_location command-line option.
> {code}
> https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/4207/console
> [~markflyhigh]
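
Per the PR title, the fix is to specify the sdk_location explicitly rather than rely on the default build/apache-beam.tar.gz path. A minimal sketch of setting that option from code (the tarball path is illustrative):

{code:python}
# Sketch: point the stager at an explicitly built SDK tarball so it does not
# look for the default build/apache-beam.tar.gz. The path is illustrative.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
options.view_as(SetupOptions).sdk_location = 'sdks/python/build/apache-beam.tar.gz'
{code}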



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7915) show cross-language validate runner Flink badge on github PR template

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7915?focusedWorklogId=290028=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290028
 ]

ASF GitHub Bot logged work on BEAM-7915:


Author: ASF GitHub Bot
Created on: 06/Aug/19 22:09
Start Date: 06/Aug/19 22:09
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #9282: [BEAM-7915] show 
cross-language validate runner Flink badge on github PR template
URL: https://github.com/apache/beam/pull/9282#issuecomment-518862760
 
 
   R: @aaltay 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290028)
Time Spent: 20m  (was: 10m)

> show cross-language validate runner Flink badge on github PR template
> -
>
> Key: BEAM-7915
> URL: https://issues.apache.org/jira/browse/BEAM-7915
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> show cross-language validate runner Flink badge on github template



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7915) show cross-language validate runner Flink badge on github PR template

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7915?focusedWorklogId=290027=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290027
 ]

ASF GitHub Bot logged work on BEAM-7915:


Author: ASF GitHub Bot
Created on: 06/Aug/19 22:08
Start Date: 06/Aug/19 22:08
Worklog Time Spent: 10m 
  Work Description: ihji commented on pull request #9282: [BEAM-7915] show 
cross-language validate runner Flink badge on github PR template
URL: https://github.com/apache/beam/pull/9282
 
 
   show cross-language validate runner Flink badge on github PR template
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)[![Build
 

[jira] [Updated] (BEAM-7915) show cross-language validate runner Flink badge on github PR template

2019-08-06 Thread Heejong Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heejong Lee updated BEAM-7915:
--
Summary: show cross-language validate runner Flink badge on github PR 
template  (was: show cross-language validate runner Flink badge on github 
template)

> show cross-language validate runner Flink badge on github PR template
> -
>
> Key: BEAM-7915
> URL: https://issues.apache.org/jira/browse/BEAM-7915
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>
> show cross-language validate runner Flink badge on github template



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7915) show cross-language validate runner Flink badge on github template

2019-08-06 Thread Heejong Lee (JIRA)
Heejong Lee created BEAM-7915:
-

 Summary: show cross-language validate runner Flink badge on github 
template
 Key: BEAM-7915
 URL: https://issues.apache.org/jira/browse/BEAM-7915
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Heejong Lee


show cross-language validate runner Flink badge on github template



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (BEAM-7915) show cross-language validate runner Flink badge on github template

2019-08-06 Thread Heejong Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heejong Lee reassigned BEAM-7915:
-

Assignee: Heejong Lee

> show cross-language validate runner Flink badge on github template
> --
>
> Key: BEAM-7915
> URL: https://issues.apache.org/jira/browse/BEAM-7915
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>
> show cross-language validate runner Flink badge on github template



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7862) Local testing utilities for BigQuery pipelines in releases

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7862?focusedWorklogId=290022=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290022
 ]

ASF GitHub Bot logged work on BEAM-7862:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:56
Start Date: 06/Aug/19 21:56
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9206: [BEAM-7862] 
Moving FakeBigQueryServices to published artifacts
URL: https://github.com/apache/beam/pull/9206
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290022)
Time Spent: 1h 20m  (was: 1h 10m)

> Local testing utilities for BigQuery pipelines in releases
> --
>
> Key: BEAM-7862
> URL: https://issues.apache.org/jira/browse/BEAM-7862
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7856) BigQuery table creation race condition error when executing pipeline on multiple workers

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7856?focusedWorklogId=290019=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290019
 ]

ASF GitHub Bot logged work on BEAM-7856:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:52
Start Date: 06/Aug/19 21:52
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9204: 
[BEAM-7856] Suppress error on table bigquery table already exists
URL: https://github.com/apache/beam/pull/9204#discussion_r311291293
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py
 ##
 @@ -659,12 +659,19 @@ def get_or_create_table(
 if found_table and write_disposition != BigQueryDisposition.WRITE_TRUNCATE:
   return found_table
 else:
-  created_table = self._create_table(
-  project_id=project_id,
-  dataset_id=dataset_id,
-  table_id=table_id,
-  schema=schema or found_table.schema,
-  additional_parameters=additional_create_parameters)
+  created_table = None
+  try:
+created_table = self._create_table(
+project_id=project_id,
+dataset_id=dataset_id,
+table_id=table_id,
+schema=schema or found_table.schema,
+additional_parameters=additional_create_parameters)
+  except HttpError as exn:
+if exn.status_code == 409:
 
 Review comment:
   Instead of suppressing the error, can we move table creation to a step that 
precedes writing? That seems cleaner.
   
   Java SDK seems to be doing something like this based on a quick look: 
https://github.com/apache/beam/blob/08d0146791e38be4641ff80ffb2539cdc81f5b6d/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingInserts.java#L178
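
   A rough sketch of that alternative on the Python side, assuming a hypothetical create_table_if_needed() helper in place of the _create_table(...) call quoted above (the table names are illustrative):

{code:python}
# Sketch only: create the destination table once, in a step that precedes the
# write, so workers never race on table creation and no 409 needs suppressing.
# create_table_if_needed() is a hypothetical placeholder, not an existing API.
import apache_beam as beam


def create_table_if_needed(seed, project_id, dataset_id, table_id):
  # Hypothetical: issue a single idempotent "create table if not exists"
  # request here, then pass the seed element through as a readiness signal.
  return seed


with beam.Pipeline() as p:
  table_ready = (
      p
      | 'Seed' >> beam.Create([None])
      | 'CreateTable' >> beam.Map(
          create_table_if_needed,
          project_id='my-project',
          dataset_id='my_dataset',
          table_id='my_table'))

  rows = p | 'Rows' >> beam.Create([{'id': 1}, {'id': 2}])

  # The write step waits on the creation step via a side input, so rows are
  # only written once the table is known to exist.
  _ = rows | 'Write' >> beam.Map(
      lambda row, ready: row, ready=beam.pvalue.AsSingleton(table_ready))
{code}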
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290019)
Time Spent: 50m  (was: 40m)

> BigQuery table creation race condition error when executing pipeline on 
> multiple workers
> 
>
> Key: BEAM-7856
> URL: https://issues.apache.org/jira/browse/BEAM-7856
> Project: Beam
>  Issue Type: Bug
>  Components: io-python-gcp
>Reporter: Ankur Goenka
>Assignee: Ankur Goenka
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is a non-fatal issue and just prints an error in the logs as far as I can 
> tell.
> The issue is that we check for and create the BigQuery table on multiple 
> workers at the same time, which causes the race condition.
>  
> {noformat}
> File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 157, in _execute response = task() File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 190, in  self._execute(lambda: worker.do_instruction(work), 
> work) File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 342, in do_instruction request.instruction_id) File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 368, in process_bundle bundle_processor.process_bundle(instruction_id)) 
> File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle data.ptransform_id].process_encoded(data.data) 
> File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py",
>  line 143, in process_encoded self.output(decoded_value) File 
> "apache_beam/runners/worker/operations.py", line 255, in 
> apache_beam.runners.worker.operations.Operation.output def output(self, 
> windowed_value, output_index=0): File 
> "apache_beam/runners/worker/operations.py", line 256, in 
> apache_beam.runners.worker.operations.Operation.output cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value) File 
> "apache_beam/runners/worker/operations.py", line 143, in 
> apache_beam.runners.worker.operations.SingletonConsumerSet.receive 
> self.consumer.process(windowed_value) File 
> "apache_beam/runners/worker/operations.py", line 593, in 
> apache_beam.runners.worker.operations.DoOperation.process with 
> self.scoped_process_state: File "apache_beam/runners/worker/operations.py", 
> line 594, in apache_beam.runners.worker.operations.DoOperation.process 
> delayed_application = self.dofn_receiver.receive(o) File 
> "apache_beam/runners/common.py", line 799, in 
> 

[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=290017=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290017
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:50
Start Date: 06/Aug/19 21:50
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518857405
 
 
   Great to see that `ModuleNotFoundError` disappeared in 
beam_PostCommit_Py_VR_Dataflow_PR[#119](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_PR/119/)
 and that only one test failed (among 8 suites), which seems to be caused by 
pip service flakiness. 
   
   I'll go ahead and fix the tox breakage and also run the postcommit.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290017)
Time Spent: 40m  (was: 0.5h)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor since some projects defined separate build 
> scripts under the source directory. Thus, it will be better to standardize how 
> we use Gradle. This will help to manage Python tests/builds/tasks across 
> different versions and runners, and also make it easier for people to learn/use/develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=290018=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290018
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:50
Start Date: 06/Aug/19 21:50
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518857405
 
 
   Glad to see that `ModuleNotFoundError` disappeared in 
beam_PostCommit_Py_VR_Dataflow_PR[#119](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_PR/119/)
 and that only one test failed (among 8 suites), which seems to be caused by 
pip service flakiness. 
   
   I'll go ahead and fix the tox breakage and also run the postcommit.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290018)
Time Spent: 50m  (was: 40m)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor since some projects defined separate build 
> scripts under the source directory. Thus, it will be better to standardize how 
> we use Gradle. This will help to manage Python tests/builds/tasks across 
> different versions and runners, and also make it easier for people to learn/use/develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7862) Local testing utilities for BigQuery pipelines in releases

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7862?focusedWorklogId=290016=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290016
 ]

ASF GitHub Bot logged work on BEAM-7862:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:43
Start Date: 06/Aug/19 21:43
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9206: [BEAM-7862] 
Moving FakeBigQueryServices to published artifacts
URL: https://github.com/apache/beam/pull/9206#discussion_r311288496
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java
 ##
 @@ -269,6 +269,12 @@ public static Schema fromTableSchema(TableSchema 
tableSchema) {
 return fromTableFieldSchema(tableSchema.getFields());
   }
 
+  /** Convert a list of BigQuery {@link TableFieldSchema} to Avro {@link 
org.apache.avro.Schema}. */
+  public static org.apache.avro.Schema toGenericAvroSchema(
 
 Review comment:
   This call works as a proxy to `BigQueryAvroUtils`, which is not a public 
class, so it's not accessible from other packages. I did not want to make the 
whole class public, so instead I added this call to `BigQueryUtils`, which is a 
public class with similar calls for schema conversions. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290016)
Time Spent: 1h 10m  (was: 1h)

> Local testing utilities for BigQuery pipelines in releases
> --
>
> Key: BEAM-7862
> URL: https://issues.apache.org/jira/browse/BEAM-7862
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7862) Local testing utilities for BigQuery pipelines in releases

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7862?focusedWorklogId=290012=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290012
 ]

ASF GitHub Bot logged work on BEAM-7862:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:38
Start Date: 06/Aug/19 21:38
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #9206: [BEAM-7862] 
Moving FakeBigQueryServices to published artifacts
URL: https://github.com/apache/beam/pull/9206#discussion_r311286618
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java
 ##
 @@ -269,6 +269,12 @@ public static Schema fromTableSchema(TableSchema 
tableSchema) {
 return fromTableFieldSchema(tableSchema.getFields());
   }
 
+  /** Convert a list of BigQuery {@link TableFieldSchema} to Avro {@link 
org.apache.avro.Schema}. */
+  public static org.apache.avro.Schema toGenericAvroSchema(
 
 Review comment:
   Why add an extra call?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290012)
Time Spent: 1h  (was: 50m)

> Local testing utilities for BigQuery pipelines in releases
> --
>
> Key: BEAM-7862
> URL: https://issues.apache.org/jira/browse/BEAM-7862
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=290010=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290010
 ]

ASF GitHub Bot logged work on BEAM-7495:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:36
Start Date: 06/Aug/19 21:36
Worklog Time Spent: 10m 
  Work Description: kmjung commented on issue #9156: [BEAM-7495] Add 
fine-grained progress reporting
URL: https://github.com/apache/beam/pull/9156#issuecomment-518853370
 
 
   Please hold off on merging this for now. It sounds like Aryan is concerned
   that there may be an issue related to splitting.
   
   On Tue, Aug 6, 2019 at 2:26 PM Chamikara Jayalath 
   wrote:
   
   > Run Java PreCommit
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290010)
Time Spent: 11h 20m  (was: 11h 10m)
Remaining Estimate: 492h 40m  (was: 492h 50m)

> Add support for dynamic worker re-balancing when reading BigQuery data using 
> Cloud Dataflow
> ---
>
> Key: BEAM-7495
> URL: https://issues.apache.org/jira/browse/BEAM-7495
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Aryan Naraghi
>Assignee: Aryan Naraghi
>Priority: Major
>   Original Estimate: 504h
>  Time Spent: 11h 20m
>  Remaining Estimate: 492h 40m
>
> Currently, the BigQuery connector for reading data using the BigQuery Storage 
> API does not support any of the facilities on the source that Dataflow needs 
> to split streams.
>  
> On the server side, the BigQuery Storage API supports splitting streams at a 
> fraction. By adding support to the connector, we enable Dataflow to split 
> streams, which unlocks dynamic worker re-balancing.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=290007=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290007
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:28
Start Date: 06/Aug/19 21:28
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #8174: 
[BEAM-6683] add createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290007)
Time Spent: 23.5h  (was: 23h 20m)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 23.5h
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7855) Support logical types when translating the portable row coder representation

2019-08-06 Thread Yueyang Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901479#comment-16901479
 ] 

Yueyang Qiu commented on BEAM-7855:
---

I am wondering if we can create a "Component" for schema-related issues. If not, 
we should at least add them under an umbrella bug.

> Support logical types when translating the portable row coder representation
> 
>
> Key: BEAM-7855
> URL: https://issues.apache.org/jira/browse/BEAM-7855
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Brian Hulette
>Priority: Major
>
> Originally SchemaCoder and RowCoder relied on including serialized Java 
> classes and functions to support logical types, but the [portable schema 
> representation|https://s.apache.org/beam-schema] only includes a URN for a 
> logical type. We will need to be able to reconstruct logical types given 
> just a URN, presumably by creating some sort of logical type registry indexed 
> by URN.
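
A minimal sketch of the registry idea, written in Python for brevity even though the issue targets sdk-java-core; the class name and URN below are illustrative, not an existing Beam API:

{code:python}
# Illustrative only: a URN-indexed registry that a schema translator could
# consult when it encounters a logical type in a portable schema payload.
class LogicalTypeRegistry(object):

  def __init__(self):
    self._types_by_urn = {}

  def register(self, urn, logical_type):
    self._types_by_urn[urn] = logical_type

  def get(self, urn):
    if urn not in self._types_by_urn:
      raise KeyError('No logical type registered for URN %r' % urn)
    return self._types_by_urn[urn]


# Usage: register implementations once, then resolve by URN while decoding.
registry = LogicalTypeRegistry()
registry.register('beam:logical_type:example:v1', str)  # illustrative URN
assert registry.get('beam:logical_type:example:v1') is str
{code}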



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=290005=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290005
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:27
Start Date: 06/Aug/19 21:27
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-518850800
 
 
   LGTM. Thanks.
   
   I'll go ahead and merge since a JIRA was created for the Python3 issue.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290005)
Time Spent: 23h 20m  (was: 23h 10m)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 23h 20m
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=290003=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290003
 ]

ASF GitHub Bot logged work on BEAM-7495:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:25
Start Date: 06/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on issue #9156: [BEAM-7495] Add 
fine-grained progress reporting
URL: https://github.com/apache/beam/pull/9156#issuecomment-518850157
 
 
   Run Java PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290003)
Time Spent: 11h 10m  (was: 11h)
Remaining Estimate: 492h 50m  (was: 493h)

> Add support for dynamic worker re-balancing when reading BigQuery data using 
> Cloud Dataflow
> ---
>
> Key: BEAM-7495
> URL: https://issues.apache.org/jira/browse/BEAM-7495
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Aryan Naraghi
>Assignee: Aryan Naraghi
>Priority: Major
>   Original Estimate: 504h
>  Time Spent: 11h 10m
>  Remaining Estimate: 492h 50m
>
> Currently, the BigQuery connector for reading data using the BigQuery Storage 
> API does not support any of the facilities on the source that Dataflow needs 
> to split streams.
>  
> On the server side, the BigQuery Storage API supports splitting streams at a 
> fraction. By adding support to the connector, we enable Dataflow to split 
> streams, which unlocks dynamic worker re-balancing.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=290001=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290001
 ]

ASF GitHub Bot logged work on BEAM-7495:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:25
Start Date: 06/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9156: 
[BEAM-7495] Add fine-grained progress reporting
URL: https://github.com/apache/beam/pull/9156#discussion_r311282205
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryStorageStreamSource.java
 ##
 @@ -210,20 +219,50 @@ private synchronized boolean readNextRecord() throws 
IOException {
   return false;
 }
 
-// N.B.: For simplicity, we update fractionConsumed once a new 
response is fetched, not
-// when we reach the end of the current response. In practice, this 
choice is not
-// consequential.
-fractionConsumed = fractionConsumedFromLastResponse;
-ReadRowsResponse nextResponse = responseIterator.next();
+fractionConsumedFromPreviousResponse = 
fractionConsumedFromCurrentResponse;
+ReadRowsResponse currentResponse = responseIterator.next();
 decoder =
 DecoderFactory.get()
 .binaryDecoder(
-
nextResponse.getAvroRows().getSerializedBinaryRows().toByteArray(), decoder);
-fractionConsumedFromLastResponse = getFractionConsumed(nextResponse);
+
currentResponse.getAvroRows().getSerializedBinaryRows().toByteArray(), decoder);
+
+// Since we now have a new response, reset the row counter for the 
current response.
+rowsReadFromCurrentResponse = 0L;
+
+totalRowCountFromCurrentResponse = 
currentResponse.getAvroRows().getRowCount();
+fractionConsumedFromCurrentResponse = 
getFractionConsumed(currentResponse);
+
+Preconditions.checkArgument(
+totalRowCountFromCurrentResponse > 0L,
+"Row count from current response (%s) must be greater than one.",
+totalRowCountFromCurrentResponse);
+Preconditions.checkArgument(
+0f <= fractionConsumedFromCurrentResponse && 
fractionConsumedFromCurrentResponse <= 1f,
+"Fraction consumed from current response (%s) is not in the range 
[0.0, 1.0].",
+fractionConsumedFromCurrentResponse);
+Preconditions.checkArgument(
+fractionConsumedFromPreviousResponse < 
fractionConsumedFromCurrentResponse,
+"Fraction consumed from previous response (%s) is not less than 
fraction consumed "
++ "from current response (%s).",
+fractionConsumedFromPreviousResponse,
+fractionConsumedFromCurrentResponse);
   }
 
   record = datumReader.read(record, decoder);
   current = parseFn.apply(new SchemaAndRecord(record, tableSchema));
+
+  // Updates the fraction consumed value. This value is calculated by 
summing the fraction
+  // consumed value from the previous server response (or zero if we're 
consuming the first
+  // response) and by interpolating the fractional value in the current 
response based on how
+  // many rows have been consumed.
+  rowsReadFromCurrentResponse++;
+  fractionConsumed =
 
 Review comment:
   Ok. Sg.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290001)
Time Spent: 10h 50m  (was: 10h 40m)
Remaining Estimate: 493h 10m  (was: 493h 20m)

> Add support for dynamic worker re-balancing when reading BigQuery data using 
> Cloud Dataflow
> ---
>
> Key: BEAM-7495
> URL: https://issues.apache.org/jira/browse/BEAM-7495
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Aryan Naraghi
>Assignee: Aryan Naraghi
>Priority: Major
>   Original Estimate: 504h
>  Time Spent: 10h 50m
>  Remaining Estimate: 493h 10m
>
> Currently, the BigQuery connector for reading data using the BigQuery Storage 
> API does not support any of the facilities on the source for Dataflow to 
> split streams.
>  
> On the server side, the BigQuery Storage API supports splitting streams at a 
> fraction. By adding support to the connector, we enable Dataflow to split 
> streams, which unlocks dynamic worker re-balancing.
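
A purely illustrative sketch of what splitting at a fraction means for dynamic re-balancing, treating a reader's remaining work as a contiguous row range (this is not the connector's actual split logic): the primary keeps the head of the range and the residual tail can be handed to another worker.

final class SplitAtFractionSketch {

  // Returns the first row index of the residual when a reader over [start, end)
  // is asked to split at the given fraction of that range.
  static long splitPoint(long start, long end, double fraction) {
    return start + (long) Math.ceil((end - start) * fraction);
  }

  public static void main(String[] args) {
    // A reader over rows [0, 1000) asked to split at 0.5 keeps [0, 500)
    // and hands rows [500, 1000) back for another worker; prints 500.
    System.out.println(splitPoint(0, 1000, 0.5));
  }
}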



--

[jira] [Work logged] (BEAM-7495) Add support for dynamic worker re-balancing when reading BigQuery data using Cloud Dataflow

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7495?focusedWorklogId=290002=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-290002
 ]

ASF GitHub Bot logged work on BEAM-7495:


Author: ASF GitHub Bot
Created on: 06/Aug/19 21:25
Start Date: 06/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on issue #9156: [BEAM-7495] Add 
fine-grained progress reporting
URL: https://github.com/apache/beam/pull/9156#issuecomment-518850121
 
 
   LGTM. Thanks.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 290002)
Time Spent: 11h  (was: 10h 50m)
Remaining Estimate: 493h  (was: 493h 10m)

> Add support for dynamic worker re-balancing when reading BigQuery data using 
> Cloud Dataflow
> ---
>
> Key: BEAM-7495
> URL: https://issues.apache.org/jira/browse/BEAM-7495
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Aryan Naraghi
>Assignee: Aryan Naraghi
>Priority: Major
>   Original Estimate: 504h
>  Time Spent: 11h
>  Remaining Estimate: 493h
>
> Currently, the BigQuery connector for reading data using the BigQuery Storage 
> API does not support any of the facilities on the source for Dataflow to 
> split streams.
>  
> On the server side, the BigQuery Storage API supports splitting streams at a 
> fraction. By adding support to the connector, we enable Dataflow to split 
> streams, which unlocks dynamic worker re-balancing.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7667) report GCS throttling time to Dataflow autoscaler

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7667?focusedWorklogId=289987=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289987
 ]

ASF GitHub Bot logged work on BEAM-7667:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:58
Start Date: 06/Aug/19 20:58
Worklog Time Spent: 10m 
  Work Description: ihji commented on pull request #8973: [BEAM-7667] 
report GCS throttling time to Dataflow autoscaler
URL: https://github.com/apache/beam/pull/8973#discussion_r311271852
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/internal/clients/storage/storage_v1_client.py
 ##
 @@ -20,10 +20,17 @@
 
 from __future__ import absolute_import
 
+import logging
+import time
 
 Review comment:
   Thanks for the comment. Updated.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289987)
Time Spent: 1h 40m  (was: 1.5h)

> report GCS throttling time to Dataflow autoscaler
> -
>
> Key: BEAM-7667
> URL: https://issues.apache.org/jira/browse/BEAM-7667
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> report GCS throttling time to Dataflow autoscaler.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6907) Standardize Gradle projects/tasks structure for Python SDK

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6907?focusedWorklogId=289986=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289986
 ]

ASF GitHub Bot logged work on BEAM-6907:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:48
Start Date: 06/Aug/19 20:48
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on issue #9277: [BEAM-6907] Reuse 
Python tarball in tox & dataflow integration tests
URL: https://github.com/apache/beam/pull/9277#issuecomment-518838159
 
 
   Run Python Dataflow ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289986)
Time Spent: 0.5h  (was: 20m)

> Standardize Gradle projects/tasks structure for Python SDK
> --
>
> Key: BEAM-6907
> URL: https://issues.apache.org/jira/browse/BEAM-6907
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As Gradle parallelism was applied to Python tests and more Python versions were 
> added to the tests, the way Gradle manages projects/tasks changed a lot. Friction 
> arose during the Gradle refactor because some projects defined separate build 
> scripts under the source directory. Thus, it would be better to standardize how we 
> use Gradle. This will help manage Python tests/builds/tasks across different 
> versions and runners, and also make it easier for people to learn, use, and develop.
> In general, we may want to:
> - Apply parallel execution
> - Share common tasks
> - Centralize test related tasks
> - Have a clear Gradle structure for projects/tasks



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=289981=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289981
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:42
Start Date: 06/Aug/19 20:42
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-518836109
 
 
   Run XVR_Flink PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289981)
Time Spent: 23h 10m  (was: 23h)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 23h 10m
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=289980=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289980
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:38
Start Date: 06/Aug/19 20:38
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-518834437
 
 
   @tvalentyn This PR has nothing to do with `sdks/python/build.gradle`, thus 
using the same setting for both Python 2 and Python 3 is not easy. I've created 
a separate Jira issue: https://issues.apache.org/jira/browse/BEAM-7914
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289980)
Time Spent: 23h  (was: 22h 50m)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 23h
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=289978=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289978
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:33
Start Date: 06/Aug/19 20:33
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on issue #9280: [BEAM-7912] Optimize 
GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280#issuecomment-518832735
 
 
   Run Dataflow ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289978)
Time Spent: 20m  (was: 10m)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7914) add python 3 test in crossLanguageValidateRunner task

2019-08-06 Thread Heejong Lee (JIRA)
Heejong Lee created BEAM-7914:
-

 Summary: add python 3 test in crossLanguageValidateRunner task
 Key: BEAM-7914
 URL: https://issues.apache.org/jira/browse/BEAM-7914
 Project: Beam
  Issue Type: Improvement
  Components: testing
Reporter: Heejong Lee


add python 3 test in crossLanguageValidateRunner task



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?focusedWorklogId=289976=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289976
 ]

ASF GitHub Bot logged work on BEAM-7912:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:31
Start Date: 06/Aug/19 20:31
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on pull request #9280: [BEAM-7912] 
Optimize GroupIntoBatches for batch Dataflow pipelines.
URL: https://github.com/apache/beam/pull/9280
 
 
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   [Table of post-commit build-status badges for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners (standard PR template; truncated in this archive).]

[jira] [Commented] (BEAM-7913) Add drain() to DataflowPipelineJob

2019-08-06 Thread Sam Whittle (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901431#comment-16901431
 ] 

Sam Whittle commented on BEAM-7913:
---

Current way to drain a job, given a DataflowPipelineJob and DataflowPipelineOptions:

// Run the pipeline and keep a handle to the Dataflow job.
DataflowPipelineJob job = (DataflowPipelineJob) p.run();
...
DataflowPipelineOptions options = ...;
DataflowClient client = DataflowClient.create(options);
// Ask the service to move the running job into the draining state.
Job content = new Job();
content.setProjectId(job.getProjectId());
content.setId(job.getJobId());
content.setRequestedState("JOB_STATE_DRAINING");
client.updateJob(job.getJobId(), content);


> Add drain() to DataflowPipelineJob
> --
>
> Key: BEAM-7913
> URL: https://issues.apache.org/jira/browse/BEAM-7913
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Sam Whittle
>Assignee: Sam Whittle
>Priority: Minor
>
> Dataflow supports draining jobs but there is no easy programmatic way to do it.
> I propose adding a drain() method to DataflowPipelineJob similar to the 
> existing cancel() method (inherited from PipelineResult).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7913) Add drain() to DataflowPipelineJob

2019-08-06 Thread Sam Whittle (JIRA)
Sam Whittle created BEAM-7913:
-

 Summary: Add drain() to DataflowPipelineJob
 Key: BEAM-7913
 URL: https://issues.apache.org/jira/browse/BEAM-7913
 Project: Beam
  Issue Type: Improvement
  Components: runner-dataflow
Reporter: Sam Whittle
Assignee: Sam Whittle


Dataflow supports draining jobs but there is no easy programmatic way to do it.
I propose adding a drain() method to DataflowPipelineJob similar to the 
existing cancel() method (inherited from PipelineResult).
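
A hedged sketch of the proposed method, assuming drain() would mirror cancel() and simply request JOB_STATE_DRAINING from the service, as in the workaround shown in the comment above; the small client interface here is a stand-in for Beam's internal DataflowClient, not the actual implementation.

import com.google.api.services.dataflow.model.Job;
import java.io.IOException;

public class DataflowPipelineJobDrainSketch {

  /** Stand-in for the subset of Beam's DataflowClient used by this sketch. */
  interface JobUpdater {
    Job updateJob(String jobId, Job content) throws IOException;
  }

  private final String projectId;
  private final String jobId;
  private final JobUpdater client;

  DataflowPipelineJobDrainSketch(String projectId, String jobId, JobUpdater client) {
    this.projectId = projectId;
    this.jobId = jobId;
    this.client = client;
  }

  /** Requests that the running job drain, analogous to the existing cancel(). */
  public void drain() throws IOException {
    Job content = new Job();
    content.setProjectId(projectId);
    content.setId(jobId);
    content.setRequestedState("JOB_STATE_DRAINING");
    client.updateJob(jobId, content);
  }
}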



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=289970=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289970
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:15
Start Date: 06/Aug/19 20:15
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-518826637
 
 
   run seed job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289970)
Time Spent: 22h 50m  (was: 22h 40m)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 22h 50m
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-08-06 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=289969=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-289969
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 06/Aug/19 20:15
Start Date: 06/Aug/19 20:15
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-518826637
 
 
   run seed job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 289969)
Time Spent: 22h 40m  (was: 22.5h)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 22h 40m
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers the following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-06 Thread Luke Cwik (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik updated BEAM-7912:

Status: Open  (was: Triage Needed)

> Optimize GroupIntoBatches for batch Dataflow pipelines
> --
>
> Key: BEAM-7912
> URL: https://issues.apache.org/jira/browse/BEAM-7912
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Luke Cwik
>Assignee: Luke Cwik
>Priority: Minor
>
> The GroupIntoBatches transform can be significantly optimized on Dataflow 
> since it always ensures that a key K appears in only one bundle after a 
> GroupByKey. This removes the usage of state and timers in the generic 
> GroupIntoBatches transform.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7912) Optimize GroupIntoBatches for batch Dataflow pipelines

2019-08-06 Thread Luke Cwik (JIRA)
Luke Cwik created BEAM-7912:
---

 Summary: Optimize GroupIntoBatches for batch Dataflow pipelines
 Key: BEAM-7912
 URL: https://issues.apache.org/jira/browse/BEAM-7912
 Project: Beam
  Issue Type: Improvement
  Components: runner-dataflow
Reporter: Luke Cwik
Assignee: Luke Cwik


The GroupIntoBatches transform can be significantly optimized on Dataflow since 
it always ensures that a key K appears in only one bundle after a GroupByKey. 
This removes the usage of state and timers in the generic GroupIntoBatches 
transform.
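
A hedged sketch of this idea (illustrative only, not necessarily how PR #9280 implements it), assuming the batch-Dataflow guarantee described above that all values for a key arrive together after a GroupByKey; batching then becomes a plain DoFn over the grouped values, with no state or timers:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class BatchGroupedValuesFn<K, V> extends DoFn<KV<K, Iterable<V>>, KV<K, List<V>>> {
  private final int batchSize;

  BatchGroupedValuesFn(int batchSize) {
    this.batchSize = batchSize;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    K key = c.element().getKey();
    List<V> batch = new ArrayList<>();
    for (V value : c.element().getValue()) {
      batch.add(value);
      if (batch.size() >= batchSize) {
        // Emit a full batch and start a new one; no per-key state is needed because
        // the whole iterable for this key is available right here.
        c.output(KV.of(key, batch));
        batch = new ArrayList<>();
      }
    }
    if (!batch.isEmpty()) {
      c.output(KV.of(key, batch));
    }
  }
}

Such a DoFn would typically run as a ParDo directly after the GroupByKey, e.g. grouped.apply(ParDo.of(new BatchGroupedValuesFn<>(100))).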



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7911) test_corrupted_file test flak

2019-08-06 Thread Ahmet Altay (JIRA)
Ahmet Altay created BEAM-7911:
-

 Summary: test_corrupted_file test flak
 Key: BEAM-7911
 URL: https://issues.apache.org/jira/browse/BEAM-7911
 Project: Beam
  Issue Type: Bug
  Components: test-failures
Reporter: Ahmet Altay
Assignee: Heejong Lee


Looks like a flake:


https://builds.apache.org/job/beam_PreCommit_Python_Commit/7924/consoleFull

11:22:24 
11:22:24 ==
11:22:24 ERROR: test_corrupted_file (apache_beam.io.avroio_test.TestFastAvro)
11:22:24 --
11:22:24 Traceback (most recent call last):
11:22:24   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py36/build/srcs/sdks/python/apache_beam/io/avroio_test.py", line 380, in test_corrupted_file
11:22:24 self.assertEqual(0, exn.exception.message.find('Unexpected sync marker'))
11:22:24 AttributeError: '_AssertRaisesContext' object has no attribute 'exception'
11:22:24  >> begin captured logging << 
11:22:24 apache_beam.io.filesystem: DEBUG: translate_pattern: '/tmp/tmpckgw8mk6' -> '\\/tmp\\/tmpckgw8mk6'
11:22:24 - >> end captured logging << -
11:22:24 
11:22:24 --
11:22:24 XML: /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py36/build/srcs/sdks/python/nosetests.xml
11:22:24 --
11:22:24 Ran 2455 tests in 1297.267s
11:22:24 
11:22:24 FAILED (SKIP=546, errors=1)
11:22:24 ERROR: InvocationError for command /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py36/build/srcs/sdks/python/target/.tox-py36/py36/bin/python setup.py nosetests (exited with code 1)
11:22:24 py36 run-test-post: commands[0] | /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py36/build/srcs/sdks/python/scripts/run_tox_cleanup.sh
11:22:24 ___ summary ___
11:22:24 ERROR:   py36: commands failed
11:22:24 





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

