[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=361256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-361256 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 18/Dec/19 01:51
Worklog Time Spent: 10m

Work Description: tvalentyn commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-566830300

Postcommit tests are failing with this change: https://issues.apache.org/jira/browse/BEAM-8988

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Worklog Id: 361256
Time Spent: 19.5h (was: 19h 20m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> ------------------------------------------------------------------------------
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Chamikara Madhusanka Jayalath
> Assignee: Kamil Wasilewski
> Priority: Major
> Time Spent: 19.5h
> Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for the Python SDK [1].
> It can only be used by the Dataflow runner.
> We should implement a Beam BigQuery source that implements the
> iobase.BoundedSource [2] interface so that other runners that use the
> Python SDK can read from BigQuery as well. The Java SDK already has a Beam
> BigQuery source [3].
> [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189

--
This message was sent by Atlassian Jira (v8.3.4#803005)
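The issue description above asks for a source implementing the iobase.BoundedSource interface [2], which requires estimate_size, split, get_range_tracker, and read. As a rough illustration of that shape only, here is a minimal in-memory sketch using just the standard library: the names OffsetRangeTracker and InMemoryBoundedSource are simplified stand-ins, not Beam's actual classes, and a real Beam source returns SourceBundle objects from split() rather than plain tuples.

```python
class OffsetRangeTracker:
    """Simplified stand-in for Beam's range tracker over [start, stop)."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.position = start

    def try_claim(self, position):
        # Claim a record position; fails once we reach the end of the range.
        if position < self.stop:
            self.position = position
            return True
        return False


class InMemoryBoundedSource:
    """Illustrates the four methods a BoundedSource-style class provides."""
    def __init__(self, rows):
        self._rows = rows

    def estimate_size(self):
        # In the proposed BigQuery source this would come from table.numBytes
        # or a dry-run query's totalBytesProcessed instead.
        return sum(len(str(r)) for r in self._rows)

    def split(self, desired_bundle_size, start=0, stop=None):
        # Yield (start, stop) index ranges; Beam yields SourceBundle objects.
        stop = len(self._rows) if stop is None else stop
        for i in range(start, stop, desired_bundle_size):
            yield (i, min(i + desired_bundle_size, stop))

    def get_range_tracker(self, start, stop):
        stop = len(self._rows) if stop is None else stop
        return OffsetRangeTracker(start, stop)

    def read(self, range_tracker):
        pos = range_tracker.start
        while range_tracker.try_claim(pos):
            yield self._rows[pos]
            pos += 1


source = InMemoryBoundedSource(['a', 'b', 'c', 'd'])
bundles = list(source.split(desired_bundle_size=2))
rows = [r for start, stop in bundles
        for r in source.read(source.get_range_tracker(start, stop))]
```

A runner would call split() at pipeline construction time, then read() each bundle in parallel via its own range tracker.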
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=361126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-361126 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 21:30
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772

Worklog Id: 361126
Time Spent: 19h 20m (was: 19h 10m)
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360873 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 12:48
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358770537

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
         tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+class _PassThroughThenCleanup(PTransform):
+  """A PTransform that invokes a DoFn after the input PCollection has been
+  processed.
+  """
+  def __init__(self, cleanup_dofn):
+    self.cleanup_dofn = cleanup_dofn
+
+  def expand(self, input):
+    class PassThrough(beam.DoFn):
+      def process(self, element):
+        yield element
+
+    output = input | beam.ParDo(PassThrough()).with_outputs(
+        'cleanup_signal', main='main')
+    main_output = output['main']
+    cleanup_signal = output['cleanup_signal']
+
+    _ = (input.pipeline
+         | beam.Create([None])
+         | beam.ParDo(self.cleanup_dofn,
+                      beam.pvalue.AsSingleton(cleanup_signal)))
+
+    return main_output
+
+
+@experimental()
+class _ReadFromBigQuery(PTransform):
+  """Read data from BigQuery.
+
+  This PTransform uses a BigQuery export job to take a snapshot of the table
+  on GCS, and then reads from each produced JSON file.
+
+  Note that currently this source does not work with the DirectRunner.
+
+  Args:
+    table (str, callable, ValueProvider): The ID of the table, or a callable
+      that returns it. The ID must contain only letters ``a-z``, ``A-Z``,
+      numbers ``0-9``, or underscores ``_``. If the dataset argument is
+      :data:`None` then the table argument must contain the entire table
+      reference specified as: ``'DATASET.TABLE'`` or
+      ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one
+      argument representing an element to be written to BigQuery, and return
+      a TableReference, or a string table name as specified above.
+    dataset (str): The ID of the dataset containing this table, or
+      :data:`None` if the table reference is specified entirely by the table
+      argument.
+    project (str): The ID of the project containing this table.
+    query (str): A query to be used instead of the table, dataset, and
+      project arguments.
+    validate (bool): If :data:`True`, various checks will be done when the
+      source gets initialized (e.g., is the table present?). This should be
+      :data:`True` for most scenarios in order to catch errors as early as
+      possible (pipeline construction instead of pipeline execution). It
+      should be :data:`False` if the table is created during pipeline
+      execution by a previous step.
+    coder (~apache_beam.coders.coders.Coder): The coder for the table
+      rows. If :data:`None`, then the default coder is _JsonToDictCoder,
+      which will interpret every row as a JSON-serialized dictionary.
+    use_standard_sql (bool): Specifies whether to use BigQuery's standard
+      SQL dialect for this query. The default value is :data:`False`.
+      If set to :data:`True`, the query will use BigQuery's updated SQL
+      dialect with improved standards compliance.
+      This parameter is ignored for table inputs.
+    flatten_results (bool): Flattens all nested and repeated fields in the
+      query results. The default value is :data:`True`.
+    kms_key (str): Experimental. Optional Cloud KMS key name for use when
+      creating new temporary tables.
+    gcs_location (str): The name of the Google Cloud Storage bucket where
+      the extracted table should be written, as a string or
+      a :class:`~apache_beam.options.value_provider.ValueProvider`. If
+      :data:`None`, then the temp_location parameter is used.
+  """
+  def __init__(self, gcs_location=None, validate=False, *args, **kwargs):
+    if gcs_location:
+      if not isinstance(gcs_location, (str, unicode, ValueProvider)):
+        raise TypeError('%s: gcs_location must be of type string'
+                        ' or ValueProvider; got %r instead'
+                        % (self.__class__.__name__, type(gcs_location)))
+
+      if isinstance(gcs_location, (str, unicode)):
+        gcs_location = StaticValueProvider(str, gcs_location)
+    self.gcs_location = gcs_location
+    self.validate = validate
+
+    self._args = args
+    self._kwargs = kwargs
+
+  def _get_destination_uri(self, temp_location):
+    """Returns
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360641 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 01:42
Worklog Time Spent: 10m

Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358558194

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360536&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360536 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:32
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358501366

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
+
+
+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to a
+    list of tuples, to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract-to-JSON job
+        # doesn't preserve null fields.
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _CustomBigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)

Review comment: It seems the reported estimated size should be the output bytes, but if that's not easy
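The decoding logic in the _JsonToDictCoder hunk above can be illustrated stand-alone. BigQuery's JSON exports render scalars as strings and drop null fields, so each value is converted according to the schema's declared type and missing fields are restored as None, recursing into RECORD fields. This hedged stdlib sketch mirrors that logic; the (name, type, subfields) tuple shape for the schema is an assumption made here for brevity, not the TableFieldSchema shape used in Beam.

```python
import json
from decimal import Decimal

# Per-type converters, as in _JsonToDictCoder above; types without an entry
# (e.g. STRING, TIMESTAMP) are left unchanged.
CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': Decimal,
    'BYTES': lambda v: v.encode('utf-8'),
}


def decode_row(json_line, schema):
    """schema: list of (name, type, subfields) tuples (hypothetical shape)."""
    return _apply_schema(json.loads(json_line), schema)


def _apply_schema(row, schema):
    for name, ftype, subfields in schema:
        if name not in row:
            # Null fields are dropped by the export job; restore them as None.
            row[name] = None
        elif ftype == 'RECORD':
            # Nested records are decoded recursively with their subschema.
            row[name] = _apply_schema(row[name], subfields)
        else:
            converter = CONVERTERS.get(ftype)
            if converter is not None:
                row[name] = converter(row[name])
    return row


schema = [('id', 'INTEGER', None),
          ('active', 'BOOLEAN', None),
          ('address', 'RECORD', [('zip', 'STRING', None)])]
row = decode_row('{"id": "7", "address": {"zip": "02139"}}', schema)
```

Here the exported string "7" becomes the int 7, the missing 'active' field becomes None, and the nested 'zip' string passes through untouched.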
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360532&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360532 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:30
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358500733

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[quotes the same _JsonToDictCoder/_CustomBigQuerySource hunk as above, ending:]
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self,
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360529&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360529 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:26
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358499063

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=358350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-358350 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 12/Dec/19 06:20
Worklog Time Spent: 10m

Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r356974540

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357857&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357857 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 14:16
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-564562180

Run Python PreCommit

Worklog Id: 357857
Time Spent: 18h 10m (was: 18h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357807=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357807 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 12:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564512207):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 357807) Time Spent: 18h (was: 17h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357776=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357776 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 11:08
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564492437):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 357776) Time Spent: 17h 50m (was: 17h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357537=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357537 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 00:11
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564317779):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357537) Time Spent: 17h 40m (was: 17.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357069=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357069 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 13:20
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564028930):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357069) Time Spent: 17.5h (was: 17h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357004=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357004 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 11:06
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-563983582):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357004) Time Spent: 17h 20m (was: 17h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=356905=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-356905 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 09:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-563934421):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 356905) Time Spent: 17h 10m (was: 17h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355311=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355311 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 16:34
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562644116):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 355311) Time Spent: 17h (was: 16h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355257=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355257 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562611582):
Jira issue to switch to avro: https://issues.apache.org/jira/browse/BEAM-8910
Issue Time Tracking: Worklog Id: (was: 355257) Time Spent: 16h 50m (was: 16h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355247=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355247 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562605478):

Thanks @robertwb for your comments!

> Why does this not work on the direct runners. Is it an issue of needing to be split first?

Yes. I've already created a Jira for this: https://issues.apache.org/jira/browse/BEAM-8528

> would it make sense to implement this as an SDF instead?

My first attempt was a regular (non-splittable) DoFn that triggers an export job, followed by `MatchAll` and `ReadMatches` transforms. This worked, but I had trouble implementing the rest: waiting for the query job, waiting for the export job, and removing the JSON files after reading. Using the Source API turned out to be simpler.

Issue Time Tracking: Worklog Id: (was: 355247) Time Spent: 16.5h (was: 16h 20m)
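The difficulty kamilwu describes above (waiting for the query job and then the export job before reading) boils down to polling job state until it reaches DONE. A minimal sketch of such a poller, with a caller-supplied status function standing in for a real jobs.get call; `wait_for_job` and its parameters are illustrative, not Beam or BigQuery APIs:

```python
import time


def wait_for_job(poll_status, timeout_s=300.0, interval_s=0.5):
    """Polls a job until it reports 'DONE', or raises TimeoutError.

    poll_status is a caller-supplied zero-argument function returning the
    current job state string, e.g. a closure around a BigQuery jobs.get
    request (all names here are illustrative).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if poll_status() == 'DONE':
            return
        time.sleep(interval_s)
    raise TimeoutError('job did not finish within %.0fs' % timeout_s)


# Simulate a job that finishes on the third poll.
states = iter(['PENDING', 'RUNNING', 'DONE'])
wait_for_job(lambda: next(states), interval_s=0)
```

In a real pipeline the export step would only start after the query poller returns, which is the sequencing the Source API handles for free and a plain DoFn has to orchestrate by hand.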
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355248=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355248 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562605478)
Issue Time Tracking: Worklog Id: (was: 355248) Time Spent: 16h 40m (was: 16.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355207=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355207 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 14:18
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772 (URL: https://github.com/apache/beam/pull/9772#discussion_r354853510)

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _CustomBigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)

Review comment:
Input. And to be more precise - that's the number of bytes which must be read from
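The `_JsonToDictCoder` quoted above fills schema fields that are missing from a row with None and applies a per-type converter to everything else. A simplified, standalone re-implementation of that decoding rule for illustration (this is not the Beam class itself; `decode_row` and the flat `[(name, type)]` schema shape are assumptions made for the sketch, and nested RECORD handling is omitted):

```python
import decimal
import json

# Converters keyed by BigQuery type name, mirroring the quoted coder.
CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
    """Decodes one exported-JSON row against a [(name, type)] schema.

    Fields present in the schema but absent from the row are filled with
    None, because the extract-to-JSON job does not preserve null fields.
    Types without a converter (e.g. STRING) pass through unchanged.
    """
    row = json.loads(json_line)
    for name, bq_type in schema:
        if name not in row:
            row[name] = None
        elif bq_type in CONVERTERS:
            row[name] = CONVERTERS[bq_type](row[name])
    return row


schema = [('user_id', 'INTEGER'), ('score', 'FLOAT'),
          ('active', 'BOOLEAN'), ('nickname', 'STRING')]
row = decode_row('{"user_id": "42", "score": "1.5", "active": "true"}', schema)
# row == {'user_id': 42, 'score': 1.5, 'active': True, 'nickname': None}
```

The null-fill step is the part the comment in the diff calls out: the schema, not the row, is the source of truth for which fields exist.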
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355200=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355200 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 14:07
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772 (URL: https://github.com/apache/beam/pull/9772#discussion_r354848637)

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@

+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql

Review comment:
It was copy-pasted from the native BigQuery source (`BigQuerySource` class), because their interfaces are mostly the same. I'll check if this can be improved.

Issue Time Tracking: Worklog Id: (was: 355200) Time Spent: 16h 10m (was: 16h)
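The `_convert_to_tuple` helper quoted in this thread exists so the coder holds only plain namedtuples, which pickle cleanly, instead of the generated client-library schema objects. A minimal standalone sketch of the same idea; `SchemaField` here is a hypothetical stand-in for `TableFieldSchema`, and `convert_to_tuple` is a simplified version of the quoted classmethod:

```python
import collections
import pickle

# Mirrors the FieldSchema namedtuple in the patch.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class SchemaField(object):
    """Hypothetical stand-in for a client-library TableFieldSchema."""
    def __init__(self, name, type, mode='NULLABLE', fields=()):
        self.name, self.type, self.mode, self.fields = name, type, mode, fields


def convert_to_tuple(schema_fields):
    """Recursively converts schema objects into plain namedtuples."""
    return [FieldSchema(convert_to_tuple(f.fields), f.mode, f.name, f.type)
            for f in schema_fields]


nested = [SchemaField('address', 'RECORD',
                      fields=[SchemaField('city', 'STRING')])]
tuples = convert_to_tuple(nested)
# Namedtuples survive a pickle round trip unchanged, which is what the
# coder needs when it is shipped to workers.
restored = pickle.loads(pickle.dumps(tuples))
```

The recursion bottoms out on the empty `fields` tuple of leaf fields, so nested RECORD schemas convert to nested namedtuples of arbitrary depth.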
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355196&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355196 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 14:05
Start Date: 06/Dec/19 14:05
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354848017

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
              kms_key=self.kms_key)

FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _JsonToDictCoder(coders.Coder):
  """A coder for a JSON string to a Python dict."""

  def __init__(self, table_schema):
    self.fields = self._convert_to_tuple(table_schema.fields)
    self._converters = {
        'INTEGER': int,
        'INT64': int,
        'FLOAT': float,
        'BOOLEAN': _to_bool,
        'NUMERIC': _to_decimal,
        'BYTES': _to_bytes,
    }

  @classmethod
  def _convert_to_tuple(cls, table_field_schemas):
    """Recursively converts the list of TableFieldSchema instances to the
    list of tuples to prevent errors when pickling and unpickling
    TableFieldSchema instances.
    """
    if not table_field_schemas:
      return []

    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
                        x.type)
            for x in table_field_schemas]

  def decode(self, value):
    value = json.loads(value)
    return self._decode_with_schema(value, self.fields)

  def _decode_with_schema(self, value, schema_fields):
    for field in schema_fields:
      if field.name not in value:
        # The field exists in the schema, but it doesn't exist in this row.
        # It probably means its value was null, as the extract to JSON job
        # doesn't preserve null fields
        value[field.name] = None
        continue

      if field.type == 'RECORD':
        value[field.name] = self._decode_with_schema(value[field.name],
                                                     field.fields)
      else:
        try:
          converter = self._converters[field.type]
          value[field.name] = converter(value[field.name])
        except KeyError:
          # No need to do any conversion
          pass
    return value

  def is_deterministic(self):
    return True

  def to_type_hint(self):
    return dict


class _CustomBigQuerySource(BoundedSource):
  def __init__(self, gcs_location=None, table=None, dataset=None,
               project=None, query=None, validate=False, coder=None,
               use_standard_sql=False, flatten_results=True, kms_key=None):
    if table is not None and query is not None:
      raise ValueError('Both a BigQuery table and a query were specified.'
                       ' Please specify only one of these.')
    elif table is None and query is None:
      raise ValueError('A BigQuery table or a query must be specified')
    elif table is not None:
      self.table_reference = bigquery_tools.parse_table_reference(
          table, dataset, project)
      self.query = None
      self.use_legacy_sql = True
    else:
      self.query = query
      # TODO(BEAM-1082): Change the internal flag to be standard_sql
      self.use_legacy_sql = not use_standard_sql
      self.table_reference = None

    self.gcs_location = gcs_location
    self.project = project
    self.validate = validate
    self.flatten_results = flatten_results
    self.coder = coder or _JsonToDictCoder
    self.kms_key = kms_key
    self.split_result = None

  def estimate_size(self):
    bq = bigquery_tools.BigQueryWrapper()
    if self.table_reference is not None:
      table = bq.get_table(self.table_reference.projectId,
                           self.table_reference.datasetId,
                           self.table_reference.tableId)
      return int(table.numBytes)
    else:
      self._setup_temporary_dataset(bq)
      job = bq._start_query_job(self.project, self.query,
                                self.use_legacy_sql, self.flatten_results,
                                job_id=uuid.uuid4().hex, dry_run=True,
                                kms_key=self.kms_key)
      size = int(job.statistics.totalBytesProcessed)

      bq.clean_up_temporary_dataset(self.project)

      return size

  def split(self,
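To make the decoding behavior in the hunk above concrete, here is a minimal self-contained sketch of the same idea: fields missing from an exported JSON row are backfilled with None (the extract job drops null fields), and typed converters are applied where one exists. This is not the Beam implementation; `decode_row`, the sample schema, and the field names are invented for illustration.

```python
import collections
import json

# Simplified stand-in for the FieldSchema namedtuple in the hunk above.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')

# Hypothetical two-field schema: an INTEGER column and a BOOLEAN column.
SCHEMA = [
    FieldSchema([], 'NULLABLE', 'user_id', 'INTEGER'),
    FieldSchema([], 'NULLABLE', 'active', 'BOOLEAN'),
]

# BigQuery JSON exports carry integers and booleans as strings.
CONVERTERS = {'INTEGER': int, 'BOOLEAN': lambda v: v == 'true'}


def decode_row(json_line, schema_fields):
  """Decodes one line of a JSON export against a schema.

  Fields present in the schema but absent from the row are backfilled
  with None, because the extract-to-JSON job drops null fields.
  """
  row = json.loads(json_line)
  for field in schema_fields:
    if field.name not in row:
      row[field.name] = None  # null field dropped by the export job
    elif field.type in CONVERTERS:
      row[field.name] = CONVERTERS[field.type](row[field.name])
  return row


# 'active' is absent from the row: the export job dropped the null value.
print(decode_row('{"user_id": "42"}', SCHEMA))
# → {'user_id': 42, 'active': None}
```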
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355189&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355189 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 13:53
Start Date: 06/Dec/19 13:53
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354842168

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to the `bq._start_query_job(...)` call in `estimate_size`)

Review comment: This one does not block, because dry_run is `True`. When dry_run is true, the job is not actually run. Instead, we get some processing statistics (one of them is the number of bytes read by the query).

> Is estimate_size() called during pipeline construction? Is it guaranteed to be called only (exactly) once?

I think a runner is
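The dry-run contract kamilwu describes can be sketched with a toy model: with `dry_run=True` the service only validates and costs the query, returning statistics such as `totalBytesProcessed`, and nothing executes. The client below is a fake written for this sketch; `FakeBigQueryWrapper`, `start_query_job`, and the 1 MiB figure are all invented, not the real `bigquery_tools` API.

```python
import collections

# Minimal stand-ins for the job objects; purely illustrative.
JobStatistics = collections.namedtuple('JobStatistics', 'totalBytesProcessed')
Job = collections.namedtuple('Job', 'statistics executed')


class FakeBigQueryWrapper(object):
  """Pretend client: validates and costs a query without running it."""

  BYTES_PER_QUERY = 1 << 20  # pretend the planner reports 1 MiB scanned

  def start_query_job(self, query, dry_run=False):
    # With dry_run=True the service only plans the query and returns
    # statistics such as totalBytesProcessed; the job never executes.
    return Job(statistics=JobStatistics(self.BYTES_PER_QUERY),
               executed=not dry_run)


def estimate_size(client, query):
  """Mirrors the size-estimation call pattern under review."""
  job = client.start_query_job(query, dry_run=True)
  return int(job.statistics.totalBytesProcessed)


client = FakeBigQueryWrapper()
print(estimate_size(client, 'SELECT * FROM dataset.table'))  # → 1048576
```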
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355188&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355188 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 13:45
Start Date: 06/Dec/19 13:45
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354838895

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354750&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354750 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545674

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354752 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354582319

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354751&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354751 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354581717

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to `size = int(job.statistics.totalBytesProcessed)` in `estimate_size`)

Review comment: Is this input or output bytes?
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354749&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354749 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545299

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to the `bq._start_query_job(self.project, self.query,` call in `estimate_size`)

Review comment: Does this block? Is estimate_size() called during pipeline construction? Is it guaranteed to be called only (exactly) once?
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354748&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354748 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545010

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to `# TODO(BEAM-1082): Change the internal flag to be standard_sql`)

Review comment: Was this copied-pasted from somewhere? If so, can we share code?
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 354748) Time Spent: 15h (was: 14h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 15h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This
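The hunk above introduces `_JsonToDictCoder`, whose core idea is a map from BigQuery type names to Python converters: the JSON export job renders scalars as strings, so each field is converted back according to its schema type, and fields the export dropped are restored as `None`. A minimal standalone sketch of that idea (a hypothetical simplified helper, not the Beam coder itself; `decode_row` and its flat `{name: type}` schema argument are illustrative assumptions):

```python
import decimal
import json

# Converter map mirroring the one in _JsonToDictCoder: BigQuery's JSON
# export renders every scalar value as a string.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
    """Decode one exported JSON row using a flat {name: type_name} schema."""
    row = json.loads(json_line)
    for name, type_name in schema.items():
        if name not in row:
            # The export job drops null fields; restore them as None.
            row[name] = None
        elif type_name in _CONVERTERS:
            row[name] = _CONVERTERS[type_name](row[name])
        # Unknown types (e.g. STRING) need no conversion, matching the
        # KeyError branch in the real coder.
    return row
```

The real coder additionally recurses into `RECORD` fields with the nested schema, which this flat sketch omits.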
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354565 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:33
Start Date: 05/Dec/19 18:33
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562256730

Run Python 3.5 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354565)
Time Spent: 14h 50m (was: 14h 40m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Chamikara Madhusanka Jayalath
> Assignee: Kamil Wasilewski
> Priority: Major
> Time Spent: 14h 50m
> Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements
> iobase.BoundedSource [2] interface so that other runners that try to use
> Python SDK can read from BigQuery as well. Java SDK already has a Beam
> BigQuery source [3].
>
> [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354539&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354539 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:15
Start Date: 05/Dec/19 18:15
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562249250

As a final note, please create issues to switch to Avro, and to enable tests in other runners (e.g. the direct runner).

Issue Time Tracking
---
Worklog Id: (was: 354539)
Time Spent: 14h 40m (was: 14.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354533&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354533 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:04
Start Date: 05/Dec/19 18:04
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354465023

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##

@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)


+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields.
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
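The `_convert_to_tuple` classmethod in the hunk above exists because the generated `TableFieldSchema` objects cause errors when pickled, so the schema is rewritten as nested namedtuples before the source gets serialized. A standalone sketch of the same recursion; `FakeTableFieldSchema` is a hypothetical stand-in for the real generated class, used here only so the example is self-contained:

```python
import collections

# Plain, picklable representation of one schema field; nested fields are
# themselves lists of FieldSchema tuples.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class FakeTableFieldSchema(object):
    """Hypothetical stand-in for the generated TableFieldSchema object."""

    def __init__(self, name, field_type, mode='NULLABLE', fields=()):
        self.name = name
        self.type = field_type
        self.mode = mode
        self.fields = list(fields)


def convert_to_tuple(table_field_schemas):
    """Recursively rewrite schema objects as picklable namedtuples."""
    if not table_field_schemas:
        return []
    return [FieldSchema(convert_to_tuple(x.fields), x.mode, x.name, x.type)
            for x in table_field_schemas]
```

Because namedtuples are plain tuples under the hood, the converted schema survives pickling and round-trips between pipeline construction time and worker execution time.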
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354493&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354493 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 17:34
Start Date: 05/Dec/19 17:34
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562233145

Run Python 3.5 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354493)
Time Spent: 14h 20m (was: 14h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354443 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:38
Start Date: 05/Dec/19 16:38
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562209713

I've tidied up the commit history a bit. Also, I've renamed `ReadFromBigQuery` to `_ReadFromBigQuery`; I forgot to do this in an earlier commit.

Issue Time Tracking
---
Worklog Id: (was: 354443)
Time Spent: 14h 10m (was: 14h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354439&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354439 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:35
Start Date: 05/Dec/19 16:35
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562208319

Run Python 2 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354439)
Time Spent: 14h (was: 13h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354438&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354438 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:34
Start Date: 05/Dec/19 16:34
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562208201

Run Portable_Python PreCommit

Issue Time Tracking
---
Worklog Id: (was: 354438)
Time Spent: 13h 50m (was: 13h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354147&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354147 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 08:57
Start Date: 05/Dec/19 08:57
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354177031

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353687 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 19:28
Start Date: 04/Dec/19 19:28
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353239750

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353684 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 19:27
Start Date: 04/Dec/19 19:27
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353938086

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353680=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353680 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 04/Dec/19 19:26 Start Date: 04/Dec/19 19:26 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r353937613 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self): tableSpec) res['table'] = DisplayDataItem(tableSpec, label='Table') return res + + +@experimental() +class PassThroughThenCleanup(PTransform): + """A PTransform that invokes a DoFn after the input PCollection has been +processed. + """ + def __init__(self, cleanup_dofn): +self.cleanup_dofn = cleanup_dofn + + def expand(self, input): +class PassThrough(beam.DoFn): + def process(self, element): +yield element + +output = input | beam.ParDo(PassThrough()).with_outputs('cleanup_signal', +main='main') +main_output = output['main'] +cleanup_signal = output['cleanup_signal'] + +_ = (input.pipeline + | beam.Create([None]) + | beam.ParDo(self.cleanup_dofn, beam.pvalue.AsSingleton( + cleanup_signal))) + +return main_output + + +@experimental() +class ReadFromBigQuery(PTransform): Review comment: Yes, that will be great to compare the Native Dataflow source with the new Beam source. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 353680) Time Spent: 13h 10m (was: 13h)
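The control flow that the `PassThroughThenCleanup` transform quoted above sets up — elements pass through untouched, and a cleanup action fires once, only after the data has been seen — can be modeled without Beam as a generator. The function name here is illustrative, not a Beam API; the real transform achieves the ordering through an empty side output consumed as a side input.

```python
def pass_through_then_cleanup(elements, cleanup):
    """Yield every element unchanged, then invoke cleanup exactly once.

    In the real PTransform the 'cleanup signal' is an empty side output
    of the pass-through ParDo, used as a side input so the runner orders
    cleanup after processing; a plain generator models that ordering.
    """
    for element in elements:
        yield element
    cleanup()  # runs only after all elements have been consumed


events = []
out = list(pass_through_then_cleanup([1, 2, 3], lambda: events.append('cleanup')))
assert out == [1, 2, 3]
assert events == ['cleanup']
```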
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353500&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353500 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 14:50
Start Date: 04/Dec/19 14:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561679198

Run Python PreCommit

Issue Time Tracking --- Worklog Id: (was: 353500) Time Spent: 13h (was: 12h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353297&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353297 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 08:42
Start Date: 04/Dec/19 08:42
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561536467

Run Python PreCommit

Issue Time Tracking --- Worklog Id: (was: 353297) Time Spent: 12h 50m (was: 12h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352698 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 15:22
Start Date: 03/Dec/19 15:22
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561216098

Run Python 2 PostCommit

Issue Time Tracking --- Worklog Id: (was: 352698) Time Spent: 12h 40m (was: 12.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352696&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352696 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 15:18
Start Date: 03/Dec/19 15:18
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353239750

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
+
+
+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
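The decoding rules in the quoted `_JsonToDictCoder` can be condensed into a self-contained sketch: per-type converters, pass-through for unknown types (mirroring the `KeyError` fallback), and null-filling for fields the extract-to-JSON job dropped. The simplified `FieldSchema` below carries only a name and type and is an illustration, not the PR's actual namedtuple.

```python
import decimal
import json
from collections import namedtuple

# Simplified schema record for illustration (the PR's version also
# carries 'fields' and 'mode' for nested RECORDs).
FieldSchema = namedtuple('FieldSchema', 'name type')

CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, fields):
    row = json.loads(json_line)
    for field in fields:
        if field.name not in row:
            # The extract-to-JSON job drops null fields; restore as None.
            row[field.name] = None
            continue
        # Unknown types (e.g. STRING) are left as-is, mirroring the
        # KeyError fallback in _JsonToDictCoder.
        converter = CONVERTERS.get(field.type, lambda v: v)
        row[field.name] = converter(row[field.name])
    return row


fields = [FieldSchema('id', 'INTEGER'), FieldSchema('ok', 'BOOLEAN'),
          FieldSchema('note', 'STRING')]
row = decode_row('{"id": "7", "ok": "true"}', fields)
assert row == {'id': 7, 'ok': True, 'note': None}
```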
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352667 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:42
Start Date: 03/Dec/19 14:42
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353217656

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:

Review comment: Yes. I wanted to implement the same behavior as we already have when using default coder for the native BigQuerySource. There is a [test](https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py#L232) that checks BigQuery data types conversions across both sources.

Issue Time Tracking --- Worklog Id: (was: 352667) Time Spent: 12h 20m (was: 12h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352643 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:15
Start Date: 03/Dec/19 14:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353202013

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -700,30 +675,19 @@ def _export_files(self, bq):
       bigquery.TableSchema instance, a list of FileMetadata instances
     """
     job_id = uuid.uuid4().hex
-    destination = self._get_destination_uri(self.gcs_bucket_name, job_id)
-    job_ref = bq.perform_extract_job([destination], job_id,
+    job_ref = bq.perform_extract_job([self.gcs_location], job_id,
                                      self.table_reference,
                                      bigquery_tools.ExportFileFormat.JSON,
                                      include_header=False)
     bq.wait_for_bq_job(job_ref)
-    metadata_list = FileSystems.match([destination])[0].metadata_list
+    metadata_list = FileSystems.match([self.gcs_location])[0].metadata_list

Review comment: Yes. The thing I used is called `Single wildcard URI` [1]. In this case, an extract job creates one or many files and all of them are created in the same directory.
[1] https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
Issue Time Tracking --- Worklog Id: (was: 352643) Time Spent: 12h (was: 11h 50m)
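The "single wildcard URI" mechanism the comment describes can be sketched as plain string formatting: one extract destination containing a single `*`, which BigQuery expands into as many files as the export needs, all in one directory. The helper name and path layout below are illustrative assumptions, not the PR's actual code.

```python
import uuid


def export_destination_uri(gcs_base, job_id=None):
    """Build an extract-job destination containing a single '*' wildcard.

    BigQuery expands the wildcard so one extract job can fan out into
    many files, all landing in the same directory; FileSystems.match on
    the same pattern then finds every produced file.
    """
    job_id = job_id or uuid.uuid4().hex
    return '{}/{}/bigquery-table-dump-*.json'.format(gcs_base.rstrip('/'), job_id)


uri = export_destination_uri('gs://my-bucket/tmp', job_id='abc123')
assert uri == 'gs://my-bucket/tmp/abc123/bigquery-table-dump-*.json'
assert uri.count('*') == 1  # BigQuery allows at most one wildcard per URI
```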
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352644 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:15
Start Date: 03/Dec/19 14:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353202269

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):

Review comment: +1

Issue Time Tracking --- Worklog Id: (was: 352644) Time Spent: 12h 10m (was: 12h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352623 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:50
Start Date: 03/Dec/19 13:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353188057

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1268,3 +1461,140 @@ def display_data(self):
                                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):

Review comment: I don't think it needs to be public. I'll make it private then.

Issue Time Tracking --- Worklog Id: (was: 352623) Time Spent: 11h 50m (was: 11h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352622 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:49
Start Date: 03/Dec/19 13:49
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353187727

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1268,3 +1461,140 @@ def display_data(self):
[...]
+@experimental()
+class ReadFromBigQuery(PTransform):

Review comment: I'm totally fine with it. Once this PR is merged, I'm going to make changes to the Chicago Taxi Example, so that the example would use this transform. That would be a great opportunity to check stability and measure performance of the transform.

Issue Time Tracking --- Worklog Id: (was: 352622) Time Spent: 11h 40m (was: 11.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352611&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352611 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:31
Start Date: 03/Dec/19 13:31
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353178188

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350123&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350123 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:44 Start Date: 27/Nov/19 00:44 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351048663 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350119&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350119 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:33 Start Date: 27/Nov/19 00:33 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351045732 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350120&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350120 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:33 Start Date: 27/Nov/19 00:33 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351046099 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350104&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350104 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 23:24 Start Date: 26/Nov/19 23:24 Worklog Time Spent: 10m Work Description: ananvay commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#issuecomment-558860259 @robertwb /cc: @ananvay This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 350104) Time Spent: 11h (was: 10h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 11h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349937&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349937 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350905378 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@

+class _BigQuerySource(BoundedSource):

Review comment: Please rename it _CustomBigQuerySource - as we already have a BigQuerySource. To avoid confusion.

Issue Time Tracking --- Worklog Id: (was: 349937) Time Spent: 10h 40m (was: 10.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349938 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350905744 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self):
                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):

Review comment: Does this class need to be public? If it should be public, it should be in a different file. If not, let's make it private and only use it in this file.

Issue Time Tracking --- Worklog Id: (was: 349938) Time Spent: 10h 40m (was: 10.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349936&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349936 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350918239 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@

+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:

Review comment: Does this mean that for other data types, we pass them as they are? e.g. for datetime data, or other like that?

Issue Time Tracking --- Worklog Id: (was: 349936) Time Spent: 10h 40m (was: 10.5h)
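On the question in the review comment above: with the converter lookup guarded by `except KeyError`, any type absent from the map (STRING, TIMESTAMP, DATE, and so on) is left as the raw string produced by the JSON export. A minimal standalone illustration of that pattern (the converter table here is abbreviated and illustrative, not the patch's full table):

```python
converters = {'INTEGER': int, 'FLOAT': float}

def convert(field_type, value):
    # Mirrors the quoted pattern: unknown types fall through unconverted.
    try:
        return converters[field_type](value)
    except KeyError:
        return value

print(convert('INTEGER', '42'))                          # -> 42
print(convert('TIMESTAMP', '2019-11-26 18:45:00 UTC'))   # -> unchanged string
```

Whether passing timestamps through as strings is acceptable is exactly the design question the reviewer raises.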
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349940&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349940 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r346020240 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -700,30 +675,19 @@ def _export_files(self, bq):
       bigquery.TableSchema instance, a list of FileMetadata instances
     """
     job_id = uuid.uuid4().hex
-    destination = self._get_destination_uri(self.gcs_bucket_name, job_id)
-    job_ref = bq.perform_extract_job([destination], job_id,
+    job_ref = bq.perform_extract_job([self.gcs_location], job_id,
                                      self.table_reference,
                                      bigquery_tools.ExportFileFormat.JSON,
                                      include_header=False)
     bq.wait_for_bq_job(job_ref)
-    metadata_list = FileSystems.match([destination])[0].metadata_list
+    metadata_list = FileSystems.match([self.gcs_location])[0].metadata_list

Review comment: Is this enough to match the files in that location?

Issue Time Tracking --- Worklog Id: (was: 349940) Time Spent: 10h 50m (was: 10h 40m)
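Context for the matching question above: a BigQuery extract job can shard its output across many files, so the location handed to the matcher generally needs to be a glob pattern (e.g. ending in `*`) rather than a single exact URI. A small stdlib sketch of that distinction (the file names and bucket are hypothetical, and `fnmatch` stands in for the filesystem matcher):

```python
import fnmatch

# Hypothetical sharded output of one extract job.
exported = [
    'gs://bucket/tmp/job123/000000000000.json',
    'gs://bucket/tmp/job123/000000000001.json',
]

exact_uri = 'gs://bucket/tmp/job123/000000000000.json'
glob_uri = 'gs://bucket/tmp/job123/*.json'

# An exact URI matches only one shard; a glob matches all of them.
print([f for f in exported if fnmatch.fnmatch(f, exact_uri)])
print([f for f in exported if fnmatch.fnmatch(f, glob_uri)])
```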
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349939 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350906429 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self):
                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):
+  """A PTransform that invokes a DoFn after the input PCollection has been
+  processed.
+  """
+  def __init__(self, cleanup_dofn):
+    self.cleanup_dofn = cleanup_dofn
+
+  def expand(self, input):
+    class PassThrough(beam.DoFn):
+      def process(self, element):
+        yield element
+
+    output = input | beam.ParDo(PassThrough()).with_outputs('cleanup_signal',
+                                                            main='main')
+    main_output = output['main']
+    cleanup_signal = output['cleanup_signal']
+
+    _ = (input.pipeline
+         | beam.Create([None])
+         | beam.ParDo(self.cleanup_dofn, beam.pvalue.AsSingleton(
+             cleanup_signal)))
+
+    return main_output
+
+
+@experimental()
+class ReadFromBigQuery(PTransform):

Review comment: This is looking great. I've discussed with Cham, and let's rename this as `_ReadFromBigQuery` (with underscore) to prevent users from picking it up before we have tested it. We have some tests that we'll run on it, and once we're confident of performance/functionality, we can remove the underscore, and rename to `ReadFromBigQuery`. WDYT?

Issue Time Tracking --- Worklog Id: (was: 349939) Time Spent: 10h 50m (was: 10h 40m)
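The `PassThroughThenCleanup` transform quoted above implements a common pattern: forward every element unchanged, then run a cleanup action (here, deleting temporary export files) only after the whole collection has been consumed. Outside Beam, the same idea can be sketched with a plain generator; this is illustrative only, since the real transform uses an empty side input as the completion signal rather than iterator exhaustion:

```python
def pass_through_then_cleanup(elements, cleanup):
    """Yield elements unchanged; invoke cleanup() after the last one."""
    for element in elements:
        yield element
    cleanup()  # runs only once iteration is exhausted

deleted = []
out = list(pass_through_then_cleanup([1, 2, 3],
                                     lambda: deleted.append('temp files')))
print(out, deleted)
```

The key property in both versions is ordering: cleanup must not run while downstream consumers may still need the data.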
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349897 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 17:27 Start Date: 26/Nov/19 17:27 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#issuecomment-558735446 Looking once more. Issue Time Tracking --- Worklog Id: (was: 349897) Time Spent: 10.5h (was: 10h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348410=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348410 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:51
Start Date: 22/Nov/19 22:51
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723473
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348410) Time Spent: 10h 20m (was: 10h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348409=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348409 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:51
Start Date: 22/Nov/19 22:51
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723317
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348409) Time Spent: 10h 10m (was: 10h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348408 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:50
Start Date: 22/Nov/19 22:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723317
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348408) Time Spent: 10h (was: 9h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=347572=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-347572 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 21/Nov/19 18:15
Start Date: 21/Nov/19 18:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557208165
Resolved merge conflicts. @pabloem Could you take a look once again?
Issue Time Tracking --- Worklog Id: (was: 347572) Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=341473=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341473 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Nov/19 21:03
Start Date: 11/Nov/19 21:03
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-552614792
I had been traveling. I'll take a look now.
Issue Time Tracking --- Worklog Id: (was: 341473) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339983=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339983 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 07/Nov/19 15:00
Start Date: 07/Nov/19 15:00
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r343697148

File path: sdks/python/apache_beam/io/gcp/bigquery.py (@@ -496,6 +506,189 @@), code under review:

```python
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _JsonToDictCoder(coders.Coder):
  """A coder for a JSON string to a Python dict."""

  def __init__(self, table_schema):
    self.fields = self._convert_to_tuple(table_schema.fields)
    self._converters = {
        'INTEGER': int,
        'INT64': int,
        'FLOAT': float,
        'BOOLEAN': _to_bool,
        'NUMERIC': _to_decimal,
        'BYTES': _to_bytes,
    }

  @classmethod
  def _convert_to_tuple(cls, table_field_schemas):
    """Recursively converts the list of TableFieldSchema instances to the
    list of tuples to prevent errors when pickling and unpickling"""
```

Review comment: This error was quite interesting. It seems that it's impossible to serialize and deserialize nested `bigquery.TableFieldSchema` instances:

```python
from apache_beam.internal import pickler
from apache_beam.io.gcp.internal.clients import bigquery

obj = bigquery.TableFieldSchema(fields=[bigquery.TableFieldSchema()])
pickler.loads(pickler.dumps(obj))
```

This snippet triggers the following exception: `AttributeError: 'FieldList' object has no attribute '_FieldList__field'`. My workaround was to rewrite every TableFieldSchema instance as an equivalent tuple, which can be serialized and deserialized without problems.

Issue Time Tracking --- Worklog Id: (was: 339983) Time Spent: 9.5h (was: 9h 20m)
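The tuple-rewriting workaround described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the `TableFieldSchema` class below is a hypothetical stand-in for the proto-backed `bigquery.TableFieldSchema`, since the point is only the recursive conversion to picklable namedtuples.

```python
import collections
import pickle

# Namedtuples pickle cleanly, unlike the proto-backed schema objects
# whose FieldList attribute breaks serialization.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class TableFieldSchema(object):
  """Hypothetical stand-in for bigquery.TableFieldSchema."""

  def __init__(self, fields=(), mode='NULLABLE', name='', type='STRING'):
    self.fields = list(fields)
    self.mode = mode
    self.name = name
    self.type = type


def convert_to_tuple(table_field_schemas):
  """Recursively rewrite schema objects as plain namedtuples."""
  return [
      FieldSchema(convert_to_tuple(f.fields), f.mode, f.name, f.type)
      for f in table_field_schemas
  ]


# A nested (RECORD) schema: the case that fails to pickle in its raw form.
nested = TableFieldSchema(
    name='record', type='RECORD',
    fields=[TableFieldSchema(name='n', type='INTEGER')])
fields = convert_to_tuple([nested])
roundtripped = pickle.loads(pickle.dumps(fields))  # no AttributeError
```

The converted structure survives a pickle round trip with full fidelity, including nesting, which is all the coder needs when it is shipped to workers.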
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339797=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339797 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 07/Nov/19 08:17
Start Date: 07/Nov/19 08:17
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550972444
Run Java PreCommit
Issue Time Tracking --- Worklog Id: (was: 339797) Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339446=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339446 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 16:10
Start Date: 06/Nov/19 16:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550380662
Run Java PreCommit
Issue Time Tracking --- Worklog Id: (was: 339446) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339447=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339447 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 16:10
Start Date: 06/Nov/19 16:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550380716
Run Python 2 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339447) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339391=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339391 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 14:48
Start Date: 06/Nov/19 14:48
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550342694
Run Python 2 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339391) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339328=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339328 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:45
Start Date: 06/Nov/19 12:45
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550293536
@pabloem Could you take a look once again? In addition to what you had suggested, I've added the functionality of removing the JSON file after it has been read.
Issue Time Tracking --- Worklog Id: (was: 339328) Time Spent: 8.5h (was: 8h 20m)
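The read-then-delete behavior mentioned above can be sketched, outside of Beam, roughly as follows. All names here are illustrative, not the PR's actual implementation; the point is only that the temporary export file is removed once its rows are in memory.

```python
import json
import os
import tempfile


def read_rows_and_remove(path):
  """Read all rows from an exported newline-delimited JSON file,
  then delete the file so temporary export results don't pile up."""
  with open(path) as f:
    rows = [json.loads(line) for line in f]
  os.remove(path)  # clean up once the contents are safely in memory
  return rows


# Illustrative usage: a throwaway file standing in for a BigQuery export.
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
  f.write('{"number": 1}\n{"number": 2}\n')
rows = read_rows_and_remove(path)
```

After the call, `rows` holds the decoded records and the file no longer exists.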
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339329=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339329 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:45
Start Date: 06/Nov/19 12:45
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550293679
Run Python 3.7 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339329) Time Spent: 8h 40m (was: 8.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339324=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339324 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:41
Start Date: 06/Nov/19 12:41
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r343072498

File path: sdks/python/apache_beam/io/gcp/bigquery.py (@@ -496,6 +505,233 @@), code under review:

```python
SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _BigQueryRowCoder(coders.Coder):
```

Review comment: I looked at this and, sadly, it doesn't work in practice. For each field, `TableRowJsonCoder` returns a `JsonValue` instance, which doesn't contain the name of the given field. This makes it quite hard to find the appropriate field type and match a conversion function. That's why I decided to stay with my own coder.

Issue Time Tracking --- Worklog Id: (was: 339324) Time Spent: 8h 20m (was: 8h 10m)
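What such a hand-rolled row coder boils down to can be sketched like this. The names and the standalone `decode_row` helper are illustrative, not the PR's actual `_JsonToDictCoder` code, and it assumes, as the converter table in the diff suggests, that exported scalar values arrive as JSON strings:

```python
import decimal
import json

# Per-type conversion table, mirroring the one shown in the diff above.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda value: value == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
  """Decode one exported JSON row. `schema` is a list of (name, type)
  pairs; fields with unknown types are passed through unchanged."""
  raw = json.loads(json_line)
  return {
      name: _CONVERTERS.get(field_type, lambda value: value)(raw[name])
      for name, field_type in schema
  }


row = decode_row(
    '{"number": "3", "ratio": "0.5", "ok": "true"}',
    [('number', 'INTEGER'), ('ratio', 'FLOAT'), ('ok', 'BOOLEAN')])
```

Because the schema carries the field names alongside the types, matching each value to its converter is straightforward here, which is exactly what the name-less `JsonValue` instances from `TableRowJsonCoder` made hard.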
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=336716=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-336716 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 31/Oct/19 10:35
Start Date: 31/Oct/19 10:35
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r341062618

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py, diff under review:

```diff
@@ -86,23 +117,61 @@ def create_table(self, tablename):
     table_schema.fields.append(table_field)
     table = bigquery.Table(
         tableReference=bigquery.TableReference(
-            projectId=self.project,
-            datasetId=self.dataset_id,
-            tableId=tablename),
+            projectId=cls.project,
+            datasetId=cls.dataset_id,
+            tableId=table_name),
         schema=table_schema)
     request = bigquery.BigqueryTablesInsertRequest(
-        projectId=self.project, datasetId=self.dataset_id, table=table)
-    self.bigquery_client.client.tables.Insert(request)
+        projectId=cls.project, datasetId=cls.dataset_id, table=table)
+    cls.bigquery_client.client.tables.Insert(request)
     table_data = [
         {'number': 1, 'str': 'abc'},
         {'number': 2, 'str': 'def'},
         {'number': 3, 'str': u'你好'},
         {'number': 4, 'str': u'привет'}
     ]
-    self.bigquery_client.insert_rows(
-        self.project, self.dataset_id, tablename, table_data)
+    cls.bigquery_client.insert_rows(
+        cls.project, cls.dataset_id, table_name, table_data)

-  def create_table_new_types(self, table_name):
+  def get_expected_data(self):
+    return [
+        {'number': 1, 'str': 'abc'},
+        {'number': 2, 'str': 'def'},
+        {'number': 3, 'str': u'你好'},
+        {'number': 4, 'str': u'привет'}
+    ]
+
+  @skip(['PortableRunner', 'FlinkRunner'])
+  @attr('IT')
+  def test_native_source(self):
+    with beam.Pipeline(argv=self.args) as p:
+      result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
+          query=self.query, use_standard_sql=True)))
+      assert_that(result, equal_to(self.get_expected_data()))
+
+  @skip(['DirectRunner', 'TestDirectRunner'])
+  @attr('IT')
+  def test_iobase_source(self):
+    with beam.Pipeline(argv=self.args) as p:
+      result = (p | 'read' >> beam.io.ReadFromBigQuery(
+          query=self.query, use_standard_sql=True, project=self.project,
+          gcs_bucket_name='gs://temp-storage-for-end-to-end-tests'))
```

Review comment (on the `gcs_bucket_name` line): Done: https://issues.apache.org/jira/browse/BEAM-8528

Issue Time Tracking --- Worklog Id: (was: 336716) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335807&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335807 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 22:55
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-547665389

thanks!

Worklog Id: (was: 335807)
Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335664&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335664 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 18:22
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340250638

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py

Review comment: I see. Can you file a bug for this?

Worklog Id: (was: 335664)
Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335564&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335564 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 15:25
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-547477815

@pabloem Thanks for your review. I've pushed the first batch of fixes. I'll keep you posted on further progress.

Worklog Id: (was: 335564)
Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335559 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 15:18
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340142519

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _BigQueryRowCoder(coders.Coder):

Review comment: Looks interesting. If I combine this coder with this function [1], maybe that would be the solution. I'll investigate this further; first I need to solve the problem that `bigquery.TableSchema` is unpicklable.

[1] https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L892-L920

Worklog Id: (was: 335559)
Time Spent: 7.5h (was: 7h 20m)
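The pickling workaround mentioned in this comment can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's actual code: it assumes the schema object exposes a `.fields` list whose items carry `fields`, `mode`, `name`, and `type` attributes, as `bigquery.TableSchema` does, and the helper name `to_picklable_fields` is invented here.

```python
import collections

# Picklable stand-in for the fields of a bigquery.TableSchema, mirroring the
# SchemaFields namedtuple that the PR introduces.
SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')


def to_picklable_fields(table_schema):
    """Translate an unpicklable schema object into plain namedtuples.

    Assumes `table_schema` exposes a `.fields` list whose items have
    `fields`, `mode`, `name` and `type` attributes.
    """
    return [SchemaFields(f.fields, f.mode, f.name, f.type)
            for f in table_schema.fields]
```

Because namedtuples defined at module level pickle cleanly, the converted field list can travel to workers even when the original schema message cannot.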
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335522 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:46
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339969300

File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py

@@ -695,10 +769,12 @@ def get_or_create_table(
   def run_query(self, project_id, query, use_legacy_sql, flatten_results,
                 dry_run=False):
-    job_id, location = self._start_query_job(project_id, query,
-                                             use_legacy_sql, flatten_results,
-                                             job_id=uuid.uuid4().hex,
-                                             dry_run=dry_run)
+    job = self._start_query_job(project_id, query, use_legacy_sql,

Review comment: Yes. It returns the whole job object because I needed its `statistics` property. See usage: https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery.py#L655-L658

Worklog Id: (was: 335522)
Time Spent: 7h 20m (was: 7h 10m)
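The refactoring discussed in this comment can be illustrated with a small, hypothetical sketch. The `QueryJob` class and its fields are invented here for illustration; the real code deals with BigQuery API job messages.

```python
class QueryJob:
    """Hypothetical stand-in for the job message the BigQuery API returns."""

    def __init__(self, job_id, location, statistics):
        self.job_id = job_id
        self.location = location
        self.statistics = statistics


def start_query_job():
    # The real _start_query_job issues a request to the BigQuery jobs API;
    # here we fabricate a job so the calling pattern can be shown.
    return QueryJob('job-123', 'US', {'totalBytesProcessed': 1024})


def run_query():
    # Before the change, only (job_id, location) came back from the helper;
    # returning the whole object makes job.statistics available as well.
    job = start_query_job()
    return job.job_id, job.statistics['totalBytesProcessed']
```

Returning the richer object avoids widening the helper's return tuple every time a caller needs one more job property.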
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335514&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335514 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:12
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340063953

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -1265,3 +1501,14 @@ def display_data(self):
         tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class ReadFromBigQuery(PTransform):
+  def __init__(self, *args, **kwargs):

Review comment: +1

Worklog Id: (was: 335514)
Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335513&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335513 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:11
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340063754

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _BigQueryRowCoder(coders.Coder):
+  """A coder for a table row (represented as a dict) from a JSON string which
+  applies additional conversions.
+  """
+
+  def __init__(self, table_schema):
+    # bigquery.TableSchema type is unpicklable so we must translate it to a
+    # picklable type
+    self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type)
+                   for x in table_schema.fields]
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  def decode(self, value):
+    value = json.loads(value)
+    for field in self.fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      try:
+        converter = self._converters[field.type]
+        value[field.name] = converter(value[field.name])
+      except KeyError:
+        # No need to do any conversion
+        pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  """Read data from BigQuery.
+
+  This source uses a BigQuery export job to take a snapshot of the table
+  on GCS, and then reads from each produced JSON file.
+
+  Do note that currently this source does not work with DirectRunner.
+
+  Args:
+    table (str, callable, ValueProvider): The ID of the table, or a callable
+      that returns it. The ID must contain only letters ``a-z``, ``A-Z``,
+      numbers ``0-9``, or underscores ``_``. If dataset argument is
+      :data:`None` then the table argument must contain the entire table
+      reference specified as: ``'DATASET.TABLE'``
+      or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one
+      argument representing an element to be written to BigQuery, and return
+      a TableReference, or a string table name as specified above.
+    dataset (str): The ID of the dataset containing this table or
+      :data:`None` if the table reference is specified entirely by the table
+      argument.
+    project (str): The ID of the project containing this table.
+    query (str): A query to be used instead of arguments table, dataset, and
+      project.
+    validate (bool): If :data:`True`, various checks will be done when source
+      gets initialized (e.g., is table present?). This should be
+      :data:`True` for most scenarios in order to catch errors as early as
+      possible (pipeline construction instead of pipeline execution). It
+      should be :data:`False` if the table is created during pipeline
+      execution by a previous step.
+    coder (~apache_beam.coders.coders.Coder): The coder for the table
+      rows. If :data:`None`, then the default coder is
+      :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`,
+      which will interpret every line in a file as a JSON serialized
+      dictionary. This argument needs a value only in special cases when
+      returning table rows as dictionaries is not desirable.
+    use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL
+      dialect for this query. The default value is :data:`False`.
+      If set to :data:`True`, the query will use BigQuery's updated SQL
+      dialect with improved standards compliance.
+      This parameter is ignored for table inputs.
+    flatten_results (bool): Flattens all nested and repeated fields in the
+      query results. The default value is :data:`True`.
+    kms_key (str): Experimental. Optional Cloud KMS key name for use when
+      creating new tables.
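The decoding behaviour quoted above can be distilled into a rough, self-contained sketch. The function name `decode_row` and the `(name, type)` field representation are simplifications introduced here, not the PR's API: a JSON line is parsed, typed values are converted, and schema fields absent from the row are restored as None, since the export job drops null fields.

```python
import decimal
import json

# Subset of the converters used by the coder above; BYTES handling is
# Python-version dependent and omitted in this sketch.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, fields):
    """Decode one exported JSON line; `fields` is a list of (name, type)."""
    row = json.loads(json_line)
    for name, field_type in fields:
        if name not in row:
            # The field is in the schema but not in this row: the export-to-
            # JSON job does not preserve null fields, so restore it as None.
            row[name] = None
            continue
        converter = _CONVERTERS.get(field_type)
        if converter is not None:
            # Exported values arrive as strings; convert to the typed value.
            row[name] = converter(row[name])
    return row
```

For example, `decode_row('{"number": "3", "flag": "true"}', [('number', 'INTEGER'), ('flag', 'BOOLEAN'), ('str', 'STRING')])` yields `{'number': 3, 'flag': True, 'str': None}`.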
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335467&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335467 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 10:31
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339997229

File path: sdks/python/apache_beam/io/gcp/bigquery.py
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335462&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335462 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 10:19
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339992040

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py

Review comment:
> Why are we skipping this for DirectRunner? This should work there, right?

Unfortunately, no. My solution doesn't work with the DirectRunner. The direct cause is that the `get_range_tracker` and `read` methods aren't implemented in my source (they raise a NotImplementedError exception). This is purposeful: the runner is expected to call `split` instead. See the Java implementation, which works the same way: https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java

It seems that the DataflowRunner and Flink are able to handle these exceptions somehow, while the DirectRunner is not.

Worklog Id: (was: 335462)
Time Spent: 6h 40m (was: 6.5h)
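The split-only behaviour described in this comment can be sketched with minimal, illustrative classes. These are not the actual Beam or BigQuery source classes; the names and the one-sub-source-per-exported-file layout are assumptions made for the sketch.

```python
class SingleFileSource:
    """Sub-source that does know how to read itself (one exported file)."""

    def __init__(self, path):
        self._path = path

    def read_records(self):
        # Stand-in for opening the file and decoding its JSON rows.
        yield 'record from %s' % self._path


class SplitOnlyBoundedSource:
    """Source that expects the runner to call split() and read the resulting
    sub-sources, leaving get_range_tracker()/read() unimplemented, as the
    BigQuery source discussed above does."""

    def __init__(self, exported_files):
        self._exported_files = exported_files

    def split(self, desired_bundle_size):
        # One readable sub-source per file produced by the export job.
        return [SingleFileSource(f) for f in self._exported_files]

    def get_range_tracker(self, start_position, stop_position):
        # A runner that calls this on the top-level source fails here,
        # which is the DirectRunner symptom described in the comment.
        raise NotImplementedError('This source only supports split()')

    def read(self, range_tracker):
        raise NotImplementedError('This source only supports split()')
```

A runner that drives the source through `split` never touches the unimplemented methods; one that insists on reading the top-level source directly hits the NotImplementedError.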
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335457&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335457 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 09:55
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339980081

File path: sdks/python/test-suites/portable/py37/build.gradle

@@ -30,3 +33,25 @@ task preCommitPy37() {
   dependsOn portableWordCountBatch
   dependsOn portableWordCountStreaming
 }
+
+task postCommitIT {
+  dependsOn 'installGcpTest'
+  dependsOn 'setupVirtualenv'
+  dependsOn ':runners:flink:1.8:job-server:shadowJar'
+
+  doLast {
+    def tests = [
+        "apache_beam.io.gcp.bigquery_read_it_test",
+    ]
+    def testOpts = ["--tests=${tests.join(',')}"]
+    def cmdArgs = mapToArgString([
+        "test_opts": testOpts,
+        "suite": "postCommitIT-flink-py37",
+        "pipeline_opts": "--runner=FlinkRunner --project=apache-beam-testing --environment_type=LOOPBACK",

Review comment: Yes, locally. I would phrase-trigger a PostCommit check, but it is getting aborted almost always: https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Python37/

Worklog Id: (was: 335457)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335452=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335452 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:45 Start Date: 29/Oct/19 09:45 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339975153 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: > Why are we skipping this for DirectRunner? This should work there, right? Issue Time Tracking --- Worklog Id: (was: 335452) Time Spent: 6h (was: 5h 50m)
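The `@skip([...])` decorator in these tests gates each test on the active pipeline runner. A minimal sketch of how such a decorator can work — the real `skip` in the Beam test suite reads the runner from pipeline options; the module-level `RUNNER` variable here is a hypothetical stand-in:

```python
import functools

RUNNER = 'DirectRunner'  # in Beam this comes from the pipeline options

def skip(runners):
    """Turn the decorated test into a no-op when RUNNER is in `runners`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if RUNNER in runners:
                print('skipping %s on %s' % (fn.__name__, RUNNER))
                return None
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@skip(['DirectRunner', 'TestDirectRunner'])
def test_iobase_source():
    return 'ran'

result = test_iobase_source()  # skipped: RUNNER is DirectRunner
```

The review question above is precisely about whether `test_iobase_source` should carry `DirectRunner` in that skip list at all.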
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335454=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335454 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:45 Start Date: 29/Oct/19 09:45 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339975250 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: > Why are we skipping this for DirectRunner? This should work there, right? Issue Time Tracking --- Worklog Id: (was: 335454) Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335449=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335449 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:43 Start Date: 29/Oct/19 09:43 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339974244 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): Review comment: +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 335449) Time Spent: 5h 40m (was: 5.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335442 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:32 Start Date: 29/Oct/19 09:32 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339969300 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -695,10 +769,12 @@ def get_or_create_table( def run_query(self, project_id, query, use_legacy_sql, flatten_results, dry_run=False): -job_id, location = self._start_query_job(project_id, query, - use_legacy_sql, flatten_results, - job_id=uuid.uuid4().hex, - dry_run=dry_run) +job = self._start_query_job(project_id, query, use_legacy_sql, Review comment: Yes. It returns the whole job object because I needed its `statistics` property. See usage: bigquery.py:655 Issue Time Tracking --- Worklog Id: (was: 335442) Time Spent: 5.5h (was: 5h 20m)
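The change discussed in that hunk replaces returning an unpacked `(job_id, location)` tuple with returning the whole job response, so callers can also reach fields such as `statistics` without a second API call. The trade-off can be shown with plain objects — the `Job`/`JobReference` namedtuples below are stand-ins for the BigQuery API messages, not the real client types:

```python
from collections import namedtuple

# Stand-ins for the BigQuery API job messages (illustrative shapes only).
JobReference = namedtuple('JobReference', ['jobId', 'location'])
Job = namedtuple('Job', ['jobReference', 'statistics'])

def start_query_job():
    """Stand-in for _start_query_job returning the full insert response."""
    return Job(jobReference=JobReference('job-123', 'US'),
               statistics={'totalBytesProcessed': 1024})

job = start_query_job()
# Callers that only needed the id/location still have them...
job_id, location = job.jobReference.jobId, job.jobReference.location
# ...and callers like the source's size estimation can now read statistics.
bytes_processed = job.statistics['totalBytesProcessed']
```

Returning the richer object keeps the wrapper's signature stable as more job fields become interesting, at the cost of making every call site reach through `jobReference`.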
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335440=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335440 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:30 Start Date: 29/Oct/19 09:30 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339962428 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -370,7 +383,37 @@ def _start_query_job(self, project_id, query, use_legacy_sql, flatten_results, jobReference=reference)) response = self.client.jobs.Insert(request) -return response.jobReference.jobId, response.jobReference.location +return response + + def wait_for_bq_job(self, job_reference, sleep_duration_sec=5, + max_retries=60): +"""Poll job until it is DONE. + +Args: + job_reference: bigquery.JobReference instance. + sleep_duration_sec: Specifies the delay in seconds between retries. + max_retries: The total number of times to retry. If equals to 0, +the function waits forever. + +Raises: + `RuntimeError`: If the job is FAILED or the number of retries has been +reached. +""" +retry = 0 +while True: + retry += 1 + job = self.get_job(job_reference.projectId, job_reference.jobId, + job_reference.location) + logging.info('Job status: %s', job.status.state) + if job.status.state == 'DONE' and job.status.errorResult: +raise RuntimeError("BigQuery job %s failed. Error Result: %s", Review comment: Oh, you're right. In that case maybe I'll use the `format` method This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 335440) Time Spent: 5h 20m (was: 5h 10m)
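The review point being conceded here is that `RuntimeError("... %s ...", job_id)` does not interpolate: logging-style `%s` substitution only happens inside `logging` calls, while the exception constructor just stores every argument in `args`. A quick demonstration, with the `str.format` fix the author mentions:

```python
# %-style placeholders are NOT filled in by the exception constructor;
# both strings end up verbatim in err.args.
err = RuntimeError("BigQuery job %s failed. Error Result: %s",
                   "job-123", "quotaExceeded")

# What the comment suggests instead: build the message first.
msg = "BigQuery job {} failed. Error Result: {}".format(
    "job-123", "quotaExceeded")
err2 = RuntimeError(msg)
```

With multiple args, `str(err)` renders the whole args tuple (placeholders included), which is why the un-formatted version produces confusing error text.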
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333650=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333650 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338724547 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
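The coder in the hunk above does two things when decoding an exported JSON row: it backfills schema fields that the export job dropped (null values are not written to the JSON files), and it converts string-encoded values according to their BigQuery type. A standalone sketch of that decode step — the converter table mirrors the diff, while `decode_row` is a simplified stand-in for `_BigQueryRowCoder.decode`:

```python
import decimal
import json
from collections import namedtuple

# Minimal stand-in for the schema entries the coder iterates over.
SchemaField = namedtuple('SchemaField', 'name type')

# Per-type converters, as in the diff; STRING fields need no conversion.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}

def decode_row(line, fields):
    value = json.loads(line)
    for field in fields:
        if field.name not in value:
            # In the schema but absent from the row: the extract-to-JSON job
            # does not preserve null fields, so restore them as None.
            value[field.name] = None
            continue
        converter = _CONVERTERS.get(field.type)
        if converter is not None:
            value[field.name] = converter(value[field.name])
    return value

fields = [SchemaField('number', 'INTEGER'),
          SchemaField('active', 'BOOLEAN'),
          SchemaField('note', 'STRING')]
row = decode_row('{"number": "3", "active": "true"}', fields)
```

Dispatching through a dict of converters keeps the hot decode loop a single lookup per field instead of an if/elif chain over type names.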
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333648=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333648 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338725043 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1265,3 +1501,14 @@ def display_data(self): tableSpec) res['table'] = DisplayDataItem(tableSpec, label='Table') return res + + +@experimental() +class ReadFromBigQuery(PTransform): + def __init__(self, *args, **kwargs): Review comment: Since `ReadFromBigQuery` is the user-facing transform, it should have all the Pydoc. That being said, I'll defer @chamikaramj whether we want to expose this already. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 333648) Time Spent: 4h 50m (was: 4h 40m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 50m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. 
Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333649=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333649 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338706927 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
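The decoding logic quoted in the diff above (per-type converters applied to each schema field, with missing fields treated as NULL because the JSON export job drops null fields) can be sketched in isolation. This is a hypothetical re-implementation for illustration, not the actual `_BigQueryRowCoder`; `decode_row` and `_CONVERTERS` are made-up names:

```python
import decimal
import json

# Per-type converters mirroring those in the quoted diff.  BigQuery's JSON
# export serializes integers and booleans as strings ('7', 'true'), so each
# value arrives as a str and is converted based on its schema type.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
    'BYTES': lambda v: v.encode('utf-8'),
}


def decode_row(line, schema):
    """Decode one exported JSON line into a dict, applying type conversions.

    `schema` is a list of (name, bq_type) pairs.  A field present in the
    schema but absent from the JSON line is set to None, since the extract
    job does not preserve null fields.  Types with no converter (e.g.
    STRING) are left untouched.
    """
    row = json.loads(line)
    for name, bq_type in schema:
        if name not in row:
            row[name] = None
            continue
        converter = _CONVERTERS.get(bq_type)
        if converter is not None:
            row[name] = converter(row[name])
    return row
```

The `dict.get` lookup plays the role of the `try/except KeyError` in the quoted code: unknown types simply pass through unconverted.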
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333651=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333651 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338724727 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333652=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333652 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338726341 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): Review comment: I feel silly - but didn't we have a coder for this? I see that `TableRowJsonCoder` does not do the full tablerow to dict conversion... does it make sense to extend it? Or not really? [1] https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/io/gcp/bigquery.py#L312-L349 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 333652) Time Spent: 5h (was: 4h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333647=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333647 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338706242 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332793=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332793 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337775537 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): Review comment: Make this a constant, and use it to create the table, and to match in the asserts. ``` TABLE_DATA = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332793) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 40m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
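The reviewer's suggestion above, keeping the test data in a single class-level constant used both to populate the table and to build the expected output, removes the duplication between `create_table` and `get_expected_data`. A minimal sketch of that pattern (the class and method names are stand-ins for the real integration-test class; `insert_rows` abstracts the BigQuery client call):

```python
# -*- coding: utf-8 -*-


class BigQueryReadTests(object):
    # Single source of truth: used both to populate the table during setup
    # and to build the expected output for the pipeline assertions.
    TABLE_DATA = [
        {'number': 1, 'str': 'abc'},
        {'number': 2, 'str': 'def'},
        {'number': 3, 'str': u'\u4f60\u597d'},
        {'number': 4, 'str': u'\u043f\u0440\u0438\u0432\u0435\u0442'},
    ]

    @classmethod
    def populate_table(cls, insert_rows):
        # insert_rows stands in for bigquery_client.insert_rows(project,
        # dataset_id, table_name, rows); it receives the shared constant.
        insert_rows(cls.TABLE_DATA)

    def expected_data(self):
        # The assertion side reads the same constant, so the expected rows
        # can never drift out of sync with the inserted rows.
        return self.TABLE_DATA
```

With this shape, `assert_that(result, equal_to(self.expected_data()))` and the setup code are guaranteed to agree on the data.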
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332789=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332789 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337778293 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: Why are we skipping this for DirectRunner? This should work there, right? `gcs_bucket_name` may need to be passed testpipeline arguments, in case it runs in a project that does not have access to that bucket (we run it internally at Google). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332789) Time Spent: 4.5h (was: 4h 20m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. 
> [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
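The `@skip(['PortableRunner', 'FlinkRunner'])` / `@skip(['DirectRunner', 'TestDirectRunner'])` decorators quoted in the test diff above gate each integration test on the runner under test, which is what the reviewer is questioning for the DirectRunner case. A hypothetical sketch of how such a runner-conditional skip decorator can work (this is a stand-in for illustration, not Beam's actual helper; `runner_name` is an assumed attribute on the test class):

```python
import functools
import unittest


def skip_on_runners(runners):
    """Skip the decorated test when the active runner is in `runners`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            # The real helper would inspect the test pipeline options;
            # here we read an assumed `runner_name` attribute instead.
            runner = getattr(self, 'runner_name', None)
            if runner in runners:
                raise unittest.SkipTest('not supported on %s' % runner)
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```

Under this scheme, dropping `'DirectRunner'` from the list (as the reviewer suggests, provided the GCS bucket is passed via test pipeline arguments) would let the test run there without any other change.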